Making Measurements of Risk
One of the challenges I’ve experienced in testing a variety of systems is determining whether one system is “better” than the other. Obviously there are questions of timeframe, risk objective, and style, but even having narrowed it down to those specifics, there were often times that I had to choose between two or more systems. Just recently I have finalized a system to track, but it was only after exploring a variety of metrics to make the decision between two choices. It occurs to me that walking through that decision might make a worthwhile explanation of what risk and return metrics I find useful in developing systems.
I have two trading plans with nine years of backtested results. The first plan, Plan A, has a standard deviation (StDev) of annual returns equal to 45.1%, while Plan B, the second plan, has a standard deviation of annual returns equal to 27.9%. Plan A looks a lot riskier! Ah, but what are the average returns in relation to the variability? Is the extra risk of Plan A versus Plan B worth it, given the returns that Plan A and Plan B generate?
Average Annual Gain (AAG) is just the gain for each year, summed, and divided by the number of years.
Plan A has an average annual gain of 43.8%.
Plan B has an average annual gain of 36.2%.
CAGR is Compound Annual Growth Rate. Equity at the end is divided by equity at the beginning, and some exponential math is done to see what kind of annualized growth rate would be needed to go from beginning to end in the number of years observed. Usually this number is lower than the average annual gain, because it takes extra gain to recover from losses. A 10% loss demands an 11.1% gain to recover, a 20% loss demands a 25% gain, etc.
Plan A has a CAGR of 39.0%.
Plan B has a CAGR of 34.2%.
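A minimal sketch of the two calculations above, assuming yearly returns are stored as fractions. The `yearly_returns` list is made-up illustrative data, not the actual Plan A or Plan B results, and the function names are hypothetical.

```python
# Hypothetical yearly returns as fractions -- illustrative only,
# NOT the Plan A or Plan B backtest data.
yearly_returns = [0.50, -0.20, 0.30, 0.10]

def average_annual_gain(returns):
    """Arithmetic mean: sum the yearly gains, divide by the number of years."""
    return sum(returns) / len(returns)

def cagr(returns):
    """Constant yearly growth rate that would produce the same ending equity."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r  # running ratio of ending equity to beginning equity
    return growth ** (1.0 / len(returns)) - 1.0

def gain_to_recover(loss):
    """Gain needed to get back to even after a fractional loss."""
    return 1.0 / (1.0 - loss) - 1.0

aag = average_annual_gain(yearly_returns)
growth_rate = cagr(yearly_returns)
print(f"AAG  = {aag:.1%}")
print(f"CAGR = {growth_rate:.1%}")
print(f"A 10% loss needs a {gain_to_recover(0.10):.1%} gain to recover")
print(f"A 20% loss needs a {gain_to_recover(0.20):.1%} gain to recover")
```

With any losing year in the mix, the CAGR comes out below the AAG, which is the effect described above.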
Is Plan A’s larger return worth the “risk” in my opinion?
One measure of risk-adjusted return is the CAGR/StDev.
Plan A has a CAGR/StDev of 0.865.
Plan B has a CAGR/StDev of 1.227.
In terms of this measure, Plan B is better. However, the CAGR/StDev isn’t always the best measurement. Imagine investing in T-Bills or CDs; the annual return on cash isn’t very high, but the “risk” is negligible. When a return in the low single-digits is divided by a standard deviation that doesn’t move much, the CAGR/StDev goes through the roof! So while I like the CAGR/StDev measurement, I know that I need some backdrop of target returns that need to be exceeded in order for me to use it, or another measurement altogether, that takes into account some benchmark. In this case, with CAGRs well above 30%, it may be appropriate by itself.
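As a rough check, the ratio can be recomputed from the rounded figures quoted above. The cash example is a hypothetical illustration of the T-Bill problem (the 0.5% standard deviation is an assumed number), and small differences from the quoted ratios are presumably down to rounding of the inputs.

```python
def cagr_over_stdev(cagr, annual_stdev):
    """Risk-adjusted return: growth rate per unit of annual variability."""
    return cagr / annual_stdev

# Rounded figures quoted in this post.
plan_a = cagr_over_stdev(0.390, 0.451)
plan_b = cagr_over_stdev(0.342, 0.279)

# Hypothetical cash-like investment: low return, but almost no variability.
# The 0.5% standard deviation is an assumption for illustration only.
cash = cagr_over_stdev(0.045, 0.005)

print(f"Plan A: {plan_a:.3f}")
print(f"Plan B: {plan_b:.3f}")
print(f"Cash:   {cash:.3f}")
```

The cash ratio dwarfs both plans even though nobody would call a 4.5% return "better" than 34%, which is why the ratio needs some minimum return as a backdrop.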
I talked twice before about the meaninglessness of Alpha and Beta outside of their regression equations, but now it’s time to really put them in their proper place. To measure these, as well as to measure the Sharpe ratio, I need definitions for a Risk Free Rate of Return (RF), Excess Return (ER), and a Benchmark.
The benchmark is one place where analysts fall into trouble. The right benchmark for a U.S.-traded stock plan is probably some index of stocks traded in the U.S., maybe the S&P 500, maybe the NYSE composite, who knows. The right benchmark for global macro trading plans is probably some global asset allocation scheme, the benchmark for merger arbitrage hedge funds should be some index of merger arbitrage funds, etc. In other words, the benchmark should be a rather inclusive subset or index of the universe of securities that the trading plan samples from. Improper benchmarking can make a plan look better than it is! For my purposes, both Plan A and Plan B are based on U.S.-traded stocks and are benchmarked to the S&P 500.
The Risk Free Rate of Return (RF) is just the return on cash. For giggles, I’m using 4.5% annually.
Excess Return (ER) is the return of a plan that exceeds what one could get from cash. I take each year’s return and subtract 4.5% to obtain the ER for that year, for that plan. I also need to do this for the Benchmark, in this case, the S&P 500.
For each plan to be evaluated, I need to do a simple linear regression, with the Excess Returns (ER) of the Benchmark as the X variable, and the ER of the plan as the Y variable. The slope of the regression line is Beta. The Y-axis intercept, that is, the value the regression line takes when the ER of the Benchmark is zero, is Alpha.
Outside of the above definition, Alpha and Beta are meaningless. They are a tool for evaluating returns generated by a money manager or by a system, nothing more, nothing less.
Plan A has Alpha of 34.4% and Beta of 1.770.
Plan B has Alpha of 30.1% and Beta of 0.582.
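The regression described above can be sketched with ordinary least squares in plain Python. The return series here are hypothetical illustrative numbers (not the Plan A, Plan B, or actual S&P 500 data), and `excess` and `alpha_beta` are names made up for this sketch.

```python
RISK_FREE = 0.045  # the 4.5% annual return on cash used in this post

# Hypothetical yearly returns -- illustrative only.
plan_returns = [0.60, 0.25, -0.05, 0.45, 0.10]
benchmark_returns = [0.20, 0.10, -0.10, 0.25, 0.05]

def excess(returns, rf=RISK_FREE):
    """Excess Return: each year's return minus the risk-free rate."""
    return [r - rf for r in returns]

def alpha_beta(plan_er, bench_er):
    """Least-squares fit of plan ER (Y) on benchmark ER (X).

    Beta is the slope; Alpha is the intercept, i.e. the fitted plan ER
    when the benchmark ER is zero.
    """
    n = len(plan_er)
    mean_x = sum(bench_er) / n
    mean_y = sum(plan_er) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bench_er, plan_er))
    sxx = sum((x - mean_x) ** 2 for x in bench_er)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

alpha, beta = alpha_beta(excess(plan_returns), excess(benchmark_returns))
print(f"Alpha = {alpha:.1%}, Beta = {beta:.3f}")
```

With only a handful of yearly observations, the fitted Alpha and Beta are rough estimates at best.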
Neither of these plans has any judgment applied in the selection of stocks or in the timing of entry and exit, i.e., they are both completely mechanical in nature. The significance of the Alpha and Beta measurements is simply that both plans show a wide margin of outperformance relative to the benchmark, but Plan A, with its higher Beta, tends to outperform even more when the S&P 500 is having a good year, and tends to do a lot worse when the S&P 500 is having a bad year.
The Sharpe ratio is another tool for measuring risk-adjusted returns. The calculation is pretty simple, just take the straight average of several years’ Excess Return (ER) and divide it by the standard deviation (StDev) of the Excess Return (ER). It’s very similar to the CAGR/StDev ratio, except that Sharpe takes into account the Risk Free Rate of Return (RF) as a benchmark.
What does the Sharpe ratio tell me at some particular Risk Free Rate of Return (RF)? In terms of the statistic itself, it’s really a Z-Test for how likely it is that the observed returns are significantly different from Risk Free. Higher numbers are better. Another way of looking at it is by saying that, if my Sharpe is high at a certain Risk Free Rate of Return (RF), then I could borrow tons of money from my broker at that rate and execute this plan profitably with no worries at all! Rrrriiiigggghtt! Tell me how well that works at 20:1 leverage, OK?
Plan A has Sharpe of 0.871 at RF = 4.5%.
Plan B has Sharpe of 1.139 at RF = 4.5%.
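A minimal sketch of the Sharpe calculation as described above: average Excess Return divided by the standard deviation of Excess Return. The yearly returns are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the post does not say which form was used.

```python
from statistics import mean, stdev

RISK_FREE = 0.045  # 4.5% annual return on cash, as used in this post

def sharpe(returns, hurdle=RISK_FREE):
    """Mean excess return divided by the standard deviation of excess return.

    Uses the sample standard deviation; that choice is an assumption.
    """
    excess_returns = [r - hurdle for r in returns]
    return mean(excess_returns) / stdev(excess_returns)

# Hypothetical yearly returns -- illustrative only.
yearly_returns = [0.60, 0.25, -0.05, 0.45, 0.10]
print(f"Sharpe (hypothetical data) = {sharpe(yearly_returns):.3f}")

# Subtracting a constant does not change the standard deviation, so the
# ratio can also be approximated from summary figures. Using Plan A's
# quoted AAG (43.8%) and StDev (45.1%):
plan_a_sharpe = (0.438 - RISK_FREE) / 0.451
print(f"Plan A Sharpe from summary figures = {plan_a_sharpe:.3f}")
```

The shortcut works because the risk-free rate is treated as a constant here; with a rate that changes each year, the standard deviation of the excess returns would have to be computed directly.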
Now, is the Risk Free Rate of Return (RF) an appropriate benchmark? I mean, would I ever seriously consider holding cash as a viable alternative in this world of fiat currency inflation? Well, maybe not, but I could always use some other benchmark, say a targeted minimum return annually instead of the RF to determine the scores of my plans. Suppose I wanted to compound at 2% monthly. My target annual return is then 26.8%.
Plan A has Sharpe of 0.377 at Target = 26.8%.
Plan B has Sharpe of 0.338 at Target = 26.8%.
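The target arithmetic can be sketched as follows, using the rounded AAG and StDev figures quoted earlier in this post. The function name is hypothetical, and small differences from the quoted ratios are presumably due to rounding of the inputs.

```python
# 2% per month, compounded over twelve months.
monthly_target = 0.02
annual_target = (1.0 + monthly_target) ** 12 - 1.0  # roughly 26.8%

def sharpe_vs_target(aag, annual_stdev, target):
    """Sharpe-style ratio with a target return in place of the risk-free rate."""
    return (aag - target) / annual_stdev

plan_a = sharpe_vs_target(0.438, 0.451, annual_target)
plan_b = sharpe_vs_target(0.362, 0.279, annual_target)

print(f"Annual target = {annual_target:.1%}")
print(f"Plan A = {plan_a:.3f}, Plan B = {plan_b:.3f}")
```

Raising the hurdle from 4.5% to 26.8% flips the ranking, because Plan B's smaller average gain leaves it with less room above the target.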
Drawdown (DD) occurs when the equity of a system falls below where it once was. I think of an index making a new high and then a correction, and the period of time where the index is below a prior high is the drawdown (DD). Drawdowns have magnitude (how much equity is lost) and duration, and a system or index could spend much of its life in a drawdown. Not counting dividends, the S&P 500 was in drawdown from 2000 until 2007, and at one point was down almost 50% from its high!
Plan A has a maximum DD of 22.9% with 56.0% of months spent in DD.
Plan B has a maximum DD of 20.3% with 43.1% of months spent in DD.
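A minimal sketch of both drawdown measurements, assuming a list of month-end equity values. The equity curve is hypothetical, and counting a month that merely matches the prior high as "not in drawdown" is an assumption of this sketch.

```python
def drawdown_stats(equity):
    """Return (maximum drawdown, fraction of periods spent in drawdown).

    A period is in drawdown when equity is below its highest prior value.
    """
    peak = equity[0]
    max_dd = 0.0
    periods_in_dd = 0
    for value in equity:
        if value >= peak:
            peak = value  # new (or matched) high: not in drawdown
        else:
            periods_in_dd += 1
            max_dd = max(max_dd, 1.0 - value / peak)  # magnitude below the high
    return max_dd, periods_in_dd / len(equity)

# Hypothetical month-end equity values -- illustrative only.
equity_curve = [100, 110, 105, 99, 108, 112, 120, 114, 118, 125]
max_dd, pct_in_dd = drawdown_stats(equity_curve)
print(f"Max DD = {max_dd:.1%}, time in DD = {pct_in_dd:.1%}")
```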
Yet another methodology would be to take the CAGRs generated above, and divide them by the maximum drawdowns.
Plan A has a CAGR/DD of 1.700.
Plan B has a CAGR/DD of 1.684.
I know me, and inevitably, I am going to examine the returns of any system I use over short time frames. It’s nice to have a system that consistently generates positive returns and beats the S&P 500 on a month-to-month basis.
Plan A beats the index 62.9% of the time, with 68.1% of months positive.
Plan B beats the index 66.4% of the time, with 71.6% of months positive.
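These two consistency measures are simple counts over paired monthly returns. A sketch with hypothetical monthly data (not the actual plan or S&P 500 returns) and a hypothetical function name:

```python
def consistency(plan_monthly, index_monthly):
    """Return (fraction of months beating the index, fraction of months positive)."""
    n = len(plan_monthly)
    pct_beat = sum(1 for p, i in zip(plan_monthly, index_monthly) if p > i) / n
    pct_positive = sum(1 for p in plan_monthly if p > 0) / n
    return pct_beat, pct_positive

# Hypothetical monthly returns -- illustrative only.
plan_monthly = [0.03, -0.01, 0.05, 0.02, -0.04, 0.01, 0.06, -0.02]
index_monthly = [0.01, 0.00, 0.02, 0.03, -0.05, 0.02, 0.01, -0.03]

pct_beat, pct_positive = consistency(plan_monthly, index_monthly)
print(f"Beats index {pct_beat:.1%} of months, positive {pct_positive:.1%} of months")
```

Note that a month can beat the index while still losing money, so the two percentages measure different things.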
Most of the above statistics are compiled with non-overlapping full years of returns, but there’s something else worth looking at, in my opinion. For every possible starting point of executing the system, what is the best 12-month result in testing, and what is the worst result?
Plan A had a best of +171.6% and a worst of -14.1%.
Plan B had a best of +99.0% and a worst of -11.7%.
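One way to sketch this is to compound every overlapping 12-month window of monthly returns and keep the best and worst. The monthly returns below are hypothetical, and the function name is made up for illustration. With 108 months of data there would be 97 such windows.

```python
def rolling_window_extremes(monthly_returns, window=12):
    """Best and worst compounded return over every overlapping window."""
    if len(monthly_returns) < window:
        raise ValueError("need at least one full window of returns")
    results = []
    for start in range(len(monthly_returns) - window + 1):
        growth = 1.0
        for r in monthly_returns[start:start + window]:
            growth *= 1.0 + r  # compound the months in this window
        results.append(growth - 1.0)
    return max(results), min(results)

# Hypothetical monthly returns (14 months, so three 12-month windows).
monthly_returns = [0.04, -0.02, 0.03, 0.01, -0.05, 0.06, 0.02,
                   -0.01, 0.03, 0.05, -0.03, 0.02, 0.07, -0.04]
best, worst = rolling_window_extremes(monthly_returns)
print(f"Best 12 months = {best:+.1%}, worst 12 months = {worst:+.1%}")
```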
It’s interesting to note that none of the above metrics really give me a definitive answer to my question, that is, is the extra risk of Plan A versus Plan B worth it, given the higher returns that Plan A generates? In my book, I tend towards the simpler metrics. I also tend to see testing metrics that are relatively close as being approximately equal. Thus, I will present my “scorecard” for debate, knowing full well that it will be debated by my readers.
I think that Plan A has a strong lead in AAG, but not so much in the CAGR category, where I think that Plan A has a very slight lead. If I had used the last seven years of backtested results to evaluate the systems, instead of the last nine years, Plan B would have the lead in both measures. Given the inherent randomness of “only” nine years, I don’t count this difference as being very large, but I really like the high numbers generated. It’s interesting to note that CAGR is a measure of risk-adjusted return when compared to AAG, since the amount of gain it takes to recover from losses results in CAGR being lower than AAG somewhat in proportion to the StDev of returns.
I like the advantage of Plan B in the CAGR/StDev metric, and the CAGR is large enough that I don’t feel the need to adjust for a benchmark, so I discard Alpha, Beta, and Sharpe in my analysis. I recognize that Beta and Sharpe are excellent “selling points” if an institution were examining these plans, but I also think they’d be much more interested in how much money could be run through them – which is another discussion entirely!
I consider the Max DDs as equal, but like the fact that Plan B spent fewer months in DD.
I think that Plan B has a slight edge in terms of consistently eking out positive months and months that outperform the index. If I were to ever need a plan to trade for a living, that is, make constant small withdrawals from trading equity in order to pay bills, etc., I would favor consistency, if long-term returns were equal.
Plan B has an edge in its “worst 12 months” being better than Plan A’s. This comes at the expense of forgoing the windfall profits of Plan A’s best year.
For my money, I consider Plan B to be superior to Plan A in risk-adjusted return, based on the backtested results.
There is one, usually unmentioned, assumption in analyzing returns. Generally it is assumed that the sample of returns the system came from is somehow indicative of the possible future returns. The standard disclaimer applies, past results are no guarantee of future returns, etc., blah blah blah, yada yada yada. I would be suspicious of returns from only a bull market or a bear market, but these systems have nine years of data from 1997 to 2007 and cycle through 100+ round trips per year, holding 20 positions at a time. Is that good enough? I dunno. Backtesting is also prone to errors: in the data files used, in entry/exit and expense assumptions, and in omissions such as dividends, which I don’t have in the returns. There are a host of possible errors involved. The point isn’t really about Plan A or Plan B, even though both exist, and I will be tracking one of them at my new site, The Rempel Report at billrempel.com. No, the point is that there are lots of different ways to evaluate returns, and I felt that my readers might want to see them explained in the context in which I use them.
As regular readers know, I’ve moved my personal trades and the tracking of specific, actionable trading plans to the new site, The Rempel Report at billrempel.com. The new site won’t link out, except in the context of a post’s content, and is reserved for actionable trading items only! I expect to make 2-3 posts a week there, primarily on the weekends, unless something drastic happens in the markets during the week. Please visit, and if you like what you see, you can register to receive email updates with each new post, leave comments, and join the discussion!


November 19th, 2007 at 8:31 pm
You neglect to mention if your estimates of alpha and beta are significant. I’m guessing that with your yearly data at least the alpha will be insignificant. With monthly data (108 vs. 9 observations), you might get better results and at least you wouldn’t be violating the normality assumption in linear regression.
November 19th, 2007 at 8:57 pm
Thanks for reading and thanks for commenting, John!
With nine observations, they most likely wouldn’t be. I didn’t bother testing for the significance of the Alpha and Beta, mostly because I am not a fan of the measures, partially because they’ve been so bloviated about philosophically, and partially because I don’t find them relevant. I present them for discussion - I don’t use them in my own evaluations.
From what I’ve gathered, the convention is to use annual returns for Alpha, Beta, and Sharpe calculations. It’s hard to tell, because most presentations and presenters are pretty sloppy in terms of listing unit times in their measures. I prefer monthly data for most measures! CAGR, trimmed mean, or median monthly returns in relation to StDev of monthly returns is a great first screen for systems, as is a monthly Sharpe with a target return, or the SPX return if it’s a stock screen, instead of the risk free rate. Percent positive and percent beat a benchmark are good monthly measures of consistency IMO, especially when choosing between roughly equal returns.
November 20th, 2007 at 4:48 pm
Good post Bill.
November 20th, 2007 at 7:29 pm
Glad you like it, hope it’s useful!
November 22nd, 2007 at 1:45 pm
I did go back and re-read this. I see what you mean and have to agree that B would be my choice also. My psychological makeup will give on the profit side for the lower DD components. I also agree that the time period is very important. Have you tried to vary the initial conditions to evaluate a best/worst period performance? Say pick the seven years that encompass a bull period at the start and end and make up the majority of the time period? And maybe start with a bear period and end with a bear period with the bull in the middle? I know that rainfall patterns sure do change the results in drainage systems. Vary the hyetograph and you get totally different performance with the same amount of rainfall. Vary whether you run the storm up the basin vs down and the same thing, totally different results. As with every indeterminate system, the initial conditions will invariably rule the results. Thanks for the pointer to this commentary, it made me think. I can see that for me, drawdowns are potentially the cause of biggest loss of sleep if I don’t get a handle on it.
November 28th, 2007 at 11:05 am
I know you can’t include every risk measure out there, but is there any reason you didn’t include the Sortino Ratio?
November 28th, 2007 at 6:05 pm
I didn’t mention the Sortino ratio in this post because I haven’t used it yet, and I was sticking with stuff I had done. I did come across his site, Sortino.com, in some reading this week, however, and liked most of the concepts.
A target return, termed minimum acceptable return (MAR), is used in a formula similar to the Sharpe ratio’s formula. I actually worked an example with this in the post, the one with 2% compounded monthly as a goal. However, in the Sortino ratio, only the downside variance is considered in the measurement of risk. I had read about this concept before knowing it had originated with (or been credited to) Sortino, but I haven’t done any work with it yet.
I am whole-heartedly in agreement that these are two good concepts, even though I’ve only gotten around to playing with one of them so far.
[Rant Alert]
More recently, according to his site, it seems that Sortino is performing style analysis on a manager, and then using a long period of “style returns” (blended for that manager’s style blend) to do a bootstrap analysis, for determining the risk of that manager. I don’t like this benchmarking concept for all purposes. In the evaluation of a low-turnover U.S.-based equity fund, it probably makes sense, as long as the fund has a fairly consistent style. “Style drift” could confound it, i.e., what’s the right style blend to bootstrap? Today’s style, the average of past styles in the evaluation period, some estimate of future styles? What’s the style of a merger arb strategy? Or a stat arb strategy? Or a carry trade strategy? Hmm?
Some observers have suggested finding (probably through regression analysis) the blend of equity styles that is closest to the strategy in question. It occurs to me that there might not be a good fit at all for some systems. The “your style is what it is equivalent to” answer is more than a bit unsatisfying to me.