Monday, 25 May 2015

Accounting for Data Mining Bias

I've recently subscribed to this forexfactory thread, which is about using machine learning to develop trading systems, and the subject of data mining/data dredging has come up. This post is a short description of how mining/dredging can be accounted for. Readers should be aware that what follows is not a precise description of any particular test with accompanying code, but rather a hand-wavy description of a general but important principle.

Suppose one has conducted a whole series of tests on a particular set of data with a view to developing a trading system. The precise nature of this is not really important - it could be some machine learning approach, a grid search of moving average crossover parameter values, a series of elimination contests to find the "best" indicator, or whatever. While doing this we keep a record of all our results and when the search is complete we plot a histogram thus:-
which is the result of 160,000 distinct tests plotted in 200 bins. Naturally, having done this, we select the best system found, represented by the vertical cursor line at x-axis value 5.2. This 5.2 is the value of our test metric of choice, be it Sharpe ratio, win to loss ratio, whatever. But then we ask ourselves whether we have truly found a world beating system, or whether this discovery is merely the result of data mining.

To test this, we create a random set of data which has the same attributes as the real data used above. The random data can be obtained by bootstrapping, random permutation, application of a Markov chain with state spaces derived from the original data etc. The actual choice of which to use will depend on the null hypothesis one wants to test. Having obtained our random data set, we then perform the exact same search as we did above and record the test metric of the best performing system found on this random data set. We repeat this on 160,000 freshly generated random data sets and then plot a histogram (in red) of the best test result from each of them:-
We find that this red distribution has a mean value of 0.5 and a standard deviation of 0.2. What it represents is the ability/power of our machine learning algo, grid search criteria etc. to uncover "good" systems even in meaningless data, where all relationships are, in effect, spurious and have no predictive ability.
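The procedure for building the red distribution can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the figures in this post: the "search" is a hypothetical grid of simple momentum rules, the metric is a mean/standard deviation ratio of the rule's returns, the null data come from random permutation, and the "real" returns are placeholder random numbers. All function and variable names are made up for the example.

```python
import numpy as np

def best_metric(returns, lookbacks):
    """Run a hypothetical rule search and return the best metric found.

    Each rule goes long/short according to the sign of the sum of the
    previous n returns; the metric is mean/std of the rule's returns.
    """
    returns = np.asarray(returns, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(returns)))
    best = -np.inf
    for n in lookbacks:
        # Sum of the n returns strictly before each bar (no look-ahead).
        past_sum = csum[n:-1] - csum[:-n - 1]
        strat = np.sign(past_sum) * returns[n:]
        sd = strat.std()
        if sd > 0:
            best = max(best, strat.mean() / sd)
    return best

def null_best_metrics(returns, lookbacks, n_sets, rng):
    """Repeat the identical search on permuted data, keeping each best."""
    out = np.empty(n_sets)
    for i in range(n_sets):
        shuffled = rng.permutation(returns)  # destroys any time structure
        out[i] = best_metric(shuffled, lookbacks)
    return out

rng = np.random.default_rng(0)
real_returns = rng.normal(0.0, 0.01, size=500)  # placeholder for real data
lookbacks = range(2, 21)                        # the assumed search grid

red = null_best_metrics(real_returns, lookbacks, n_sets=200, rng=rng)
print("red mean:", red.mean(), "red sd:", red.std())
```

The important design point is that `best_metric` is called unchanged on every random data set, so the red values measure the whole search process rather than any single rule.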

We must now suppose that this power to uncover spurious relationships also exists in our original set of tests on the real data, and it must be accounted for. For purposes of illustration I'm going to take a naive approach and take 4 times the standard deviation plus the mean of the red distribution and shift our original green distribution to the right by an amount equal to this sum, a value of 0.5 + 4 x 0.2 = 1.3, thus:-
We now see that our original test metric value of 5.2, which was well out in the tail of the non-shifted green distribution, is comfortably within the tail of the shifted distribution, and depending on our choice of significance level etc. we may not be able to reject our null hypothesis, whatever it may have been.
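The arithmetic of this naive shift can be written out as a small sketch. The numbers are the made-up values used in this post, and `shifted_tail_fraction` is a hypothetical helper which simply reports what fraction of the shifted green values lie at or beyond the best result.

```python
import numpy as np

red_mean, red_sd = 0.5, 0.2      # made-up values from the text
best_real = 5.2                  # best metric found on the real data

# Naive discount: mean plus 4 standard deviations of the red distribution.
shift = red_mean + 4 * red_sd

def shifted_tail_fraction(green, best, shift):
    """Fraction of green values that reach `best` after shifting right."""
    green = np.asarray(green, dtype=float)
    return float(np.mean(green + shift >= best))

print("shift:", round(shift, 2))
# Shifting green right by `shift` is the same as moving the cursor left.
print("best relative to unshifted green:", round(best_real - shift, 2))
```

Given the array of green results from the original search, `shifted_tail_fraction(green, best_real, shift)` gives a rough idea of how far into the tail the best result remains after the shift.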

As I warned readers above, this is not supposed to be a mathematically rigorous exposition of how to account for data mining bias, but rather an illustrative explanation of the principle(s) behind accounting for it. The main takeaway is that the red distribution, whatever it is for the test(s) you are running, needs to be generated and then the tests on real data need to be appropriately discounted by the relevant measures taken from the red distribution before any inferences are drawn about the efficacy of the results on the real data.
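One possible way of using the red distribution, when the null hypothesis is that the best result is purely a product of the search, is to compare the best real result directly with the red values and compute a Monte Carlo p-value. The sketch below is an assumption about how that might be done, not a description of a test carried out for this post. The function name is hypothetical and the red values are stand-ins drawn from a normal distribution with the made-up mean and standard deviation used above.

```python
import numpy as np

def data_mining_p_value(best_real, red):
    """Share of null 'best' results at least as good as the real best.

    The +1 terms count the real result as one draw under the null, which
    keeps the estimate from ever being exactly zero.
    """
    red = np.asarray(red, dtype=float)
    n_at_least = np.count_nonzero(red >= best_real)
    return (1 + n_at_least) / (1 + red.size)

# Stand-in red distribution; in practice these values come from rerunning
# the full search on each random data set.
red = np.random.default_rng(0).normal(0.5, 0.2, size=10_000)

print(data_mining_p_value(5.2, red))  # a best result far beyond the red values
print(data_mining_p_value(0.9, red))  # a best result inside the red right tail
```

A small p-value says the search rarely does this well on meaningless data; a large one says the "best" system is indistinguishable from what the search finds by chance.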

For more information about data mining tests readers might care to visit a GitHub repository I have created, which contains code and some academic papers on the subject.


Michael Harris said...

Interesting blog about a real problem.

"Having obtained our random data set, we the[n] perform the exact same search as we did above"

With machine learning it's rare to have an "exact same search". Based on this fact, the method is a good first effort but must take that into account. For example, you can do 1000 searches and make a distribution of the metric for the best rule. Then shift the original distribution. However, I'm afraid that when this is done no machine learning method will turn out significant results.

Good blog. Michael Harris

Anonymous said...

You say "take 4 times the standard deviation plus the mean of the red distribution and shift our original green distribution to the right by an amount equal to this". I don't get this? Why shift the green distribution? I would have thought that the test had already shown the red had a much smaller density than the green in the right tip area of the green, and therefore a system in that area was much more likely to have come from an influence present in the green but not the red. I'm really missing something ... Jonathan

Scott W said...

Nice work

I see the value of the first couple of steps, nice work, thanks.

I am not convinced -- perhaps you could add a more detailed explanation -- of the theory, logic and math that allowed you to conclude that your original result was safely real, and not spurious. I am less concerned about the exact calculation - which is likely going to have a lot of leaps, vs a rigorous sequence of logical assumptions.

Kind Regards,

Dekalog said...

@Anonymous - the number for 4x the sd was just made up, as was the idea of shifting the green distribution. You are quite right that if the null hypothesis being tested is whether the "best" is just a result of data mining bias then the shifting of the green distribution is irrelevant - this best should be compared directly to the red distribution. I did say that the test I was describing was "naive" and perhaps I should have explained that more clearly in the main blog post. The main point I was trying to make is that the red distribution should be created and then used in some way. The method of creating this distribution and how it will be used is very specific to the null hypothesis being tested. In due course, when I do these tests for myself on my results, I shall post more details.

@Scott - see my reply to anonymous above. There was no intent to accurately describe any particular hypothesis test - all the values etc. were simply made up to illustrate a general principle that one should try to account for the data mining bias in some way.

Krzysztof F said...

Great stuff !!! More practical info about permutation tests is here

However permutation methods are applied to out of sample trades and your method to in sample or ??


Dekalog said...

@ Krzysztof Thanks for the link.