
Wednesday, 24 June 2015

Results of Permutation tests on Cauchy-Schwarz

Following on from my previous post, I have to say that the results have been quite disappointing - in all the tests I have conducted so far I have been unable to reject the null hypothesis. These tests are still ongoing and I may yet find some gold, but it is looking increasingly unlikely, and so I won't bore readers with the results of these negative tests. Unless something drastically changes I am planning on abandoning the Cauchy-Schwarz matching algorithm, at least in its current form.

For those who are interested, the test I am conducting is the data mining bias adjusted random permutation test, which is implemented in the position_vector_permutation_test.cc file on my Data Snooping Tests page on Github.

Soon I shall be going away for the summer and, although I will continue working on a borrowed laptop, I am not sure what internet access I will have, so this post may be my last substantive post until some time in September. I am taking a lot of reading with me, along with all my Octave code and data, and I have loaded up my R installation with lots of interesting packages to play around with some new ideas, so hopefully there will be some interesting new things to post about in the autumn.

Monday, 25 May 2015

Accounting for Data Mining Bias

I've recently subscribed to this forexfactory thread, which is about using machine learning to develop trading systems, and the subject of data mining/data dredging has come up. This post is a short description of how mining/dredging can be accounted for, but readers should be aware that the following is not a precise description of any particular test with accompanying code, but rather a hand-wavy description of a general but important principle.

Suppose one has conducted a whole series of tests on a particular set of data with a view to developing a trading system. The precise nature of this is not really important - it could be some machine learning approach, a grid search of moving average crossover parameter values, a series of elimination contests to find the "best" indicator, or whatever. While doing this we keep a record of all our results and when the search is complete we plot a histogram thus:-
which is the result of 160,000 distinct tests plotted in 200 bins. Naturally, having done this, we select the best system found, represented by the vertical cursor line at x-axis value 5.2. This 5.2 is our test metric of choice, be it Sharpe ratio, win to loss ratio, or whatever. But then we ask ourselves whether we have truly found a world beating system, or whether this discovery is merely the result of data mining.

To test this, we create a random set of data which has the same attributes as the real data used above. The random data can be obtained by bootstrapping, random permutation, application of a Markov chain with state spaces derived from the original data, etc. The actual choice of which to use will depend on the null hypothesis one wants to test. Having obtained our random data set, we then perform the exact same search as we did above and record the test metric of the best performing system found on this random data set. We repeat this 160,000 times and then plot a histogram (in red) of the best test results over all these random data sets:-
We find that this random set has a mean value of 0.5 and a standard deviation of 0.2. What this red test set represents is the ability/power of our machine learning algo, grid search criteria etc. to uncover "good" systems even in meaningless data, where all relationships are, in effect, spurious and contain no predictive ability.
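
A minimal Octave sketch of this procedure, assuming a hypothetical permute_data() standing in for whichever randomisation scheme has been chosen and a hypothetical run_search() standing in for the mining process itself (neither is code from my repository), might look something like this:-
n_iters = 160000 ; % number of random data sets to search
best_random_metrics = zeros( n_iters, 1 ) ;

for ii = 1 : n_iters
% bootstrap / permutation / Markov chain etc. of the real data
random_data = permute_data( real_data ) ;
% best test metric the search can find in this meaningless data
best_random_metrics( ii ) = run_search( random_data ) ;
end % end of ii loop

red_mean = mean( best_random_metrics ) ; % 0.5 in the example above
red_std = std( best_random_metrics ) ; % 0.2 in the example above
hist( best_random_metrics, 200 ) ; % the "red" distribution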

We must now suppose that this power to uncover spurious relationships also exists in our original set of tests on the real data, and that it must be accounted for. For purposes of illustration I'm going to take a naive approach: take the mean of the red distribution plus 4 times its standard deviation and shift our original green distribution to the right by an amount equal to this sum, a value of 1.3, thus:-
We now see that our original test metric value of 5.2, which was well out in the tail of the non-shifted green distribution, is comfortably within the tail of the shifted distribution, and depending on our choice of p-value etc. we may not be able to reject our null hypothesis, whatever it may have been.
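
Purely to make the arithmetic of this naive adjustment concrete (the figures are just those quoted above, and the variable names are illustrative only), the discounting might be written as:-
red_mean = 0.5 ; % mean of best metrics found on random data
red_std = 0.2 ; % standard deviation of those best metrics
shift = red_mean + 4 * red_std ; % 0.5 + 4 * 0.2 = 1.3

best_real_metric = 5.2 ; % best metric found on the real data
% discounting the real result in this way is equivalent to shifting the green distribution right by 1.3
adjusted_metric = best_real_metric - shift ; % 3.9, to be judged against the unshifted distribution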

As I warned readers above, this is not supposed to be a mathematically rigorous exposition of how to account for data mining bias, but rather an illustrative explanation of the principle(s) behind accounting for it. The main takeaway is that the red distribution, whatever it is for the test(s) you are running, needs to be generated, and then the tests on real data need to be appropriately discounted by the relevant measures taken from the red distribution before any inferences are drawn about the efficacy of the results on the real data.

For more information about data mining tests readers might care to visit a Github repository I have created, which contains code and some academic papers on the subject. 

Thursday, 11 December 2014

MFE/MAE Indicator Test Results

Following on from the previous post, the test I outlined in that post wasn't very satisfactory, which I put down to the fact that the sigmoid transformation of the raw MFE/MAE indicator values is not amenable to the application of standard deviation as a meaningful measure. Instead, I changed the test to one based on the standard error of the mean, an example screen shot of which is shown below:-
The top pane shows the long version of the indicator and the bottom pane the short version. In each there are upper and lower limits of the sample standard error of the mean above and below the population mean (the mean of all values of the indicator), along with the cumulative mean value of the top N matches as shown on the x-axis. In this particular example it can be seen that around the 170-180 samples mark the cumulative mean moves inside the standard error limits, never to leave them again. The meaning I ascribe to this is that there is no value to be gained from using more than approximately 180 samples for machine learning purposes, for this example, as to use more samples would be akin to training on all available data, which makes the use of my Cauchy Schwarz matching algo superfluous.

I repeated the above on all instances of sigmoid transformed and untransformed MFE/MAE indicator values to get an average of 325 samples for transformed, and an average of 446 samples for untransformed indicator values, across the 4 major forex pairs. Based on this, I have decided to use the top 450 Cauchy Schwarz matches for training purposes, which has ramifications for model complexity that will be discussed shortly.
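
For readers who want to reproduce this sort of plot, a rough Octave sketch of one interpretation of the test follows; top_match_values is assumed to be a vector of indicator values for the top matches ordered from best match to worst, and all_values a vector of all values of the indicator, both purely illustrative names:-
N = length( top_match_values ) ;
sample_sizes = ( 1 : N )' ;

% cumulative mean of the indicator values of the top 1, 2, ..., N matches
cumulative_mean = cumsum( top_match_values ) ./ sample_sizes ;

% population mean and standard error of the mean for each sample size
population_mean = mean( all_values ) ;
std_error = std( all_values ) ./ sqrt( sample_sizes ) ;
upper_limit = population_mean + std_error ;
lower_limit = population_mean - std_error ;

plot( sample_sizes, cumulative_mean, sample_sizes, upper_limit, sample_sizes, lower_limit ) ;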

Returning to the above screen shot, the figure 2 inset shows the price bars that immediately follow the price bar for which the main screen shows the top N matches. Looking at the extreme left of the main screen it can be seen that the lower pane (short) indicator has an almost maximum reading of 1, whilst the upper pane (long) indicator shows a value of approx. 0.27, which is not much above the global minimum for this indicator and well below the 0.5 neutral level. This strongly suggests a short position, and looking at the inset figure it can be seen that over the 3 days following the extreme left matched bar a short position was indeed the best position to hold. This is a pattern that seems to present itself frequently during visual inspection of charts, although I am unable to quantify this in any way.

On the matter of model complexity alluded to above, I found the Learning From Data course I have recently completed on the edX platform to be very enlightening, particularly the concept of the VC dimension, which is nicely explained in the Learning From Data Video library. I'll leave it to interested readers to follow the links, but the big takeaway for me is that, using the rule of thumb of roughly ten training examples per unit of VC dimension, the 450 samples described above imply that my final machine learning model must have an upper bound of approximately 45 on the VC dimension, which in turn implies a maximum of roughly 45 weights in the neural net. This is a design constraint that I will discuss in a future post.
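
As a rough back-of-the-envelope illustration of this constraint (the network dimensions below are purely hypothetical and are not the architecture I will actually use), the weight count of a single hidden layer net, including bias weights, can be checked against the bound like so:-
n_samples = 450 ;
vc_bound = n_samples / 10 ; % approx. 45 using the ten-examples-per-VC-dimension rule of thumb

% a single hidden layer net with n_inputs inputs, n_hidden hidden units and one output
n_inputs = 4 ; % illustrative figures only
n_hidden = 7 ;
n_weights = ( n_inputs + 1 ) * n_hidden + ( n_hidden + 1 ) ; % 43, just within the bound of 45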



  

Sunday, 25 May 2014

Updated Cauchy-Schwarz Matching Algorithm

Following on from my previous post, below is a code box showing a slightly improved Cauchy-Schwarz matching algorithm, improved in the sense that this implementation gives a slightly better effect size over random when its test runs are compared with those of the previous post's version.
function [ top_matches ] = rolling_cauchy_schwarz_matching_algo_2( open_ch, high_ch, low_ch, close_ch, period )

% pre-allocate vectors in memory
cauchy_schwarz_values = zeros( size( close_ch, 1 ), 1 ) ;
top_matches = zeros( size( close_ch, 1 ), 100 ) ;

% select price bar to train nn on
for jj = size( close_ch, 1 ) - 250 : size( close_ch, 1 )

  lookback = period( jj ) ;

  % feature row vector for the current bar: channel normalised closes over the
  % adaptive lookback, the last 10 highs and lows, and the last 5 close-to-open changes
  sample_to_match = [ close_ch( jj-lookback : jj )' high_ch( jj-9 : jj )' low_ch( jj-9 : jj )' ( close_ch( jj-4 : jj ) - open_ch( jj-4 : jj ) )' ] ;
  norm_sample_to_match = norm( sample_to_match ) ;

  % for this jj train_bar, calculate cauchy_schwarz matching values in the historical record up to index jj-2
  for ii = 50 : jj - 2
    historical_features = [ close_ch( ii-lookback : ii ) ; high_ch( ii-9 : ii ) ; low_ch( ii-9 : ii ) ; ( close_ch( ii-4 : ii ) - open_ch( ii-4 : ii ) ) ] ;
    % normalised inner product, bounded above by 1 via the Cauchy-Schwarz inequality
    cauchy_schwarz_values( ii ) = abs( sample_to_match * historical_features ) / ( norm_sample_to_match * norm( historical_features ) ) ;
  end % end of ii loop

  % get the top 100 matches for this price bar
  [ s, sort_index ] = sort( cauchy_schwarz_values ) ;
  top_matches( jj, : ) = sort_index( end-99 : end )' ;

end % end of jj loop

end % end of function
The inputs are channel normalised prices, with the length of the channel being adaptive to the dominant cycle period. This function is called as part of a rolling neural net training regime to select the top n (n = 100 in this case) matches in the historical record as training data. The actual NN training code is a close adaptation of the code in my neural net walkforward training post, but with a couple of important caveats which are discussed below.
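
By way of illustration, a call to this function might look something like the following, where the *_ch inputs are assumed to already be channel normalised price column vectors and period is the corresponding vector of dominant cycle periods (the channel normalisation itself is not shown here):-
% open_ch, high_ch, low_ch and close_ch : column vectors of channel normalised prices
% period : vector of dominant cycle periods, one per bar
top_matches = rolling_cauchy_schwarz_matching_algo_2( open_ch, high_ch, low_ch, close_ch, period ) ;

% indices of the 100 closest historical matches for the most recent bar,
% returned in ascending order of similarity, so flip to get the best match first
best_match_indices = fliplr( top_matches( end, : ) ) ;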

Firstly, when training a feedforward neural network it is normal for a certain number of samples to be held out of the training set for use as a cross validation set. The point of this is to ensure that the trained NN will generalise well to as yet unseen data. In the case of my rolling training regime this does not apply. The NN that is being trained for the "current bar" will be used once to classify the "current bar" and then thrown away. The "next bar" will have a completely new NN trained specifically for it, which in its turn will be discarded, and so on along the whole price history. There is no need to ensure generalisation of any specifically trained NN. This being the case, all the training set examples are used in training, and early stopping is implemented by a crude heuristic based on classification accuracy on the training set: training stops when the classification error rate on the whole training set is <= 5%. Further experience with this in the future may lead me to make some adjustments, but for now this is what I am going with.
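
A skeleton of this early stopping heuristic, written around hypothetical train_one_epoch() and classify() helpers rather than the actual code from my walk forward training post, might look like this:-
max_epochs = 1000 ;
target_error_rate = 0.05 ; % stop when training set error <= 5%

% net is assumed to have been initialised elsewhere, and training_features /
% training_targets to hold the top matched samples and their class labels
for epoch = 1 : max_epochs
  net = train_one_epoch( net, training_features, training_targets ) ; % hypothetical helper
  predictions = classify( net, training_features ) ; % hypothetical helper
  error_rate = mean( predictions != training_targets ) ;
  if ( error_rate <= target_error_rate )
    break ; % crude early stopping on training set accuracy
  end
end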

A second reason for adopting this approach stems from my reading of this book, wherein it is stated that on financial time series the "traditional" machine learning error metrics can be misleading. It cites a (theoretical?) example of a profitable trading system that has been trained/optimised for maximum profit but has a counter-intuitive, negative R-squared. The explanation for this lies in the heavy tails of price distribution(s). It is in these tails that the extreme returns reside and where the big profits/losses are to be made. However, by using a more traditional error metric such as least squares, an ML algorithm might concentrate on the central area of a price distribution in order to reduce the error metric on the majority of price instances and thereby ignore the tails, producing a nice, low error but a useless system. The converse can be true for a good system, in that the ML least squares metric can be rubbish but the relevant performance metric (max profit, min drawdown, risk adjusted return etc.) of the system can be great.

It is for these reasons that I have adopted my current approach.

Thursday, 3 January 2013

The Coin Toss Experiment

" 'The coin toss experiment' provides an indication that when one comes across a process that generates many system alternatives with many equity curves, some acceptable and some unacceptable, one may get fooled by randomness. Minimizing data-mining and selection bias is a very involved process for the most part outside the capabilities of the average user of such processes," taken from a recent addition to the blogroll. Interesting!