Dekalog Blog

Wednesday, 5 December 2012

Neural Net Market Classifier to Replace Bayesian Market Classifier

I have now completed the cross validation test I wanted to run, which compares my current Bayesian classifier with the recently retrained "reserve neural net," the results of which are shown in the code box below. The test consists of 50,000 random iterations of my usual "ideal" 5 market types with the market classifications from both of the above classifiers being compared with the actual, known market type. There are 2 points of comparison in each iteration: the last price bar in the sequence, identified as "End," and a randomly picked price bar from the 4 immediately preceding the last bar, identified as "Random."

Number of times to loop: 50000
Elapsed time is 1804.46 seconds.

Random NN
Complete Accuracy percentage: 50.354000

"Acceptable" Mis-classifications percentages 
Predicted = uwr & actual = unr: 1.288000
Predicted = unr & actual = uwr: 6.950000
Predicted = dwr & actual = dnr: 1.268000
Predicted = dnr & actual = dwr: 6.668000
Predicted = uwr & actual = cyc: 3.750000
Predicted = dwr & actual = cyc: 6.668000
Predicted = cyc & actual = uwr: 2.242000
Predicted = cyc & actual = dwr: 2.032000

Dubious, difficult to trade mis-classification percentages 
Predicted = uwr & actual = dwr: 2.140000
Predicted = unr & actual = dwr: 2.140000
Predicted = dwr & actual = uwr: 2.500000
Predicted = dnr & actual = uwr: 2.500000

Completely wrong classifications percentages 
Predicted = unr & actual = dnr: 0.838000
Predicted = dnr & actual = unr: 0.716000

End NN
Complete Accuracy percentage: 48.280000

"Acceptable" Mis-classifications percentages 
Predicted = uwr & actual = unr: 1.248000
Predicted = unr & actual = uwr: 7.630000
Predicted = dwr & actual = dnr: 0.990000
Predicted = dnr & actual = dwr: 7.392000
Predicted = uwr & actual = cyc: 3.634000
Predicted = dwr & actual = cyc: 7.392000
Predicted = cyc & actual = uwr: 1.974000
Predicted = cyc & actual = dwr: 1.718000

Dubious, difficult to trade mis-classification percentages 
Predicted = uwr & actual = dwr: 2.170000
Predicted = unr & actual = dwr: 2.170000
Predicted = dwr & actual = uwr: 2.578000
Predicted = dnr & actual = uwr: 2.578000

Completely wrong classifications percentages 
Predicted = unr & actual = dnr: 1.050000
Predicted = dnr & actual = unr: 0.886000

Random Bayes
Complete Accuracy percentage: 19.450000

"Acceptable" Mis-classifications percentages 
Predicted = uwr & actual = unr: 7.554000
Predicted = unr & actual = uwr: 2.902000
Predicted = dwr & actual = dnr: 7.488000
Predicted = dnr & actual = dwr: 2.712000
Predicted = uwr & actual = cyc: 5.278000
Predicted = dwr & actual = cyc: 2.712000
Predicted = cyc & actual = uwr: 0.000000
Predicted = cyc & actual = dwr: 0.000000

Dubious, difficult to trade mis-classification percentages 
Predicted = uwr & actual = dwr: 5.730000
Predicted = unr & actual = dwr: 5.730000
Predicted = dwr & actual = uwr: 5.642000
Predicted = dnr & actual = uwr: 5.642000

Completely wrong classifications percentages 
Predicted = unr & actual = dnr: 0.162000
Predicted = dnr & actual = unr: 0.128000

End Bayes
Complete Accuracy percentage: 24.212000

"Acceptable" Mis-classifications percentages 
Predicted = uwr & actual = unr: 8.400000
Predicted = unr & actual = uwr: 2.236000
Predicted = dwr & actual = dnr: 7.866000
Predicted = dnr & actual = dwr: 1.960000
Predicted = uwr & actual = cyc: 6.142000
Predicted = dwr & actual = cyc: 1.960000
Predicted = cyc & actual = uwr: 0.000000
Predicted = cyc & actual = dwr: 0.000000

Dubious, difficult to trade mis-classification percentages 
Predicted = uwr & actual = dwr: 5.110000
Predicted = unr & actual = dwr: 5.110000
Predicted = dwr & actual = uwr: 4.842000
Predicted = dnr & actual = uwr: 4.842000

Completely wrong classifications percentages 
Predicted = unr & actual = dnr: 0.048000
Predicted = dnr & actual = unr: 0.040000

A Quick Analysis

Looking at the figures for complete accuracy it can be seen that the Bayesian classifier is not much better than randomly guessing, with 19.45% and 24.21% for "Random Bayes" and "End Bayes" respectively. The corresponding accuracy figures for the NN are 50.35% and 48.28%.
In the "dubious, difficult to trade" mis-classification category Bayes gets approx. 22% and 20% this wrong, whilst for the NN these figures halve to approx. 9.5% and 9.5%.
In the "acceptable" mis-classification category Bayes gets approx. 29% and 29%, with the NN being more or less the same.

Although this is not a completely rigorous test, I am satisfied that the NN has shown its superiority over the Bayesian classifier. Also, I believe that there is significant scope to improve the NN even more by adding additional features, changes in architecture and use of the Softmax unit etc. As a result, I have decided to gracefully retire the Bayesian classifier and deploy the NN classifier in its place.

Wednesday, 28 November 2012

Geoff Hinton's Coursera Course Almost Ended

I am now in the final week of the course (see previous post) and just have the final exam to complete. The course has been very intensive, very interesting and much more difficult than the first machine learning course I took. Personally, the big take aways from this course for the things that I want to do are:

Softmax activation function for output layers. I intend to replace my current use of the Sigmoid function in the output layer of my standby neural net with this Softmax function. The Softmax is far more suitable for my intended classification purposes.
Octave code for using momentum to speed up the training of a neural net.
Restricted Boltzmann machines, the stacking thereof and deep learning, and unsupervised learning. I shall talk more about this in a future post.

With regard to the training of my standby neural net, it is mostly completed. I say mostly because as soon as I learned about the above mentioned items I stopped training it once I had trained it on sufficient data to cover almost 99% of the dominant cycle periods to be found in the data. It seemed pointless to continue training it with increasing training times and diminishing returns, particularly since it is destined to be remodelled and retrained using what I've just learned. For now I will subject it to cross validation testing and if it passes this, I shall deploy it for a short period until such time as it is replaced by the neural net I have in mind following on from the course.

Thursday, 4 October 2012

Change in Neural Net Training Plans

After having spent the last few days training my NNs and seeing how long it is taking on my new data I have decided to change my training plans. I had been simultaneously training (on two separate computers) my decision tree idea alongside a more "normal" multi-class NN in the hope of eventually comparing the two. However, I anticipate that if I continued with this two pronged approach it would take about a month to finish, and I'd like quicker results than that. Also my attempt to use the hyperbolic tangent activation function hasn't been too successful and I'm not sure whether it's my coding or some deeper theoretical reason why it isn't working satisfactorily. Another reason is that the Coursera Neural Nets for Machine Learning course has just started, the syllabus for which is shown below:-

Lecture 1: Introduction
Lecture 2: The Perceptron learning procedure
Lecture 3: The backpropagation learning procedure
Lecture 4: Learning feature vectors for words
Lecture 5: Object recognition with neural nets
Lecture 6: Optimisation: How to make the learning go faster
Lecture 7: Recurrent neural networks and advanced optimisation
Lecture 8: How to make neural networks generalise better
Lecture 9: Combining multiple neural networks to improve generalisation
TOPICS TO BE COVERED IN LECTURES 10-16
Deep Autoencoders (including semantic hashing and image search with binary codes)
Hopfield Nets and Simulated Annealing
Boltzmann machines and the general learning algorithm
Restricted Boltzmann machines and contrastive divergence learning
Applications of Restricted Boltzmann machines to collaborative filtering and document modelling.
Stacking restricted Boltzmann machines or shallow autoencoders to make deep nets.
The wake-sleep algorithm and its contrastive version
Recent applications of generatively pre-trained deep nets
Deep Boltzmann machines and how to pre-train them
Modelling hierarchical structure with neural nets

I think that rather than ploughing on with the training of my decision tree NN it would perhaps be better to finish this course before I get too carried away with myself with new NN ideas; for example, lecture 9, or the "stacking of Boltzman machines," might give me much better insight to the issues involved.

For these reasons I have decided to retrain my "reserve NN" on my enlarged data set with my new feature set, using both computers available to me, whilst I work through the above course. I expect that this reserve NN will be fully trained before the course ends, so then I will be free to experiment with my newly acquired knowledge.

Monday, 1 October 2012

Dominant Cycle Periods in End of Day Data

As part of the training of my neural net classifier it has become necessary, in order to speed up the training process, for me to have knowledge of the relative frequency of occurrences of specific dominant cycle periods. This is due to my having increased the size of my training data to almost five and a half million training/cross validation examples of my feature set. Given the amount of data now involved in training I have to subset it by period and train my neural nets on one period at a time. To decide the period training order I created a histogram of historical, dominant cycle period occurrences in all the commodity contracts/forex pairs I follow for all the data I have. The histogram is shown below.

The bin range is from 10 to 50, with a maximum occurring, the cyan line, at 18. The NN training schedule will follow this order:- 18, 19, 17, 20 etc... working downwards along those bins with the remaining outstanding maximum occurrences.

Looking at this histogram I'm struck by a couple of things: although all the indicators I use are adaptive to the current measured period I'm sure many people use "classical" TA indicators with their recommended default values, for example the 14 day RSI or 20 day Bollinger Band. The above histogram shows how "hit and miss" this can be. The 20 day Bollinger Band would seem to be OK with its look back parameter approximately covering the full cyclic period of the most frequently occurring periods in the data, but the RSI? The theoretical optimum for the RSI is half a period, which at 14 is a cyclic period of 28 days, the red line in the histogram. Not so good! And what about the Commodity Channel Index (CCI)? Its creator originally recommended a one third period look back, which based on the histogram above would imply a default setting of 6 or 7. Just goes to show what value there is in some very basic statistical analysis.

Sunday, 23 September 2012

NN Training Update

The next round of cross validation tests has shown that 600 to 1000 epochs per NN is the optimum number for training purposes. Also, the NN architecture has slightly evolved into having two output nodes in the output layer, one for each class in the binary NN classifier.

In an earlier post I said that I would use a hyperbolic tangent function as the activation functions, as per Y. LeCun (1998). The actual function from this paper is$$f(x) = 1.7159 tanh( (2/3) x )$$and its derivative is
$$f'(x) = (2*1.7159/3) ( 1 - tanh^2 ( (2/3) x) )$$I have written Octave functions for this and they are now being used in the final CV test to determine the optimum regularisation term to be used during NN training.

Thursday, 20 September 2012

NN Architecture CV Tests Complete

The cross validation test results are shown below, with the x-axis being the number of nodes in the hidden layer and the y-axis being the mean squared error of the trained NN (500 epochs) on the validation data set

and an enlarged view of x-axis values 20 to 30.

Basically what these charts say is that having more than 26 nodes in the hidden layer does not appreciably improve the performance of the NN. As a result, the final architecture of my proposed NNs will be 48 nodes in the input layer, 26 nodes in the hidden layer and 1 node in the output layer.

As an aside to all the above, I have just read an interesting paper entitled "Using Trading Dynamics to Boost Quantitative Strategy Performance," available from here. Some very interesting concepts that I shall probably get around to investigating in due course.

Wednesday, 19 September 2012

Neural Net Classifier Architecture Testing

Having successfully integrated the FANN library I've changed my mind as to the design of my neural net classifier. Previously I had planned to train a series of classifiers, each "tuned" to a specific measured period in the data, and each to discriminate between 5 distinct market types - my normal cyclic, uwr, unr, dwr and dnr as I've talked about in earlier posts.

Now, however, I'm working on a revised design that incorporates elements of a decision tree, where neural nets sit at interior nodes of the tree. A simple schematic is shown below

The advantage of this model is that I only have to train 5 neural nets, represented by the 5 colours in the above schematic, and each net is a binary classifier (-1 and 1) with only one node in its output layer, with a decision rule of > 0 or < 0 for the activation function. Rather than have one neural net per period, the period is now one of the features in the features input vector.

At the moment the features vector has a length of 48, and as I write this my computer is churning through a cross validation test (using the FANN library) to determine the optimum number of nodes for the single hidden layer(s) of the neural nets. Once this is complete, I plan to run another series of cross validation tests to determine the optimum number of epochs to run during training. When these are complete I shall then run one final cross validation test to determine the optimum lambda for a regularisation term. This last set of tests will use the Octave code previously used because, as far as I can see, FANN library functions do not appear to have this capability.

One slight change to this Octave code will be to use the Hyberbolic tangent as the activation function (see Y. LeCun 1998, available as paper 86 here). I may also try to implement some other tricks recommended in this paper.

Pages