Saturday 7 July 2012

A Possible Brute Force Similarity Classifier in Octave Code

As part of the development of my neural net classifier it has been necessary to use training data and as usual I have been using my model market types. To increase the amount of such training data I have extended the range of the data to include a change in market type half way through the cycle of one measured cyclic period. I have done this in increments of 1 degree from 1 degree to 360 degrees of a sine wave, for periods 15 to 50, for all possible combinations of market type, for a total database of 324,000 possible market model patterns. However, it struck me after reading this pdf that I could use this database as the basis of what is called in the pdf a "brute force similarity search" classifier. Below is my proof of concept Octave code implementation of such a classifier,
% first, training data "training_data.mat" should be loaded in command line

clear -exclusive X y % clear everything except y and X, previously loaded from the command line

lookup_value = input( 'Enter a number from 1 to 324,000 to choose a lookup candidate row from X: ' ) ;
fprintf( 'Based on this choice the market type to look up is :- ' ) ;
y( lookup_value , 1 )

tic() ;

% index into training set based on period measurement
[i_X j_X] = find( X(:,1) == X( lookup_value , 1 ) ) ;

% keep a record of all i_X indexes
all_i_X = i_X ;

% extract the relevant part of X using above index
X_look_up_matrix = X( [i_X] , 4:54 ) ;

% and same for market labels vector y
y_look_up_vector = y( [i_X] , 1 ) ;

% find pattern in X_look_up_matrix that minimises Euclidean distance between itself and the training example randomly taken from X
[ euc_dist_min i_euc_dist_min ] = min( sum( ( repmat( X(lookup_value,4:54), size(X_look_up_matrix,1), 1) .- X_look_up_matrix ) .^ 2.0 , 2 , 'extra' ) ) ;

fprintf( 'and the algo returns a market type of :- ' ) ;
% take this minimum distance vector index to get predicted market type
y_look_up_vector( i_euc_dist_min , 1 )

fprintf( '\nwith a calculated Euclidean distance of :- ' ) ;
fprintf( 'which ideally should be 0.0 on this X test set.\n' ) ;

fprintf( '\nOriginal lookup row check.\n' ) ;
original_i_X_check = all_i_X( i_euc_dist_min , 1 )
fprintf( 'which ideally should be the same as row choice entered.\n' ) ;

fprintf( '\nTime for algo to run.\n' ) ;

toc() ;
where X is the database already mentioned and y is a vector containing the market type labels. Typical terminal output of this code is

octave:1> bf_pattern_recognition
Enter a number from 1 to 324,000 to choose a lookup candidate row from X: 100235
Based on this choice the market type to look up is :- ans =  3
and the algo returns a market type of :- ans =  3

with a calculated Euclidean distance of :- ans = 0
which ideally should be 0.0 on this X test set.

Original lookup row check.
original_i_X_check =  100235
which ideally should be the same as row choice entered.

Time for algo to run.
Elapsed time is 0.1130519 seconds.

Of course it obtains 100 % accuracy on the test set X because the original choice of pattern to be matched comes from X so there is always an exact match to be found. The important thing is that this is a workable algorithm which, making allowances for all the print statements included in the above code, runs in hundredths of a second.

This speed, despite having such a large database to search through, is achieved by indexing into the database by the measured period of the pattern to be matched, which is the first entry on each line. This reduces the search base down to a more manageable 9000 row matrix, and then one line of vectorised code is used to perform the actual Euclidean distance search and classification.

Another possible advantage of this approach on real market data is that, having hopefully accurately classified the data, the matched pattern in the database can be extrapolated under the assumption that the market model will persist for the next 5 to 10 bars, to make a prediction of near future prices. I shall certainly be doing more work will this classifying algorithm!

No comments: