Wednesday 26 May 2021

Update on Recent Matrix Profile Work

Since my previous post, on Matrix Profile (MP), I have been doing a lot of online reading about MP and going back to various source papers and code that are available at the UCR Matrix Profile page. I have been doing this because, despite my initial enthusiasm, the R tsmp package didn't turn out to be suitable for what I wanted to do, or perhaps more correctly I couldn't hack it to get the sort of results I wanted, hence my need to go to "first principles" and code from the UCR page.

Readers may recall that my motivation was to look for time series motifs that form "initial balance (IB)" set ups of Market Profile charts. The rationale for this is that different IBs are precursors to specific market tendencies which may provide a clue or an edge in subsequent market action. A typical scenario from the literature on Market Profile might be "an Open Test Drive can often indicate one of the day's extremes." If this is actually true, one could go long/short with a high confidence stop at the identified extreme. Below is a screenshot of some typical IB profiles:

where each letter typically represents a 30 minute period of market action. The problem is that Market Profile charts, to me at least, are inherently visual and therefore do not easily lend themselves to an algorithmic treatment, which makes it difficult to back test in a robust fashion. This is why I have been trying to use MP.

The first challenge I faced was how to preprocess price action data such as OHLC and volume such that I could use MP. In the end I resorted to using the mid-price, the high-low range and (tick) volume as proxies for market direction, market volatility and market participation. Because IBs occur over market opens, I felt it was important to use the volatility and participation proxies as these are important markers for the sentiment of subsequent price action. This choice necessitated using a multivariate form of MP, and I used the basic MP STAMP code that is available at Matrix Profile VI: Meaningful Multidimensional Motif Discovery, with some slight tweaks for my use case.

Having the above tools in hand, what should they be used for? I decided that Cluster analysis is what is needed, i.e. cluster using the motifs that MP could discover. For this purpose, I used the approach outlined in section 3.9 of the paper "The Swiss Army Knife of Time Series Data Mining." The reasoning behind this choice is that if, for example, an "Open Test Drive IB" is a real thing, it should occur frequently enough that time series sub-sequences of it can be clustered or associated with an "Open Test Drive IB" motif. If all such prototype motifs can be identified and all IBs can be assigned to one of them, subsequent price action can be investigated to check the anecdotal claims, such as quoted above.

My Octave code implementation of the linked Swiss Army Knife routine is shown in the code box below.

data = dlmread( '/path/to/mv_data' ) ;
skip_loc = dlmread( '/path/to/skip_loc' ) ;
skip_loc_copy = find( skip_loc ) ; skip_loc_copy2 = skip_loc_copy ; skip_loc_copy3 = skip_loc_copy ;
sub_len = 9 ;
data_len = size( data , 1 ) ;
data_to_use = [ (data(:,2).+data(:,3))./2 , data(:,2).-data(:,3) , data(:,5) ] ;

must_dim = [] ;
exc_dim = [] ;
[ pro_mul , pro_idx , data_freq , data_mu , data_sig ] = multivariate_stamp( data_to_use, sub_len, must_dim, exc_dim, skip_loc ) ;
original_single_MP = pro_mul( : , 1 ) ; ## just mid price
original_single_MP2 = original_single_MP .+ pro_mul( : , 2 ) ; ## mid price and hi-lo range
original_single_MP3 = original_single_MP2 .+ pro_mul( : , 3 ) ; ## mid price, hi-lo range and volume

## Swiss Army Knife Clustering
RelMP = original_single_MP ; RelMP2 = original_single_MP2 ; RelMP3 = original_single_MP3 ;
DissMP = inf( length( RelMP ) , 1 ) ; DissMP2 = DissMP ; DissMP3 = DissMP ; 
minValStore = [] ; minIdxStore = [] ; minValStore2 = [] ; minIdxStore2 = [] ; minValStore3 = [] ; minIdxStore3 = [] ;
## set up a recording matrix 
all_dist_pro = zeros( size( RelMP , 1 ) , size( data_to_use , 2 ) ) ;

for ii = 1 : 500
## reset recording matrix for this ii loop  
all_dist_pro( : , : ) = 0 ;

## just mid price
[ minVal , minIdx ] = min( RelMP ) ;
minValStore = [ minValStore ; minVal ] ; minIdxStore = [ minIdxStore ; minIdx ] ;
DissmissRange = data_to_use( minIdx : minIdx + sub_len - 1 , : ) ;
[ dist_pro , ~ ] = multivariate_mass (data_freq(:,1), DissmissRange(:,1), data_len, sub_len, data_mu(:,1), data_sig(:,1), data_mu(minIdx,1), data_sig(minIdx,1) ) ;
all_dist_pro( : , 1 ) = real( dist_pro ) ;
JMP = all_dist_pro( : , 1 ) ;
DissMP = min( DissMP , JMP ) ; ## dismiss all motifs discovered so far
RelMP = original_single_MP ./ DissMP ;
skip_loc_copy = unique( [ skip_loc_copy ; ( minIdx : 1 : minIdx + sub_len - 1 )' ] ) ;
RelMP( skip_loc_copy ) = 1 ;

## mid price and hi-lo range
[ minVal , minIdx ] = min( RelMP2 ) ;
minValStore2 = [ minValStore2 ; minVal ] ; minIdxStore2 = [ minIdxStore2 ; minIdx ] ;
DissmissRange = data_to_use( minIdx : minIdx + sub_len - 1 , : ) ;
[ dist_pro , ~ ] = multivariate_mass (data_freq(:,1), DissmissRange(:,1), data_len, sub_len, data_mu(:,1), data_sig(:,1), data_mu(minIdx,1), data_sig(minIdx,1) ) ;
all_dist_pro( : , 2 ) = real( dist_pro ) ;
[ dist_pro , ~ ] = multivariate_mass (data_freq(:,2), DissmissRange(:,2), data_len, sub_len, data_mu(:,2), data_sig(:,2), data_mu(minIdx,2), data_sig(minIdx,2) ) ;
all_dist_pro( : , 2 ) = all_dist_pro( : , 2 ) .+ real( dist_pro ) ;
JMP2 = all_dist_pro( : , 2 ) ;
DissMP2 = min( DissMP2 , JMP2 ) ; ## dismiss all motifs discovered so far
RelMP2 = original_single_MP2 ./ DissMP2 ;
skip_loc_copy2 = unique( [ skip_loc_copy2 ; ( minIdx : 1 : minIdx + sub_len - 1 )' ] ) ;
RelMP2( skip_loc_copy2 ) = 1 ;

## mid price, hi-lo range and volume
[ minVal , minIdx ] = min( RelMP3 ) ;
minValStore3 = [ minValStore3 ; minVal ] ; minIdxStore3 = [ minIdxStore3 ; minIdx ] ;
DissmissRange = data_to_use( minIdx : minIdx + sub_len - 1 , : ) ;
[ dist_pro , ~ ] = multivariate_mass (data_freq(:,1), DissmissRange(:,1), data_len, sub_len, data_mu(:,1), data_sig(:,1), data_mu(minIdx,1), data_sig(minIdx,1) ) ;
all_dist_pro( : , 3 ) = real( dist_pro ) ;
[ dist_pro , ~ ] = multivariate_mass (data_freq(:,2), DissmissRange(:,2), data_len, sub_len, data_mu(:,2), data_sig(:,2), data_mu(minIdx,2), data_sig(minIdx,2) ) ;
all_dist_pro( : , 3 ) = all_dist_pro( : , 3 ) .+ real( dist_pro ) ;
[ dist_pro , ~ ] = multivariate_mass (data_freq(:,3), DissmissRange(:,3), data_len, sub_len, data_mu(:,3), data_sig(:,3), data_mu(minIdx,3), data_sig(minIdx,3) ) ;
all_dist_pro( : , 3 ) = all_dist_pro( : , 3 ) .+ real( dist_pro ) ;
JMP3 = all_dist_pro( : , 3 ) ;
DissMP3 = min( DissMP3 , JMP3 ) ; ## dismiss all motifs discovered so far
RelMP3 = original_single_MP3 ./ DissMP3 ;
skip_loc_copy3 = unique( [ skip_loc_copy3 ; ( minIdx : 1 : minIdx + sub_len - 1 )' ] ) ;
RelMP3( skip_loc_copy3 ) = 1 ;

endfor ## end ii loop

There are a few things to note about this code:

  • the use of a skip_loc vector 
  • a sub_len value of 9
  • 3 different calculations for DissMP and RelMP vectors

i) The skip_loc vector is a vector of time series indices (Idx) for which the MP and possible cluster motifs should not be calculated to avoid identifying motifs from data sequences that do not occur in the underlying data due to the way I concatenated it during pre-processing, i.e. 7am to 9am, 7am to 9am, ... etc.

ii) sub_len value of 9 means 9 x 10 minute OHLC bars, to match the 30 minute A, B and C of the above IB screenshot.

iii)  3 different calculations because different combinations of the underlying data are used. 

This last part probably needs more explanation. A multivariate RelMP is created by adding together individual dist_pros (distance profiles), and the cluster motif identification is achieved by finding minimums in the RelMP; however, a minimum in a multivariate RelMP is generally a different minimum to the minimums of the individual, univariate RelMPs. What my code does is use a univariate RelMP of the mid price, and 2 multivariate RelMPs of mid price plus high-low range and mid price, high-low range and volume. This gives 3 sets of minValues and minValueIdxs, one for each set of data. The idea is to run the ii loop for, e.g. 500 iterations, and to then identify possible "robust" IB cluster motifs by using the Octave intersect function to get the minIdx that are common to all 3 sets of Idx data. 

By way of example, setting the ii loop iteration to just 100 results in only one intersect Idx value on some EUR_USD forex data, the plot of which is shown below:

Comparing this with the IB screenshot above, I would say this represents a typical "Open Auction" process with prices rotating upwards/downwards with no real conviction either way, with a possible long breakout on the last bar or alternatively, a last upwards test before a price plunge.

My intent is to use the above methodology to get a set of candidate IB motifs upon which a clustering algorithm can be based. This clustering algorithm will be the subject of my next post.