Sunday 29 November 2020

Temporal Clustering on Real Prices, Part 2

Below are some more out of sample plots for the Temporal Clustering solutions of the EUR_USD forex pair for the week just gone. The details of how these solutions are derived are explained in my previous post, Temporal Clustering on Real Prices. First is Tuesday's solution

where the major turns (blue vertical lines) are a combination of optimal K values of 6 and 7 (5 sets of data in total), plus 2 sets of data each for K = 9 and 11 (red and green vertical lines). The price plot is
Next up is Wednesday's solution
where the blue vertical lines represent 5 sets of data with K = 6, and the red and green vertical lines represent 3 sets and 1 set with K = 9 and K = 11 respectively. The price plot is
Thursday's solution is
where the black/blue vertical lines are K values of 9 and 6 respectively, whilst the green/red lines are K values of 10 and 7. Thursday's price plot is
Finally, Friday's solution is
where the major blue vertical lines are K = 9 over 5 sets of data, with the remaining 4 sets of data contributing K = 5, 6 and 11. Friday's price plot is

The above seems to tie in nicely with my previous post about Forex Intraday Seasonality, with the turning points identified above marking the end points of those intraday tendencies to trend. Readers might also be interested in another paper I have come across, Segmentation and Time-of-Day Patterns in Foreign Exchange Markets, which offers a possible theoretical explanation as to why such patterns manifest themselves. In particular, for the EUR_USD pair, the paper states

  • "the US dollar appreciates significantly from 8:00 to 12:00 GMT
    and the euro appreciates significantly from 16:00 to 22:00 GMT"

Readers can judge for themselves whether this appears to be true, out of sample, by inspecting the above plots. Enjoy!

Tuesday 24 November 2020

Temporal Clustering on Real Prices

Having now had time to run the code shown in my previous post, Temporal Clustering, Part 3, in this post I want to show the results on real prices.

Firstly, I have written two functions in Octave to identify market turning points. Each function takes an n_bar argument which determines the lookback/lookforward length along the price series used to determine local relative highs and lows. I ran both functions for n_bar values of 1 to 15 inclusive on EUR_USD forex 10 minute bars from July 2012 up to and including last week's set of 10 minute bars. I created 3 sets of turning point data per function by averaging the function outputs over n_bar ranges 1 - 15, 1 - 6 and 7 - 15, and created a further 3 sets by averaging the outputs of the two functions together over the same ranges. In total this gives 9 slightly different sets of turning point data.
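The two turning point functions themselves are in Octave, but the construction of the data sets is easy to sketch. Below is a minimal R illustration, assuming price is a vector of 10 minute closes and using a single hypothetical stand-in local high/low finder in place of my two actual functions:

## a stand-in local high/low finder: marks price[ ii ] as a turn if it is
## the max or min of a +/- n_bar window (my actual code uses two different
## Octave functions, but the averaging logic is the same)
find_turns = function( price , n_bar ) {
  turns = rep( 0 , length( price ) )
  for ( ii in ( n_bar + 1 ) : ( length( price ) - n_bar ) ) {
    w = price[ ( ii - n_bar ) : ( ii + n_bar ) ]
    if ( price[ ii ] == max( w ) || price[ ii ] == min( w ) ) { turns[ ii ] = 1 }
  }
  turns
}

## one column of turning point indications per n_bar value 1 to 15
tp = sapply( 1 : 15 , function( n ) find_turns( price , n ) )

## 3 sets of turning point data: averages over n_bar 1 - 15, 1 - 6 and 7 - 15
sets = list( rowMeans( tp ) , rowMeans( tp[ , 1 : 6 ] ) , rowMeans( tp[ , 7 : 15 ] ) )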

I then ran the optimal K clustering code, shown in previous posts, over each set of data to get the "solutions" per set of data. Six of the sets had an optimal K value of 8 and a combined plot of these is shown below.

For each "solution" turning point ix (ix ranges from 1 to 198) a turning point value of 1 is added to get a sort of spike train plot through time. The ix = 1 value is 22:00 BST on Sunday and ix = 198 is 06:50 BST on Tuesday. I chose this range so that there would be a buffer at each end of the time range I am really interested in: 7:00 BST to 22:00 BST, which covers the time from the London open to the New York close. The vertical blue lines are plotted for clarity to help identify the the turns and are plotted as 3 consecutive lines 10 minutes apart. The added text shows the time of occurence of the first bar of each triplet of lines, the time being London BST. The following second plot is the same as above but with the other 3 "solutions" of K = 5, 10 and 11 added.
For those readers who are familiar with the Delta Phenomenon the main vertical blue lines could conceptually be thought of as MTD lines with the other lines being lower timeframe ITD lines, but on an intraday scale. However, it is important to bear in mind that this is NOT a Delta solution and therefore rules about numbering, alternating highs and lows and inversions etc. do not apply. It is more helpful to think in terms of probability and see the various spikes/lines as indicating times of the day at which there is a higher probability of price making a local high or low. The size of a move after such a high or low is not indicated, and the timings are only approximate or alternatively represent the centre of a window in which the high or low might occur.
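As an aside, the spike train construction described above is nothing more than a sum of indicator vectors. A minimal R sketch follows, where the centre ix values are purely illustrative stand-ins for each solution's rounded res$centers output:

## purely illustrative centre ix values, standing in for the rounded
## res$centers output of each clustering "solution"
solution_centres = list( c( 20 , 45 , 70 , 120 ) ,
                         c( 22 , 44 , 71 , 118 ) ,
                         c( 21 , 46 , 69 , 121 ) )

spike_train = rep( 0 , 198 ) ## ix 1 to 198, 22:00 BST Sunday to 06:50 BST Tuesday
for ( centres in solution_centres ) {
  spike_train[ centres ] = spike_train[ centres ] + 1 ## add 1 per solution turn
}
plot( 1 : 198 , spike_train , type = "h" , xlab = "ix" ,
      ylab = "turning point count" )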

The proof of the pudding is in the eating, however, and the following plots show yesterday's (23 November 2020) out of sample EUR_USD forex pair price action with the lines of the above "solution" overlaid. The first plot shows just the K = 8 solution

whilst this second plot has all lines shown.
Given the above caveats that the lines represent probabilities only, it seems uncanny how accurately the major highs and lows of the day are picked out. I only wish I had done this analysis sooner, as then yesterday could have been one of my best trading days ever!

More soon.

Saturday 14 November 2020

Temporal Clustering, Part 3

Continuing on with the subject matter of my last post, the code box below contains R code which is a straightforward refactoring of the Octave code contained in the second code box of that post. This code is my implementation of the cross validation routine described in the paper Cluster Validation by Prediction Strength, but adapted for use in the one dimensional case. I have refactored it into R so that I can use the Ckmeans.1d.dp package for optimal, one dimensional clustering.

library( Ckmeans.1d.dp )

## load the training data from Octave output (comment out as necessary )
data = read.csv( "~/path/to//all_data_matrix" , header = FALSE )

## comment out as necessary
adjust = 0 ## default adjust value
sum_seq = seq( from = 1 , to = 198 , by = 1 ) ; adjust = 1 ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Monday
##sum_seq = seq( from = 115 , to = 342 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Tuesday
##sum_seq = seq( from = 259 , to = 486 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Wednesday
##sum_seq = seq( from = 403 , to = 630 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Thursday
##sum_seq = seq( from = 547 , to = 720 , by = 1 ) ; adjust = 2 ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Friday

## intraday --- comment out or adjust as necessary
##sum_seq = seq( from = 25 , to = 100 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) )

upper_tri_mask = 1 * upper.tri( matrix( 0L , nrow = sum_seq_l , ncol = sum_seq_l ) , diag = FALSE )
no_sample_iters = 1000
max_K = 20
all_k_ps = matrix( 0L , nrow = 1 , ncol = max_K )

for ( iters in 1 : no_sample_iters ) {

## sample the data in data by rows
train_ix = sample( nrow( data ) , size = round( nrow( data ) / 2 ) , replace = FALSE )
train_data = data[ train_ix , sum_seq ] ## extract training data using train_ix rows of data
train_data_sum = colSums( train_data )  ## sum down the columns of train_data
test_data = data[ -train_ix , sum_seq ] ## extract test data using NOT train_ix rows of data
test_data_sum = colSums( test_data )    ## sum down the columns of test_data
## adjust for weekend if necessary
if ( adjust == 1 ) { ## Monday, so correct artifacts of weekend gap
  train_data_sum[ 1 : 5 ] = mean( train_data_sum[ 1 : 48 ] )
  test_data_sum[ 1 : 5 ] = mean( test_data_sum[ 1 : 48 ] )   
} else if ( adjust == 2 ) { ## Friday, so correct artifacts of weekend gap
  train_data_sum[ ( sum_seq_l - 4 ) : sum_seq_l ] = mean( train_data_sum[ ( sum_seq_l - 47 ) : sum_seq_l ] )
  test_data_sum[  ( sum_seq_l - 4 ) : sum_seq_l ] = mean( test_data_sum[ ( sum_seq_l - 47 ) : sum_seq_l ] ) 
}

for ( k in 1 : max_K ) {
  
## K segment train_data_sum
train_res = Ckmeans.1d.dp( sum_seq , k , train_data_sum )
train_out_pairs_mat = matrix( 0L , nrow = sum_seq_l , ncol = sum_seq_l )

## K segment test_data_sum
test_res = Ckmeans.1d.dp( sum_seq , k , test_data_sum )
test_out_pairs_mat = matrix( 0L , nrow = sum_seq_l , ncol = sum_seq_l )

  for ( ii in 1 : length( train_res$centers ) ) {
    ix = which( train_res$cluster == ii )
    train_out_pairs_mat[ ix , ix ] = 1 
    ix = which( test_res$cluster == ii )
    test_out_pairs_mat[ ix , ix ] = 1
    }
  ## coerce to upper triangular matrix
  train_out_pairs_mat = train_out_pairs_mat * upper_tri_mask
  test_out_pairs_mat = test_out_pairs_mat * upper_tri_mask
  
  ## get minimum co-membership cluster proportion
  sample_min_vec = matrix( 0L , nrow = 1 , ncol = length( test_res$centers ) )
  for ( ii in 1 : length( test_res$centers ) ) {
    ix = which( test_res$cluster == ii )
    test_cluster_sum = sum( test_out_pairs_mat[ ix , ix ] )
    train_cluster_sum = sum( test_out_pairs_mat[ ix , ix ] * train_out_pairs_mat[ ix , ix ] )
    sample_min_vec[ , ii ] = train_cluster_sum / test_cluster_sum
  }
  
  ## get min of sample_min_vec
  min_val = min( sample_min_vec[ !is.nan( sample_min_vec ) ] ) ## removing any NaN
  all_k_ps[ , k ] = all_k_ps[ , k ] + min_val

} ## end of K for loop

} ## end of sample loop

all_k_ps = all_k_ps / no_sample_iters ## average values
plot( 1 : length( all_k_ps ) , all_k_ps , "b" , xlab = "Number of Clusters K" , ylab = "Prediction Strength Value" )
abline( h = 0.8 , col = "red" )

The purpose of the cross validation routine is to select the number of clusters K, in the model selection sense, that is best supported by the available data. The paper linked above suggests that the optimal number of clusters is the highest K that has a prediction strength value over some given threshold (e.g. 0.8 or 0.9). The last part of the code plots the number of clusters K (x-axis) against prediction strength (y-axis), along with the threshold value of 0.8 in red. For the particular set of data in question, it can be seen that the optimal K value for the number of clusters is 8.
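In code, the paper's selection rule amounts to a single line, re-using the all_k_ps vector computed above:

## the largest K whose average prediction strength clears the 0.8 threshold
optimal_K = max( which( all_k_ps >= 0.8 ) )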

This second code box re-uses some of the above code to visualise the clusters for a given K,
library( Ckmeans.1d.dp )

## load the training data from Octave output (comment out as necessary )
data = read.csv( "~/path/to/all_data_matrix" , header = FALSE )
data_sum = colSums( data ) ## sum down the columns of data
data_sum[ 1 : 5 ] = mean( data_sum[ 1 : 48 ] ) ## correct artifacts of weekend gap
data_sum[ 716 : 720 ] = mean( data_sum[ 673 : 720 ] ) ## correct artifacts of weekend gap

## comment out as necessary
adjust = 0 ## default adjust value
sum_seq = seq( from = 1 , to = 198 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Monday
##sum_seq = seq( from = 115 , to = 342 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Tuesday
##sum_seq = seq( from = 259 , to = 486 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Wednesday
##sum_seq = seq( from = 403 , to = 630 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Thursday
##sum_seq = seq( from = 547 , to = 720 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Friday

## intraday --- comment out or adjust as necessary
##sum_seq = seq( from = 25 , to = 100 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) )

k = 8
res = Ckmeans.1d.dp( sum_seq , k , data_sum[ sum_seq ] )

plot( sum_seq , data_sum[ sum_seq ], main = "Cluster centres. Cluster centre ix is a predicted turning point",
     col = res$cluster,
     pch = res$cluster, type = "h", xlab = "Count from beginning ix at ix = 1",
     ylab = "Total Counts per ix" )

abline( v = res$centers, col = "chocolate" , lty = "dashed" )

text( res$centers, max(data_sum[sum_seq]) * 0.95, cex = 0.75, font = 2,
      paste( round(res$centers) ) )
a typical plot for which is shown below.
The above plot can be thought of as a clustering at a particular scale, and one can go down in scale by selecting smaller ranges of the data. For example, taking all the data clustered in the 3 clusters centred at x-axis ix values 38, 63 and 89 and re-running the code in the first code box on just this data gives this prediction strength plot, which suggests a K value of 6.
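For those who want to replicate this, the sub-range selection can be sketched as follows, re-using res, sum_seq and data_sum from the second code box:

## select the data belonging to the 3 clusters whose centres round to
## ix values 38, 63 and 89
target_clusters = which( round( res$centers ) %in% c( 38 , 63 , 89 ) )
sub_seq = sum_seq[ res$cluster %in% target_clusters ]

## sub_seq can then replace sum_seq in the first code box's prediction
## strength routine; clustering the sub-range with the suggested K = 6
sub_res = Ckmeans.1d.dp( sub_seq , 6 , data_sum[ sub_seq ] )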
Re-running the code in the second code box plots these 6 clusters thus.

Looking at this last plot, it can be seen that there is a cluster at x-axis ix value 58, which corresponds to 7.30 a.m. London time, and within this green cluster there are 2 distinct peaks which correspond to 7.00 a.m. and 8.00 a.m. A similar visual analysis of the far right cluster, centred at ix = 94, shows a peak at the time of the New York open.

My hypothesis is that by clustering in the above manner it will be possible to identify distinct, intraday times at which the probability of a market turn is greater than at other times. More in due course.

Monday 9 November 2020

A Temporal Clustering Function, Part 2

Further to my previous post, below is an extended version of the "blurred_maxshift_1d_linear" function. This updated version has two extra outputs: a vector of the cluster centre index (ix) values and a vector, the same length as the input data, containing the cluster centre to which each datum has been assigned. These changes have necessitated some extensive re-writing of the function to include various checks contained in nested conditional statements.

## Copyright (C) 2020 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see <https://www.gnu.org/licenses/>.

## -*- texinfo -*- 
## @deftypefn {} {@var{new_train_vec}, @var{cluster_centre_ix}, @var{assigned_cluster_centre_ix} =} blurred_maxshift_1d_linear_V2 (@var{train_vec}, @var{bandwidth})
##
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2020-10-21

function [ new_train_vec , cluster_centre_ix , assigned_cluster_centre_ix ] = blurred_maxshift_1d_linear_V2 ( train_vec , bandwidth )

if ( nargin < 2 )
 bandwidth = 1 ;
endif

if ( numel( train_vec ) < 2 * bandwidth + 1 )
 error( 'Bandwidth too wide for length of train_vec.' ) ;
endif

length_train_vec = numel( train_vec ) ;
new_train_vec = zeros( size( train_vec ) ) ;
assigned_cluster_centre_ix = ( 1 : 1 : length_train_vec ) ;

## initialising loop
## do the beginning 
[ ~ , ix ] = max( train_vec( 1 : 2 * bandwidth + 1 ) ) ;
new_train_vec( ix ) = sum( train_vec( 1 : bandwidth + 1 ) ) ;
assigned_cluster_centre_ix( 1 : bandwidth + 1 ) = ix ;

## and end of train_vec first
[ ~ , ix ] = max( train_vec( end - 2 * bandwidth : end ) ) ;
new_train_vec( end - 2 * bandwidth - 1 + ix ) = sum( train_vec( end - bandwidth : end ) ) ;
assigned_cluster_centre_ix( end - bandwidth : end ) = length_train_vec - 2 * bandwidth - 1 + ix ;

for ii = ( bandwidth + 2 ) : ( length_train_vec - bandwidth - 1 )
 [ ~ , ix ] = max( train_vec( ii - bandwidth : ii + bandwidth ) ) ;
 new_train_vec( ii - bandwidth - 1 + ix ) += train_vec( ii ) ;
 assigned_cluster_centre_ix( ii ) = ii - bandwidth - 1 + ix ; 
endfor
## end of initialising loop

train_vec = new_train_vec ;

## initialise the while condition variable
has_converged = 0 ;

while ( has_converged < 1 )

new_train_vec = zeros( size( train_vec ) ) ;

## do the beginning 
[ ~ , ix ] = max( train_vec( 1 : 2 * bandwidth + 1 ) ) ;
new_train_vec( ix ) += sum( train_vec( 1 : bandwidth + 1 ) ) ;
assigned_cluster_centre_ix( 1 : bandwidth + 1 ) = ix ;

## and end of train_vec first
[ ~ , ix ] = max( train_vec( end - 2 * bandwidth : end ) ) ;
new_train_vec( end - 2 * bandwidth - 1 + ix ) += sum( train_vec( end - bandwidth : end ) ) ;
assigned_cluster_centre_ix( end - bandwidth : end ) = length_train_vec - 2 * bandwidth - 1 + ix ;

for ii = ( bandwidth + 2 ) : ( length_train_vec - bandwidth - 1 )

 [ max_val , ix ] = max( train_vec( ii - bandwidth : ii + bandwidth ) ) ;
 ## check for ties in max_val value in window
 no_ties = sum( train_vec( ii - bandwidth : ii + bandwidth ) == max_val ) ;

  if ( no_ties == 1 && max_val == train_vec( ii ) && ix == bandwidth + 1 ) ## main if
   ## value in train_vec(ii) is max val of window, with no ties
   new_train_vec( ii ) += train_vec( ii ) ;
   assigned_cluster_centre_ix( ii ) = ii ;

  elseif ( no_ties == 1 && max_val != train_vec( ii ) && ix != bandwidth + 1 ) ## main if
   ## no ties for max_val, but need to move data at ii and change ix
   ## get assigned_cluster_centre_ix that point to ii, which needs to be updated
   assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
    if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
     assigned_cluster_centre_ix( assigned_ix ) = ii - ( bandwidth + 1 ) + ix ;
    elseif ( isempty( assigned_ix ) ) ## but cheap insurance
     assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + ix ;
    endif
   new_train_vec( ii - ( bandwidth + 1 ) + ix ) += train_vec( ii ) ;

  elseif ( no_ties > 1 && max_val > train_vec( ii ) ) ## main if
   ## 2 ties for max_val, which is > val at ii, need to move data at ii 
   ## to the closer max_val ix and change ix in assigned_cluster_centre_ix
   match_max_val_ix = find( train_vec( ii - bandwidth : ii + bandwidth ) == max_val ) ;

    if ( numel( match_max_val_ix ) == 2 ) ## only 2 matching max vals
     centre_window_dist = ( bandwidth + 1 ) .- match_max_val_ix ;

           if ( abs( centre_window_dist( 1 ) ) == abs( centre_window_dist( 2 ) ) ) 
            ## equally distant from centre ii of moving window

                  assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
                   if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
                    ix_before = find( assigned_ix < ii ) ;
                    ix_after = find( assigned_ix > ii ) ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( 1 ) ) += train_vec( ii ) / 2 ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( 2 ) ) += train_vec( ii ) / 2 ;
                    assigned_cluster_centre_ix( assigned_ix( ix_before ) ) = ii - ( bandwidth + 1 ) + match_max_val_ix( 1 ) ;
                    assigned_cluster_centre_ix( assigned_ix( ix_after ) ) = ii - ( bandwidth + 1 ) + match_max_val_ix( 2 ) ;
                    assigned_cluster_centre_ix( ii ) = ii ; ## bit of a kluge
                   elseif ( isempty( assigned_ix ) ) ## but cheap insurance
                    ## no other assigned_cluster_centre_ix values to account for, so just split equally
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( 1 ) ) += train_vec( ii ) / 2 ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( 2 ) ) += train_vec( ii ) / 2 ;
                    assigned_cluster_centre_ix( ii ) = ii ; ## bit of a kluge
                   else
                    error( 'There is an unknown error in instance ==2 matching max_vals with equal distances to centre of moving window with assigned_ix. Write code to deal with this edge case.' ) ;
                   endif

           else ## not equally distant from centre ii of moving window

                  assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
                   if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
                    ## There is an instance == 2 matching max_vals with non equal distances to centre of moving window with previously assigned_ix to ii ix
                    ## Assign all assigned_ix to the nearest max value ix
                    [ ~ , min_val_ix ] = min( [ abs( centre_window_dist( 1 ) ) abs( centre_window_dist( 2 ) ) ] ) ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ) += train_vec( ii ) ;
                    assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ;
                    assigned_cluster_centre_ix( assigned_ix ) = ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ;
                   elseif ( isempty( assigned_ix ) ) ## but cheap insurance
                    [ ~ , min_val_ix ] = min( abs( centre_window_dist ) ) ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ) += train_vec( ii ) ;
                    assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ;
                   else
                    error( 'There is an unknown error in instance of ==2 matching max_vals with unequal distances. Write the code to deal with this edge case.' ) ;
                   endif

           endif ## 

    elseif ( numel( match_max_val_ix ) > 2  ) ## There is an instance of >2 matching max_vals.
    ## There must be one max val closer than the others or two equally close
     centre_window_dist = abs( ( bandwidth + 1 ) .- match_max_val_ix ) ;
     centre_window_dist_min = min( centre_window_dist ) ;
     centre_window_dist_min_ix = find( centre_window_dist == centre_window_dist_min ) ;
     
       if ( numel( centre_window_dist_min_ix ) == 1 ) ## there is one closest ix
        assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
        
            if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ) += train_vec( ii ) ;  
               assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ;
               assigned_cluster_centre_ix( assigned_ix ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ;
            elseif ( isempty( assigned_ix ) ) ## but cheap insurance
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ) += train_vec( ii ) ;  
               assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ;
            endif
         
       elseif ( numel( centre_window_dist_min_ix ) == 2 ) ## there are 2 equally close ix
        assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
        
            if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
               ix_before = find( assigned_ix < ii ) ;
               ix_after = find( assigned_ix > ii ) ;
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 1 ) ) += train_vec( ii ) / 2 ;
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 2 ) ) += train_vec( ii ) / 2 ;
               assigned_cluster_centre_ix( assigned_ix( ix_before ) ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 1 ) ;
               assigned_cluster_centre_ix( assigned_ix( ix_after ) ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 2 ) ;
               assigned_cluster_centre_ix( ii ) = ii ; ## bit of a kluge             
            elseif ( isempty( assigned_ix ) ) ## but cheap insurance 
               ## no other assigned_cluster_centre_ix values to account for, so just split equally
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 1 ) ) += train_vec( ii ) / 2 ;
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 2 ) ) += train_vec( ii ) / 2 ;
               assigned_cluster_centre_ix( ii ) = ii ; ## bit of a kluge
            endif
        
       else
        error( 'Unknown error in numel( match_max_val_ix ) > 2.' ) ;
       endif
     ##error( 'There is an instance of >2 matching max_vals. Write the code to deal with this edge case.' ) ;
    else
     error( 'There is an unknown error in instance of >2 matching max_vals. Write the code to deal with this edge case.' ) ;
    endif

  endif ## main if end

endfor

if ( sum( ( train_vec == new_train_vec ) ) == length_train_vec )
 has_converged = 1 ;
else
 train_vec = new_train_vec ;
endif

endwhile

cluster_centre_ix = unique( assigned_cluster_centre_ix ) ;
cluster_centre_ix( cluster_centre_ix == 0 ) = [] ;

endfunction

The reason for this re-write was to accommodate a cross validation routine, described in the paper Cluster Validation by Prediction Strength, a simple outline of which is given in this stackexchange.com answer.

My Octave code implementation of this is shown in the code box below. It is not exactly as described in the above paper because the number of clusters, K, is never explicitly specified; the above function automatically determines K based on the data. The routine below is therefore perhaps more accurately described as being inspired by the original paper.

## create train and test data sets
########## UNCOMMENT AS NECESSARY #####
time_ix = [ 1 : 198 ] ; ## Monday
##time_ix = [ 115 : 342 ] ; ## Tuesday
##time_ix = [ 259 : 486 ] ; ## Wednesday
##time_ix = [ 403 : 630 ] ; ## Thursday
##time_ix = [ 547 : 720 ] ; ## Friday
##time_ix = [ 1 : 720 ] ; ## all data
#######################################

all_cv_solutions = zeros( size( data_matrix , 3 ) , size( data_matrix , 3 ) ) ;

n_iters = 1 ;
for iter = 1 : n_iters ## related to # of rand_ix sets generated 
 
rand_ix = randperm( size( data_matrix , 1 ) ) ; 
train_ix = rand_ix( 1 : round( numel( rand_ix ) * 0.5 ) ) ;
test_ix = rand_ix( round( numel( rand_ix ) * 0.5 ) + 1 : end ) ;
train_data_matrix = sum( data_matrix( train_ix , time_ix , : ) ) ;
test_data_matrix = sum( data_matrix( test_ix , time_ix , : ) ) ;

all_proportions_indicated = zeros( 1 , size( data_matrix , 3 ) ) ;

for cv_ix = 1 : size( data_matrix , 3 ) ; ## related to delta_turning_point_filter n_bar parameter

for bandwidth = 1 : size( data_matrix , 3 )

## train set clustering
if ( bandwidth == 1 )
[ train_out , cluster_centre_ix_train , assigned_cluster_centre_ix_train ] = blurred_maxshift_1d_linear_V2( train_data_matrix(:,:,cv_ix) , bandwidth ) ;
elseif( bandwidth > 1 )
[ train_out , cluster_centre_ix_train , assigned_cluster_centre_ix_train ] = blurred_maxshift_1d_linear_V2( train_out , bandwidth ) ;
endif
train_out_pairs_mat = zeros( numel( assigned_cluster_centre_ix_train ) , numel( assigned_cluster_centre_ix_train ) ) ;
 for ii = 1 : numel( cluster_centre_ix_train )
  cc_ix = find( assigned_cluster_centre_ix_train == cluster_centre_ix_train( ii ) ) ;
  train_out_pairs_mat( cc_ix , cc_ix ) = 1 ;
 endfor
train_out_pairs_mat = triu( train_out_pairs_mat , 1 ) ; ## get upper diagonal matrix

## test set clustering
if ( bandwidth == 1 )
[ test_out , cluster_centre_ix_test , assigned_cluster_centre_ix_test ] = blurred_maxshift_1d_linear_V2( test_data_matrix(:,:,cv_ix) , bandwidth ) ;
elseif( bandwidth > 1 )
[ test_out , cluster_centre_ix_test , assigned_cluster_centre_ix_test ] = blurred_maxshift_1d_linear_V2( test_out , bandwidth ) ;
endif

all_test_out_pairs_mat = zeros( numel( assigned_cluster_centre_ix_test ) , numel( assigned_cluster_centre_ix_test ) ) ;
test_out_pairs_clusters_proportions = ones( 1 , numel( cluster_centre_ix_test ) ) ;
 for ii = 1 : numel( cluster_centre_ix_test )
  cc_ix = find( assigned_cluster_centre_ix_test == cluster_centre_ix_test( ii ) ) ;
  all_test_out_pairs_mat( cc_ix , cc_ix ) = 1 ;
  test_out_pairs_mat = all_test_out_pairs_mat( cc_ix , cc_ix ) ;
  test_out_pairs_mat = triu( test_out_pairs_mat , 1 ) ; ## get upper diagonal matrix
  test_out_pairs_mat_sum = sum( sum( test_out_pairs_mat ) ) ;
   if ( test_out_pairs_mat_sum > 0 )
   test_out_pairs_clusters_proportions( ii ) = sum( sum( train_out_pairs_mat( cc_ix , cc_ix ) .* test_out_pairs_mat ) ) / ...
                                                test_out_pairs_mat_sum ;
   endif
 endfor

all_proportions_indicated( bandwidth ) = min( test_out_pairs_clusters_proportions ) ;
all_cv_solutions( bandwidth , cv_ix ) += all_proportions_indicated( bandwidth ) ; 

endfor ## bandwidth for loop

endfor ## end of cv_ix loop

endfor ## end of iter for

all_cv_solutions = all_cv_solutions ./ n_iters ;

surf( all_cv_solutions ) ; xlabel( 'BANDWIDTH' , 'fontsize' , 20 ) ; ylabel( 'CV IX' , 'fontsize' , 20 ) ;
I won't discuss the workings of the code any further, as readers are free to read the original paper and my interpretation of it in code. A typical surface plot of the output is shown below.

The "bandwidth" plotted along the front edge of the surface plot is one of the input parameters to the "blurred_maxshift_1d_linear" function, whilst the "lookback" is a parameter of the original data generating function which identifies local highs and lows in a price time series. There appears to be a distinct "elbow" at "lookback" = 6 which is more or less consistent for all values of "bandwidth." Since the underlying data for this is 10 minute OHLC bars, the ideal "lookback" would, therefore, appear to be on the hourly timeframe.

However, having spent some considerable time and effort to get the above working satisfactorily, I am now not so sure that I'll actually use the above code. The reason for this is shown in the following animated GIF.

This shows K segmentation of the exact same data used above, from K = 1 to 19 inclusive, using R and its Ckmeans.1d.dp package, with a vignette tutorial here. I am particularly attracted to this because of its speed compared to my code above, as well as its guarantees with regard to optimality and reproducibility. If one stares at the GIF for long enough, one can see possibly significant clusters at index values (x-axis) which correspond, approximately, to particularly significant times such as London and New York market opening and closing times: ix = 55, 85, 115 and 145.
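For readers wanting to reproduce something like the GIF, below is a minimal R sketch; counts is assumed to be the vector of summed turning point counts per ix, and the individual PNG frames can then be stitched together with an external tool such as ImageMagick:

library( Ckmeans.1d.dp )

## counts is assumed to be the vector of summed turning point counts per ix,
## i.e. the same data segmented in the GIF above
for ( k in 1 : 19 ) {
  res = Ckmeans.1d.dp( seq_along( counts ) , k , counts ) ## K segmentation
  png( sprintf( "cluster_k_%02d.png" , k ) )              ## one frame per K
  plot( seq_along( counts ) , counts , type = "h" , col = res$cluster ,
        main = paste( "K =" , k ) , xlab = "ix" , ylab = "counts" )
  abline( v = res$centers , col = "chocolate" , lty = "dashed" )
  dev.off()
}
## e.g. stitch the frames together with: convert -delay 100 cluster_k_*.png clusters.gif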

More about this in my next post.