Monday 9 November 2020

A Temporal Clustering Function, Part 2

Further to my previous post, below is an extended version of the "blurred_maxshift_1d_linear" function. This updated version has two extra outputs: a vector of the cluster centre index ix values and a vector the same length as the input data with the cluster centres to which each datum has been assigned. These changes have necessitated some extensive re-writing of the function to include various checks contained in nested conditional statements.

## Copyright (C) 2020 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## .

## -*- texinfo -*- 
## @deftypefn {} {@var{train_vec}, @var{cluster_centre_ix}, @var{assigned_cluster_centre_ix} =} blurred_maxshift_1d_linear_V2 (@var{train_vec}, @var{bandwidth})
##
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2020-10-21

function [ new_train_vec , cluster_centre_ix , assigned_cluster_centre_ix ] = blurred_maxshift_1d_linear_V2 ( train_vec , bandwidth )

if ( nargin < 2 )
 bandwidth = 1 ;
endif

if ( numel( train_vec ) < 2 * bandwidth + 1 )
 error( 'Bandwidth too wide for length of train_vec.' ) ;
endif

length_train_vec = numel( train_vec ) ;
new_train_vec = zeros( size( train_vec ) ) ;
assigned_cluster_centre_ix = ( 1 : 1 : length_train_vec ) ;

## initialising loop
## do the beginning 
[ ~ , ix ] = max( train_vec( 1 : 2 * bandwidth + 1 ) ) ;
new_train_vec( ix ) = sum( train_vec( 1 : bandwidth + 1 ) ) ;
assigned_cluster_centre_ix( 1 : bandwidth + 1 ) = ix ;

## and end of train_vec first
[ ~ , ix ] = max( train_vec( end - 2 * bandwidth : end ) ) ;
new_train_vec( end - 2 * bandwidth - 1 + ix ) = sum( train_vec( end - bandwidth : end ) ) ;
assigned_cluster_centre_ix( end - bandwidth : end ) = length_train_vec - 2 * bandwidth - 1 + ix ;

for ii = ( bandwidth + 2 ) : ( length_train_vec - bandwidth - 1 )
 [ ~ , ix ] = max( train_vec( ii - bandwidth : ii + bandwidth ) ) ;
 new_train_vec( ii - bandwidth - 1 + ix ) += train_vec( ii ) ;
 assigned_cluster_centre_ix( ii ) = ii - bandwidth - 1 + ix ; 
endfor
## end of initialising loop

train_vec = new_train_vec ;

## initialise the while condition variable
has_converged = 0 ;

while ( has_converged < 1 )

new_train_vec = zeros( size( train_vec ) ) ;

## do the beginning 
[ ~ , ix ] = max( train_vec( 1 : 2 * bandwidth + 1 ) ) ;
new_train_vec( ix ) += sum( train_vec( 1 : bandwidth + 1 ) ) ;
assigned_cluster_centre_ix( 1 : bandwidth + 1 ) = ix ;

## and end of train_vec first
[ ~ , ix ] = max( train_vec( end - 2 * bandwidth : end ) ) ;
new_train_vec( end - 2 * bandwidth - 1 + ix ) += sum( train_vec( end - bandwidth : end ) ) ;
assigned_cluster_centre_ix( end - bandwidth : end ) = length_train_vec - 2 * bandwidth - 1 + ix ;

for ii = ( bandwidth + 2 ) : ( length_train_vec - bandwidth - 1 )

 [ max_val , ix ] = max( train_vec( ii - bandwidth : ii + bandwidth ) ) ;
 ## check for ties in max_val value in window
 no_ties = sum( train_vec( ii - bandwidth : ii + bandwidth ) == max_val ) ;

  if ( no_ties == 1 && max_val == train_vec( ii ) && ix == bandwidth + 1 ) ## main if
   ## value in train_vec(ii) is max val of window, with no ties
   new_train_vec( ii ) += train_vec( ii ) ;
   assigned_cluster_centre_ix( ii ) = ii ;

  elseif ( no_ties == 1 && max_val != train_vec( ii ) && ix != bandwidth + 1 ) ## main if
   ## no ties for max_val, but need to move data at ii and change ix
   ## get assigned_cluster_centre_ix that point to ii, which needs to be updated
   assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
    if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
     assigned_cluster_centre_ix( assigned_ix ) = ii - ( bandwidth + 1 ) + ix ;
    elseif ( isempty( assigned_ix ) ) ## but cheap insurance
     assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + ix ;
    endif
   new_train_vec( ii - ( bandwidth + 1 ) + ix ) += train_vec( ii ) ;

  elseif ( no_ties > 1 && max_val > train_vec( ii ) ) ## main if
   ## 2 ties for max_val, which is > val at ii, need to move data at ii 
   ## to the closer max_val ix and change ix in assigned_cluster_centre_ix
   match_max_val_ix = find( train_vec( ii - bandwidth : ii + bandwidth ) == max_val ) ;

    if ( numel( match_max_val_ix ) == 2 ) ## only 2 matching max vals
     centre_window_dist = ( bandwidth + 1 ) .- match_max_val_ix ;

           if ( abs( centre_window_dist( 1 ) ) == abs( centre_window_dist( 2 ) ) ) 
            ## equally distant from centre ii of moving window

                  assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
                   if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
                    ix_before = find( assigned_ix < ii ) ;
                    ix_after = find( assigned_ix > ii ) ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( 1 ) ) += train_vec( ii ) / 2 ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( 2 ) ) += train_vec( ii ) / 2 ;
                    assigned_cluster_centre_ix( assigned_ix( ix_before ) ) = ii - ( bandwidth + 1 ) + match_max_val_ix( 1 ) ;
                    assigned_cluster_centre_ix( assigned_ix( ix_after ) ) = ii - ( bandwidth + 1 ) + match_max_val_ix( 2 ) ;
                    assigned_cluster_centre_ix( ii ) = ii ; ## bit of a kluge
                   elseif ( isempty( assigned_ix ) ) ## but cheap insurance
                    ## no other assigned_cluster_centre_ix values to account for, so just split equally
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( 1 ) ) += train_vec( ii ) / 2 ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( 2 ) ) += train_vec( ii ) / 2 ;
                    assigned_cluster_centre_ix( ii ) = ii ; ## bit of a kluge
                   else
                    error( 'There is an unknown error in instance ==2 matching max_vals with equal distances to centre of moving window with assigned_ix. Write code to deal with this edge case.' ) ;
                   endif

           else ## not equally distant from centre ii of moving window

                  assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
                   if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
                    ## There is an instance == 2 matching max_vals with non equal distances to centre of moving window with previously assigned_ix to ii ix
                    ## Assign all assigned_ix to the nearest max value ix
                    [ ~ , min_val_ix ] = min( [ abs( centre_window_dist( 1 ) ) abs( centre_window_dist( 2 ) ) ] ) ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ) += train_vec( ii ) ;
                    assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ;
                    assigned_cluster_centre_ix( assigned_ix ) = ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ;
                   elseif ( isempty( assigned_ix ) ) ## but cheap insurance
                    [ ~ , min_val_ix ] = min( abs( centre_window_dist ) ) ;
                    new_train_vec( ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ) += train_vec( ii ) ;
                    assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + match_max_val_ix( min_val_ix ) ;
                   else
                    error( 'There is an unknown error in instance of ==2 matching max_vals with unequal distances. Write the code to deal with this edge case.' ) ;
                   endif

           endif ## 

    elseif ( numel( match_max_val_ix ) > 2  ) ## There is an instance of >2 matching max_vals.
    ## There must be one max val closer than the others or two equally close
     centre_window_dist = abs( ( bandwidth + 1 ) .- match_max_val_ix ) ;
     centre_window_dist_min = min( centre_window_dist ) ;
     centre_window_dist_min_ix = find( centre_window_dist == centre_window_dist_min ) ;
     
       if ( numel( centre_window_dist_min_ix ) == 1 ) ## there is one closet ix
        assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
        
            if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ) += train_vec( ii ) ;  
               assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ;
               assigned_cluster_centre_ix( assigned_ix ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ;
            elseif ( isempty( assigned_ix ) ) ## but cheap insurance
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ) += train_vec( ii ) ;  
               assigned_cluster_centre_ix( ii ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix ;
            endif
         
       elseif ( numel( centre_window_dist_min_ix ) == 2 ) ## there are 2 equally close ix
        assigned_ix = find( assigned_cluster_centre_ix == ii ) ;
        
            if ( !isempty( assigned_ix ) ) ## should always be true because at least the one original ii == ii
               ix_before = find( assigned_ix < ii ) ;
               ix_after = find( assigned_ix > ii ) ;
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 1 ) ) += train_vec( ii ) / 2 ;
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 2 ) ) += train_vec( ii ) / 2 ;
               assigned_cluster_centre_ix( assigned_ix( ix_before ) ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 1 ) ;
               assigned_cluster_centre_ix( assigned_ix( ix_after ) ) = ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 2 ) ;
               assigned_cluster_centre_ix( ii ) = ii ; ## bit of a kluge             
            elseif ( isempty( assigned_ix ) ) ## but cheap insurance 
               ## no other assigned_cluster_centre_ix values to account for, so just split equally
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 1 ) ) += train_vec( ii ) / 2 ;
               new_train_vec( ii - ( bandwidth + 1 ) + centre_window_dist_min_ix( 2 ) ) += train_vec( ii ) / 2 ;
               assigned_cluster_centre_ix( ii ) = ii ; ## bit of a kluge
            endif
        
       else
        error( 'Unknown error in numel( match_max_val_ix ) > 2.' ) ;
       endif
     ##error( 'There is an instance of >2 matching max_vals. Write the code to deal with this edge case.' ) ;
    else
     error( 'There is an unknown error in instance of >2 matching max_vals. Write the code to deal with this edge case.' ) ;
    endif

  endif ## main if end

endfor

if ( sum( ( train_vec == new_train_vec ) ) == length_train_vec )
 has_converged = 1 ;
else
 train_vec = new_train_vec ;
endif

endwhile

cluster_centre_ix = unique( assigned_cluster_centre_ix ) ;
cluster_centre_ix( cluster_centre_ix == 0 ) = [] ;

endfunction

The reason for this re-write was to accommodate a cross validation routine, which is described in the paper Cluster Validation by Prediction Strength, and a simple outline of which is given in this stackexchange.com answer.

My Octave code implementation of this is shown in the code box below. This is not exactly as described in the above paper because the number of clusters, K, is not exactly specified due to the above function automatically determining K based on the data. The routine below is perhaps more accurately described as being inspired by the original paper.

## create train and test data sets
########## UNCOMMENT AS NECESSARY #####
time_ix = [ 1 : 198 ] ; ## Monday
##time_ix = [ 115 : 342 ] ; ## Tuesday
##time_ix = [ 259 : 486 ] ; ## Wednesday
##time_ix = [ 403 : 630 ] ; ## Thursday
##time_ix = [ 547 : 720 ] ; ## Friday
##time_ix = [ 1 : 720 ] ; ## all data
#######################################

all_cv_solutions = zeros( size( data_matrix , 3 ) , size( data_matrix , 3 ) ) ;

n_iters = 1 ;
for iter = 1 : n_iters ## related to # of rand_ix sets generated 
 
rand_ix = randperm( size( data_matrix , 1 ) ) ; 
train_ix = rand_ix( 1 : round( numel( rand_ix ) * 0.5 ) ) ;
test_ix = rand_ix( round( numel( rand_ix ) * 0.5 ) + 1 : end ) ;
train_data_matrix = sum( data_matrix( train_ix , time_ix , : ) ) ;
test_data_matrix = sum( data_matrix( test_ix , time_ix , : ) ) ;

all_proportions_indicated = zeros( 1 , size( data_matrix , 3 ) ) ;

for cv_ix = 1 : size( data_matrix , 3 ) ; ## related to delta_turning_point_filter n_bar parameter

for bandwidth = 1 : size( data_matrix , 3 )

## train set clustering
if ( bandwidth == 1 )
[ train_out , cluster_centre_ix_train , assigned_cluster_centre_ix_train ] = blurred_maxshift_1d_linear_V2( train_data_matrix(:,:,cv_ix) , bandwidth ) ;
elseif( bandwidth > 1 )
[ train_out , cluster_centre_ix_train , assigned_cluster_centre_ix_train ] = blurred_maxshift_1d_linear_V2( train_out , bandwidth ) ;
endif
train_out_pairs_mat = zeros( numel( assigned_cluster_centre_ix_train ) , numel( assigned_cluster_centre_ix_train ) ) ;
 for ii = 1 : numel( cluster_centre_ix_train )
  cc_ix = find( assigned_cluster_centre_ix_train == cluster_centre_ix_train( ii ) ) ;
  train_out_pairs_mat( cc_ix , cc_ix ) = 1 ;
 endfor
train_out_pairs_mat = triu( train_out_pairs_mat , 1 ) ; ## get upper diagonal matrix

## test set clustering
if ( bandwidth == 1 )
[ test_out , cluster_centre_ix_test , assigned_cluster_centre_ix_test ] = blurred_maxshift_1d_linear_V2( test_data_matrix(:,:,cv_ix) , bandwidth ) ;
elseif( bandwidth > 1 )
[ test_out , cluster_centre_ix_test , assigned_cluster_centre_ix_test ] = blurred_maxshift_1d_linear_V2( test_out , bandwidth ) ;
endif

all_test_out_pairs_mat = zeros( numel( assigned_cluster_centre_ix_test ) , numel( assigned_cluster_centre_ix_test ) ) ;
test_out_pairs_clusters_proportions = ones( 1 , numel( cluster_centre_ix_test ) ) ;
 for ii = 1 : numel( cluster_centre_ix_test )
  cc_ix = find( assigned_cluster_centre_ix_test == cluster_centre_ix_test( ii ) ) ;
  all_test_out_pairs_mat( cc_ix , cc_ix ) = 1 ;
  test_out_pairs_mat = all_test_out_pairs_mat( cc_ix , cc_ix ) ;
  test_out_pairs_mat = triu( test_out_pairs_mat , 1 ) ; ## get upper diagonal matrix
  test_out_pairs_mat_sum = sum( sum( test_out_pairs_mat ) ) ;
   if ( test_out_pairs_mat_sum > 0 )
   test_out_pairs_clusters_proportions( ii ) = sum( sum( train_out_pairs_mat( cc_ix , cc_ix ) .* test_out_pairs_mat ) ) / ...
                                                test_out_pairs_mat_sum ;
   endif
 endfor

all_proportions_indicated( bandwidth ) = min( test_out_pairs_clusters_proportions ) ;
all_cv_solutions( bandwidth , cv_ix ) += all_proportions_indicated( bandwidth ) ; 

endfor ## bandwidth for loop

endfor ## end of cv_ix loop

endfor ## end of iter for

all_cv_solutions = all_cv_solutions ./ n_iters ;

surf( all_cv_solutions ) ; xlabel( 'BANDWIDTH' , 'fontsize' , 20 ) ; ylabel( 'CV IX' , 'fontsize' , 20 ) ;
I won't discuss the workings of the code any further as readers are free to read the original paper and my code interpretation of it. A typical surface plot of the output is shown below.

The "bandwidth" plotted along the front edge of the surface plot is one of the input parameters to the "blurred_maxshift_1d_linear" function, whilst the "lookback" is a parameter of the original data generating function which identifies local highs and lows in a price time series. There appears to be a distinct "elbow" at "lookback" = 6 which is more or less consistent for all values of "bandwidth." Since the underlying data for this is 10 minute OHLC bars, the ideal "lookback" would, therefore, appear to be on the hourly timeframe.

However, having spent some considerable time and effort to get the above working satisfactorily, I am now not so sure that I'll actually use the above code. The reason for this is shown in the following animated GIF.

This shows K segmentation of the exact same data used above, from K = 1 to 19 inclusive, using R and its Ckmeans.1d.dp package, with vignette tutorial here. I am particularly attracted to this because of its speed, compared to my code above, as well as its guarantees with regard to optimality and reproducability. If one stares at the GIF for long enough one can see possible, significant clusters at index values (x-axis) which correspond, approximately, to particularly significant times such as London and New York market opening and closing times: ix = 55, 85, 115 and 145.

More about this in my next post. 

No comments: