Saturday 14 November 2020

Temporal Clustering, Part 3

Continuing with the subject of my last post, the code box below contains R code which is a straightforward refactoring of the Octave code in the second code box of that post. It is my implementation of the cross validation routine described in the paper Cluster Validation by Prediction Strength, adapted for use in the one dimensional case. I have refactored it into R so that I can use the Ckmeans.1d.dp package for optimal, one dimensional clustering.
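For reference, for a given K the prediction strength is the minimum, over the test clusters, of the proportion of pairs of points co-clustered in the test clustering that are also co-clustered in the training clustering. The helper function below is only my own illustrative sketch of that calculation (its name and arguments are not part of the code in the code boxes); it assumes, as the code below does, that both clusterings are over the same set of indices.

prediction_strength = function( train_cluster , test_cluster ) {
  ## TRUE where a pair of points shares a cluster under each clustering
  same_train = outer( train_cluster , train_cluster , "==" )
  same_test = outer( test_cluster , test_cluster , "==" )
  per_cluster = sapply( unique( test_cluster ) , function( k ) {
    ix = which( test_cluster == k )
    pairs = upper.tri( same_test[ ix , ix , drop = FALSE ] ) ## distinct pairs within this test cluster
    if ( sum( pairs ) == 0 ) return( NA ) ## singleton cluster, no pairs to check
    ## proportion of those pairs also co-clustered in the training clustering
    sum( same_train[ ix , ix , drop = FALSE ][ pairs ] ) / sum( pairs )
  } )
  min( per_cluster , na.rm = TRUE ) ## prediction strength is the worst-case test cluster
}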

library( Ckmeans.1d.dp )

## load the training data from Octave output (comment out as necessary )
data = read.csv( "~/path/to/all_data_matrix" , header = FALSE )

## comment out as necessary
adjust = 0 ## default adjust value
sum_seq = seq( from = 1 , to = 198 , by = 1 ) ; adjust = 1 ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Monday
##sum_seq = seq( from = 115 , to = 342 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Tuesday
##sum_seq = seq( from = 115 , to = 342 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Wednesday
##sum_seq = seq( from = 115 , to = 342 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Thursday
##sum_seq = seq( from = 547 , to = 720 , by = 1 ) ; adjust = 2 ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Friday

## intraday --- comment out or adjust as necessary
##sum_seq = seq( from = 25 , to = 100 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) )

upper_tri_mask = 1 * upper.tri( matrix( 0L , nrow = sum_seq_l , ncol = sum_seq_l ) , diag = FALSE )
no_sample_iters = 1000
max_K = 20
all_k_ps = matrix( 0L , nrow = 1 , ncol = max_K )

for ( iters in 1 : no_sample_iters ) {

## sample the data in data by rows
train_ix = sample( nrow( data ) , size = round( nrow( data ) / 2 ) , replace = FALSE )
train_data = data[ train_ix , sum_seq ] ## extract training data using train_ix rows of data
train_data_sum = colSums( train_data )  ## sum down the columns of train_data
test_data = data[ -train_ix , sum_seq ] ## extract test data using NOT train_ix rows of data
test_data_sum = colSums( test_data )    ## sum down the columns of test_data
## adjust for weekend if necessary
if ( adjust == 1 ) { ## Monday, so correct artifacts of weekend gap
  train_data_sum[ 1 : 5 ] = mean( train_data_sum[ 1 : 48 ] )
  test_data_sum[ 1 : 5 ] = mean( test_data_sum[ 1 : 48 ] )   
} else if ( adjust == 2 ) { ## Friday, so correct artifacts of weekend gap
  train_data_sum[ ( sum_seq_l - 4 ) : sum_seq_l ] = mean( train_data_sum[ ( sum_seq_l - 47 ) : sum_seq_l ] )
  test_data_sum[  ( sum_seq_l - 4 ) : sum_seq_l ] = mean( test_data_sum[ ( sum_seq_l - 47 ) : sum_seq_l ] ) 
}

for ( k in 1 : max_K ) {
  
## K segment train_data_sum
train_res = Ckmeans.1d.dp( sum_seq , k , train_data_sum )
train_out_pairs_mat = matrix( 0L , nrow = sum_seq_l , ncol = sum_seq_l )

## K segment test_data_sum
test_res = Ckmeans.1d.dp( sum_seq , k , test_data_sum )
test_out_pairs_mat = matrix( 0L , nrow = sum_seq_l , ncol = sum_seq_l )

  for ( ii in 1 : length( train_res$centers ) ) {
    ix = which( train_res$cluster == ii )
    train_out_pairs_mat[ ix , ix ] = 1 
    ix = which( test_res$cluster == ii )
    test_out_pairs_mat[ ix , ix ] = 1
    }
  ## coerce to upper triangular matrix
  train_out_pairs_mat = train_out_pairs_mat * upper_tri_mask
  test_out_pairs_mat = test_out_pairs_mat * upper_tri_mask
  
  ## get minimum co-membership cluster proportion
  sample_min_vec = matrix( 0L , nrow = 1 , ncol = length( test_res$centers ) )
  for ( ii in 1 : length( test_res$centers ) ) {
    ix = which( test_res$cluster == ii )
    test_cluster_sum = sum( test_out_pairs_mat[ ix , ix ] )
    train_cluster_sum = sum( test_out_pairs_mat[ ix , ix ] * train_out_pairs_mat[ ix , ix ] )
    sample_min_vec[ , ii ] = train_cluster_sum / test_cluster_sum
  }
  
  ## get min of sample_min_vec
  min_val = min( sample_min_vec[ !is.nan( sample_min_vec ) ] ) ## removing any NaN
  all_k_ps[ , k ] = all_k_ps[ , k ] + min_val

} ## end of K for loop

} ## end of sample loop

all_k_ps = all_k_ps / no_sample_iters ## average values
plot( 1 : length( all_k_ps ) , all_k_ps , type = "b" , xlab = "Number of Clusters K" , ylab = "Prediction Strength Value" )
abline( h = 0.8 , col = "red" )

The purpose of the cross validation routine is to select the number of clusters K, in the model selection sense, that is best supported by the available data. The paper linked above suggests that the optimal number of clusters is the largest K whose prediction strength exceeds some given threshold (e.g. 0.8 or 0.9). The last part of the code plots the number of clusters K (x-axis) against the average prediction strength (y-axis), along with the threshold value of 0.8 in red. For the particular set of data in question, it can be seen that the optimal number of clusters is K = 8.
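The same selection rule can also be applied programmatically to the averaged all_k_ps values rather than being read off the plot; a minimal sketch, assuming the 0.8 threshold used above and that at least one K exceeds it:

threshold = 0.8
optimal_K = max( which( all_k_ps >= threshold ) ) ## largest K with average prediction strength above the threshold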

This second code box re-uses some of the above code to visualise the clusters for a given K,

library( Ckmeans.1d.dp )

## load the training data from Octave output (comment out as necessary )
data = read.csv( "~/path/to/all_data_matrix" , header = FALSE )
data_sum = colSums( data ) ## sum down the columns of data
data_sum[ 1 : 5 ] = mean( data_sum[ 1 : 48 ] ) ## correct artifacts of weekend gap
data_sum[ 716 : 720 ] = mean( data_sum[ 673 : 720 ] ) ## correct artifacts of weekend gap

## comment out as necessary
adjust = 0 ## default adjust value
sum_seq = seq( from = 1 , to = 198 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Monday
##sum_seq = seq( from = 115 , to = 342 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Tuesday
##sum_seq = seq( from = 115 , to = 342 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Wednesday
##sum_seq = seq( from = 115 , to = 342 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Thursday
##sum_seq = seq( from = 547 , to = 720 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) ) ## Friday

## intraday --- comment out or adjust as necessary
##sum_seq = seq( from = 25 , to = 100 , by = 1 ) ; sum_seq_l = as.numeric( length( sum_seq ) )

k = 8
res = Ckmeans.1d.dp( sum_seq , k , data_sum[ sum_seq ] )

plot( sum_seq , data_sum[ sum_seq ], main = "Cluster centres. Cluster centre ix is a predicted turning point",
     col = res$cluster,
     pch = res$cluster, type = "h", xlab = "Count from beginning ix at ix = 1",
     ylab = "Total Counts per ix" )

abline( v = res$centers, col = "chocolate" , lty = "dashed" )

text( res$centers, max(data_sum[sum_seq]) * 0.95, cex = 0.75, font = 2,
      paste( round(res$centers) ) )
a typical plot for which is shown below.

The above plot can be thought of as a clustering at a particular scale, and one can go down in scale by selecting smaller ranges of the data. For example, taking all the data clustered in the 3 clusters centred at x-axis ix values 38, 63 and 89 and re-running the code in the first code box on just this data (a sketch of how the range can be restricted follows below) gives this prediction strength plot, which suggests a K value of 6.
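Rather than restricting the range by hand, the members of the chosen clusters can be read straight from the fitted res object of the second code box. This is only an illustrative sketch: the cluster labels 2, 3 and 4 are an assumption on my part and should be checked against the plot above.

## restrict sum_seq to the members of the three chosen clusters before
## re-running the prediction strength code in the first code box
chosen_clusters = c( 2 , 3 , 4 ) ## assumed labels of the clusters centred at ix 38, 63 and 89
sub_ix = sum_seq[ res$cluster %in% chosen_clusters ] ## ix values belonging to those clusters
sum_seq = seq( from = min( sub_ix ) , to = max( sub_ix ) , by = 1 )
sum_seq_l = as.numeric( length( sum_seq ) ) ; adjust = 0 ## intraday range, so no weekend gap correction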
Re-running the code in the second code box plots these 6 clusters thus.

Looking at this last plot, it can be seen that there is a cluster at x-axis ix value 58, which corresponds to 7.30 a.m. London time, and within this green cluster there are two distinct peaks which correspond to 7.00 a.m. and 8.00 a.m. A similar visual analysis of the far right cluster, centred at ix = 94, shows a peak at the time of the New York open.

My hypothesis is that by clustering in the above manner it will be possible to identify distinct, intraday times at which the probability of a market turn is greater than at other times. More in due course.
