Dekalog Blog: Delay Embedding

Showing posts with label Delay Embedding. Show all posts

Thursday, 18 June 2020

More Work on RVFL Networks

Back in November last year I posted about Random Vector Functional Link (RVFL) networks here and here. Since then, along with my recent work on Oanda's API Octave functions and Market/Volume Profile visualisation, I have continued looking at RVFL networks and this post is an update on this work.

The "random" in RVFL means random initialisation of weights that are then fixed. It seems to me that it might be possible to do better than random by having some principled way of weight initialisation. To this end I have used the Penalised MATLAB Toolbox on features derived from my ideal cyclic tau embedding function to at first train a Generalized Linear Model with the Lasso penalty and then the Ridge penalty over thousands of sets of Monte Carlo generated, ideal cyclic prices and such prices with trends. The best weights for each set of prices were recorded in an array and then the mean weight (and standard deviation) taken. This set of mean weights is intended to replace the random weights in a RVFL network designed to predict the probability of "price" being at a cyclic turning point using the above cyclic tau embedding features.

Of course these weights could be considered a trained model in and of themselves, and the following screenshots show "out of sample" performance on Monte Carlo generated ideal prices that were not used in the training of the mean weights.

The black line is the underlying cyclic price and the red, blue and green lines are the mean weight model probabilities for cyclic peaks, troughs or neither respectively. Points where the peak/trough probabilities exceed the neither probabilities are marked by the red and blue vertical lines. Similarly, we have prices trending up in a cyclic fashion

and also trending down

In the cases of the last two trending markets only the swing highs and lows are indicated. The reason for this is that during training, based on my "expert knowledge" of the cyclic tau features used, it is unreasonable to expect these features to accurately capture the end of an up leg in a bull trend or the end of a down leg in a bear trend - hence these were not presented as a positive class during training.

As I said above the motivation for this is to get a more meaningful hidden layer in a RVFL network. This hidden layer will consist of seven Sigmoid functions which each give a probability of price being at or not being at a cyclic turn, conditional upon the type of market the input weights were trained on.

Wednesday, 13 November 2019

Random Vector Functional Link Networks

In my last post I briefly mentioned Random Vector Functional Link networks and that this post would be about them, inspired by the fact that the preliminary results of the last post suggest a shallow rather than a deep network structure.

The idea of RVFL networks is over two decades old and has perhaps been over shadowed by the more recent Extreme Learning Machine, although as mentioned in my last post there is some controversy about plagiarism with regard to ELMs. An RVFL network is basically an ELM with additional direct connections from the input layer to the output layer. The connections from the input layer to the single hidden layer are randomly generated and then fixed, the hidden layer is concatenated with the original input layer to form a new layer, H, and the connections from H to the output layer are solved in one step using the Moore Penrose inverse to get the Linear least squares solution, or alternatively using Regularised least squares. The advantage of this closed-form approach is the fast training times compared with more general optimisation routines such as gradient descent.

The above linked paper details a series of comparative tests run over various configurations of RVFL networks over a number of different data sets. Some of the main conclusions drawn are:

the direct links from input to output enhance network performance
whether to include output bias or not is a data dependent tunable factor
the radial basis function for the hidden units always leads to better performance
regularised least squares (ridge regression) performs better than the Moore Penrose inverse

Based on the code provided by the study's authors I have written the following two objective functions for use with the BayesOpt Library.

## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## .

## -*- texinfo -*- 
## @deftypefn {} {@var{J} =} rvfl_training_of_cyclic_embedding_with_cv (@var{x})
##
## Function for Bayesian training of RVFL networks with fixed parameteres of:
##
##    direct links,
##    radial basis function activation,
##    ridge regression for regularized least squares,
##
## and optimisable parameters of:
##
##    number of neurons in hidden layer,
##    lambda for the least squares regression,
##    scaling of hidden layer inputs,
##    with or without an output bias.
##
## The input X is a vector of 6 values to be optimised by the BayesOpt library
## function 'bayesoptcont.'
## The output J is the Brier Score for the test fold cross validated data.
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-11-04

function J = rvfl_training_of_cyclic_embedding_with_cv ( x )
global sample_features ; global sample_targets ;
epsilon = 1e-15 ; ## to ensure log() does not give out a nan

Nfea = size( sample_features , 2 ) ;

## check input x
if ( numel( x ) != 6 )
  error( 'The input vector x must be of length 6.' ) ;
endif

## get the parameters from input x
hidden_layer_size = floor( x( 1 ) ) ;    ## number of neurons in hidden layer
randomisation_type = floor( x( 2 ) ) ;   ## 1 == uniform, 2 == Gaussian
scale_mode = floor( x( 3 ) ) ;           ## 1 will scale the features for all neurons, 2 will scale the features for each hidden
##                                          neuron separately, 3 will scale the range of the randomization for uniform distribution
scale = x( 4 ) ;                         ## Linearly scale the random features before feeding into the nonlinear activation function. 
##                                          In this implementation, we consider the threshold which leads to 0.99 of the maximum/minimum 
##                                          value of the activation function as the saturating threshold.
##                                          scale = 0.9 means all the random features will be linearly scaled
##                                          into 0.9 * [ lower_saturating_threshold , upper_saturating_threshold ].
if_output_bias = floor( x( 5 ) + 0.5 ) ; ## Use output bias, or not? 1 == yes , 0 == no.
lambda = x( 6 ) ;                        ## the regularization coefficient lambda 

length_jj_loop = 25 ;
all_brier_values = zeros( length_jj_loop , 1 ) ;  

##rand( 'seed' , 0 ) ;
##randn( 'seed' , 0 ) ;
##U_sample_targets = unique( sample_targets ) ;
##nclass = numel( U_sample_targets ) ;

##sample_targets_temp = zeros( numel( sample_targets ) , nclass ) ; 
##
#### get the 0 - 1 one hot coding for the target,  
##for i = 1 : nclass
##  idx = sample_targets == U_sample_targets( i ) ;
##  sample_targets_temp( idx , i ) = 1 ;
##endfor

###### information for splitting into training and test sets ###############
ix_positive_targets = find( sample_targets == 1 ) ;
ix_negative_targets = ( 1 : numel( sample_targets ) )' ; 
ix_negative_targets( ix_positive_targets ) = [] ;
## split 20/80
split_no1 = round( 0.2 * numel( ix_positive_targets ) ) ;
split_no2 = round( 0.2 * numel( ix_negative_targets ) ) ;

######### get type of randomisation from input x #################
if ( randomisation_type == 1 ) ## uniform randomisation 
  
  if ( scale_mode == 3 ) ## range scaled for uniform randomisation
     Weight = scale * ( rand( Nfea , hidden_layer_size ) * 2 - 1 ) ; ## scaled uniform random input weights to hidden layer
     Bias = scale * rand( 1 , hidden_layer_size ) ;                  ## scaled random bias weights to hidden layer
  else
     Weight = rand( Nfea , hidden_layer_size ) * 2 - 1 ; ## unscaled random input weights to hidden layer
     Bias = rand( 1 , hidden_layer_size ) ;              ## unscaled random bias weights to hidden layer
  endif
    
elseif ( randomisation_type == 2 ) ## gaussian randomisation
  Weight = randn( Nfea , hidden_layer_size ) ; ## gaussian random input weights to hidden layer
  Bias = randn( 1 , hidden_layer_size ) ;      ## gaussian random bias weights to hidden layer
else
  error( 'only Gaussian and Uniform are supported' )
endif
############################################################################

## Activation Function    
Saturating_threshold = [ -2.1 , 2.1 ] ;
Saturating_threshold_activate = [ 0 , 1 ] ;

for jj = 1 : length_jj_loop

## shuffle
randperm1 = randperm( numel( ix_positive_targets) ) ;
randperm2 = randperm( numel( ix_negative_targets) ) ;
test_ix1 = ix_positive_targets( randperm1( 1 : split_no1 ) ) ;
test_ix2 = ix_negative_targets( randperm2( 1 : split_no2 ) ) ;
test_ix = [ test_ix1 ; test_ix2 ] ;
train_ix1 = ix_positive_targets( randperm1( split_no1 + 1 : end ) ) ;
train_ix2 = ix_negative_targets( randperm2( split_no2 + 1 : end ) ) ;
train_ix = [ train_ix1 ; train_ix2 ] ;

sample_targets_train = sample_targets( train_ix ) ;
sample_features_train = sample_features( train_ix , : ) ;

Nsample = size( sample_features_train , 1 ) ;

Bias_train = repmat( Bias , Nsample , 1 ) ;
H = sample_features_train * Weight + Bias_train ;

if ( scale_mode == 1 )
  ## scale the features for all neurons 
  [ H , k , b ] = Scale_feature( H , Saturating_threshold , scale ) ;
elseif ( scale_mode == 2 ) 
  ## else scale the features for each hidden neuron separately
  [ H , k , b ] = Scale_feature_separately( H , Saturating_threshold , scale ) ;
endif
  
## actual activation, the radial basis function
H = exp( -abs( H ) ) ;

if ( if_output_bias == 1 )
  ## we will use an output bias
  H = [ H , ones( Nsample , 1 ) ] ; 
endif

## the direct link scaling options, concatenate hidden layer and sample_features_train 
if ( scale_mode == 1 )
  ## scale the features for all neurons
  sample_features_train = sample_features_train .* k + b ;
  H = [ H , sample_features_train ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  [ sample_features_train , ktr , btr ] = Scale_feature_separately( sample_features_train , Saturating_threshold_activate , scale ) ;
  H = [ H , sample_features_train ] ;
else
  H = [ H , sample_features_train ] ;
endif

H( isnan( H ) ) = 0 ; ## avoids any 'blowups' due to nans in H

## do the regularized least squares for concatenated hidden layer output
## and the original, possibly scaled, input sample_features
if ( hidden_layer_size < Nsample )
 beta = ( eye( size( H , 2 ) ) / lambda + H' * H ) \ H' * sample_targets_train ;
else
 beta = H' * ( ( eye( size( H , 1 ) ) / lambda + H * H' ) \ sample_targets_train ) ; 
endif

############# now the test on test data ####################################
Bias_test = repmat( Bias , numel( sample_targets( test_ix ) ) , 1 ) ;
H_test = sample_features( test_ix , : ) * Weight + Bias_test ;

if ( scale_mode == 1 )
  ## scale the features for all neurons
  H_test = H_test .* k + b ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  nSamtest = size( H_test , 1 ) ; 
  kt = repmat( k , nSamtest , 1 ) ;
  bt = repmat( b , nSamtest , 1 ) ;
  H_test = H_test .* kt + bt ;
endif

## actual activation, the radial basis function
H_test = exp( -abs( H_test ) ) ;

if ( if_output_bias == 1 )
  ## we will use an output bias
  H_test = [ H_test , ones( numel( sample_targets( test_ix ) ) , 1 ) ] ; 
endif

## the direct link scaling options, concatenate hidden layer and sample_features_train 
if ( scale_mode == 1 )
  ## scale the features for all neurons 
  testX_temp = sample_features( test_ix , : ) .* k + b ;
  H_test = [ H_test , testX_temp ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  nSamtest = size( H_test , 1 ) ; 
  kt = repmat( ktr , nSamtest , 1 ) ;
  bt = repmat( btr , nSamtest , 1 ) ;
  testX_temp = sample_features( test_ix , : ) .* kt + bt ;   
  H_test = [ H_test , testX_temp ] ; 
else
  H_test = [ H_test , sample_features( test_ix , : ) ] ; 
endif

H_test( isnan( H_test ) ) = 0 ; ## avoids any 'blowups' due to nans in H_test

## get the test predicted target output
test_targets = H_test * beta ;

##Y_temp = zeros( Nsample , 1 ) ;
##% decode the target output
##for i = 1 : Nsample
##    [ maxvalue , idx ] = max( sample_targets_temp( i , : ) ) ;
##    Y_temp( i ) = U_sample_targets( idx ) ;
##endfor

############################################################################

## the final logistic output
final_output = 1.0 ./ ( 1.0 .+ exp( -test_targets ) ) ;

## get the Brier_score
## https://en.wikipedia.org/wiki/Brier_score
all_brier_values( jj ) = mean( ( final_output .- sample_targets( test_ix ) ) .^ 2 ) ;

rand( 'state' ) ; randn( 'state' ) ; ## reset rng

endfor ## end of jj loop

J = mean( all_brier_values ) ;

endfunction

## Various measures of goodness
## https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models
## https://www.fharrell.com/post/classification/
## https://stats.stackexchange.com/questions/433628/what-is-a-reliable-measure-of-accuracy-for-logistic-regression
## https://www.jstatsoft.org/article/view/v090i12
## https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
## https://stats.stackexchange.com/questions/319666/aic-with-test-data-is-it-possible
## https://www.learningmachines101.com/lm101-076-how-to-choose-the-best-model-using-aic-or-gaic/
## https://stackoverflow.com/questions/48185090/how-to-get-the-log-likelihood-for-a-logistic-regression-model-in-sklearn
## https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
## https://stats.stackexchange.com/questions/163221/whats-the-measure-to-assess-the-binary-classification-accuracy-for-imbalanced-d
## https://stats.stackexchange.com/questions/168929/logistic-regression-is-predicting-all-1-and-no-0
## https://stats.stackexchange.com/questions/435307/multiple-linear-regression-lse-when-one-of-parameter-is-known

These functions stand on the shoulders of the above and hard code direct links, radial basis function activation and ridge regression, with the number of neurons in the hidden layer, lambda for the ridge regression, different scaling options and inclusion of output bias or not as the optimisable parameters. The function minimisation objective is the Brier score.

This second function is slightly different in that the Akaike information criterion is the minimisation objective and there is the option to use the Netlab Generalised linear model function to solve for the hidden to output weights (comment out the relevant code as necessary.)

## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## .

## -*- texinfo -*- 
## @deftypefn {} {@var{J} =} rvfl_training_of_cyclic_embedding (@var{x})
##
## Function for Bayesian training of RVFL networks with fixed parameteres of:
##
##    direct links,
##    radial basis function activation,
##    ridge regression for regularized least squares,
##
## and optimisable parameters of:
##
##    number of neurons in hidden layer,
##    lambda for the least squares regression,
##    scaling of hidden layer inputs,
##    with or without an output bias.
##
## The input X is a vector of 6 values to be optimised by the BayesOpt library
## function 'bayesoptcont.'
## The output J is the AIC value for the tested model.
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-11-04

function J = rvfl_training_of_cyclic_embedding ( x )
global sample_features ; global sample_targets ;
epsilon = 1e-15 ; ## to ensure log() does not give out a nan

## check input x
if ( numel( x ) != 4 )
  error( 'The input vector x must be of length 6.' ) ;
endif

## get the parameters from input x
hidden_layer_size = floor( x( 1 ) ) ;    ## number of neurons in hidden layer
randomisation_type = floor( x( 2 ) ) ;   ## 1 == uniform, 2 == Gaussian
scale_mode = floor( x( 3 ) ) ;           ## 1 will scale the features for all neurons, 2 will scale the features for each hidden
##                                          neuron separately, 3 will scale the range of the randomization for uniform distribution
scale = x( 4 ) ;                         ## Linearly scale the random features before feeding into the nonlinear activation function. 
##                                          In this implementation, we consider the threshold which leads to 0.99 of the maximum/minimum 
##                                          value of the activation function as the saturating threshold.
##                                          scale = 0.9 means all the random features will be linearly scaled
##                                          into 0.9 * [ lower_saturating_threshold , upper_saturating_threshold ].
##if_output_bias = floor( x( 5 ) + 0.5 ) ; ## Use output bias, or not? 1 == yes , 0 == no.
##lambda = x( 6 ) ;                        ## the regularization coefficient lambda 

##length_jj_loop = 25 ;
##all_aic_values = zeros( length_jj_loop , 1 ) ;  

rand( 'seed' , 0 ) ;
randn( 'seed' , 0 ) ;
##U_sample_targets = unique( sample_targets ) ;
##nclass = numel( U_sample_targets ) ;

##sample_targets_temp = zeros( numel( sample_targets ) , nclass ) ; 
##
#### get the 0 - 1 one hot coding for the target,  
##for i = 1 : nclass
##  idx = sample_targets == U_sample_targets( i ) ;
##  sample_targets_temp( idx , i ) = 1 ;
##endfor

sample_targets_temp = sample_targets ;

[ Nsample , Nfea ] = size( sample_features ) ;

######### get type of randomisation from input x #################
if ( randomisation_type == 1 ) ## uniform randomisation 
  
  if ( scale_mode == 3 ) ## range scaled for uniform randomisation
     Weight = scale * ( rand( Nfea , hidden_layer_size ) * 2 - 1 ) ; ## scaled uniform random input weights to hidden layer
     Bias = scale * rand( 1 , hidden_layer_size ) ;                  ## scaled random bias weights to hidden layer
  else
     Weight = rand( Nfea , hidden_layer_size ) * 2 - 1 ; ## unscaled random input weights to hidden layer
     Bias = rand( 1 , hidden_layer_size ) ;              ## unscaled random bias weights to hidden layer
  endif
    
elseif ( randomisation_type == 2 ) ## gaussian randomisation
  Weight = randn( Nfea , hidden_layer_size ) ; ## gaussian random input weights to hidden layer
  Bias = randn( 1 , hidden_layer_size ) ;      ## gaussian random bias weights to hidden layer
else
  error( 'only Gaussian and Uniform are supported' )
endif
############################################################################

Bias_train = repmat( Bias , Nsample , 1 ) ;
H = sample_features * Weight + Bias_train ;

k_parameters = numel( Weight ) + numel( Bias_train ) ;

## Activation Function    
Saturating_threshold = [ -2.1 , 2.1 ] ;
Saturating_threshold_activate = [ 0 , 1 ] ;

if ( scale_mode == 1 )
  ## scale the features for all neurons 
  [ H , k , b ] = Scale_feature( H , Saturating_threshold , scale ) ;
elseif ( scale_mode == 2 ) 
  ## else scale the features for each hidden neuron separately
  [ H , k , b ] = Scale_feature_separately( H , Saturating_threshold , scale ) ;
endif
  
## actual activation, the radial basis function
H = exp( -abs( H ) ) ;

## glm training always applies a bias, so comment out if training with netlab glm
##if ( if_output_bias == 1 )
##  ## we will use an output bias
##  H = [ H , ones( Nsample , 1 ) ] ; 
##endif

## the direct link scaling options, concatenate hidden layer and sample_features 
if ( scale_mode == 1 )
  ## scale the features for all neurons
  sample_features_temp = sample_features .* k + b ;
  H = [ H , sample_features_temp ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  [ sample_features_temp , ktr , btr ] = Scale_feature_separately( sample_features , Saturating_threshold_activate , scale ) ;
  H = [ H , sample_features_temp ] ;
else
  H = [ H , sample_features ] ;
endif

H( isnan( H ) ) = 0 ; ## avoids any 'blowups' due to nans in H

############ THE ORIGINAL REGULARISED LEAST SQUARES CODE ###################
## do the regularized least squares for concatenated hidden layer output            
## and the original, possibly scaled, input sample_features                         
##if ( hidden_layer_size < Nsample )                                                  
## beta = ( eye( size( H , 2 ) ) / lambda + H' * H ) \ H' * sample_targets_temp ;     
##else                                                                                
## beta = H' * ( ( eye( size( H , 1 ) ) / lambda + H * H' ) \ sample_targets_temp ) ; 
##endif                                                                               
############################################################################

##k_parameters = k_parameters + numel( beta ) ;

## get the model predicted target output
##sample_targets_temp = H * beta ;

## the final logistic output
##final_output = 1.0 ./ ( 1.0 .+ exp( -sample_targets_temp ) ) ;

############ REPLACED BY GLM TRAINING USING NETLAB #########################
net = glm( size( H , 2 ) , 1 , 'logistic' ) ; ## Create a generalized linear model structure.
options = foptions ; ## Set default parameters for optimisation routines, for compatibility with MATLAB's foptions()
options( 1 ) = -1 ;   ## change default value
## OPTIONS(1) is set to 1 to display error values during training. If
## OPTIONS(1) is set to 0, then only warning messages are displayed.  If
## OPTIONS(1) is -1, then nothing is displayed.
options( 14 ) = 5 ; ## change default value
## OPTIONS(14) is the maximum number of iterations for the IRLS
## algorithm;  default 100.
net = glmtrain( net , options , H , sample_targets ) ;

k_parameters = k_parameters + net.nwts ;

## get output of trained glm model
final_output = glmfwd( net , H ) ;
############################################################################

##Y_temp = zeros( Nsample , 1 ) ;
##% decode the target output
##for i = 1 : Nsample
##    [ maxvalue , idx ] = max( sample_targets_temp( i , : ) ) ;
##    Y_temp( i ) = U_sample_targets( idx ) ;
##endfor

############################################################################

##  https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/
##
##  likelihood = yhat * y + (1 – yhat) * (1 – y)
##
##  We can update the likelihood function using the log to transform it into a log-likelihood function:
##
##      log-likelihood = log(yhat) * y + log(1 – yhat) * (1 – y)

##  Finally, we can sum the likelihood function across all examples in the dataset to maximize the likelihood:
##
##      maximize sum i to n log(yhat_i) * y_i + log(1 – yhat_i) * (1 – y_i)

log_likelihood = sum( log( final_output .+ epsilon ) .* sample_targets + log( 1 .- final_output .+ epsilon ) .* ( 1 .- sample_targets ) ) ;

## get Akaike Information criteria
J = 2 * k_parameters - 2 * log_likelihood ;

## get the Brier_score
## https://en.wikipedia.org/wiki/Brier_score
##J = mean( ( final_output .- sample_targets_temp ) .^ 2 ) ;

rand( 'state' ) ; randn( 'state' ) ; ## reset rng

endfunction

## Various measures of goodness
## https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models
## https://www.fharrell.com/post/classification/
## https://stats.stackexchange.com/questions/433628/what-is-a-reliable-measure-of-accuracy-for-logistic-regression
## https://www.jstatsoft.org/article/view/v090i12
## https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
## https://stats.stackexchange.com/questions/319666/aic-with-test-data-is-it-possible
## https://www.learningmachines101.com/lm101-076-how-to-choose-the-best-model-using-aic-or-gaic/
## https://stackoverflow.com/questions/48185090/how-to-get-the-log-likelihood-for-a-logistic-regression-model-in-sklearn
## https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
## https://stats.stackexchange.com/questions/163221/whats-the-measure-to-assess-the-binary-classification-accuracy-for-imbalanced-d
## https://stats.stackexchange.com/questions/168929/logistic-regression-is-predicting-all-1-and-no-0
## https://stats.stackexchange.com/questions/435307/multiple-linear-regression-lse-when-one-of-parameter-is-known

Both of these functions are working code and are heavily commented and perhaps not very polished.

As I write this post I have various tests running in the background and will report on the results in due course.

Wednesday, 30 October 2019

Weight Agnostic Neural Net Training

I have recently come across the idea of weight agnostic neural net training and have implemented a crude version of this combined with the recent work I have been doing on Taken's Theorem ( see my posts here, here and here ) and using the statistical mechanics approach to creating synthetic data.

Using the simple Octave function below with the Akaike Information Criterion as the minimisation objective

## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## .

## -*- texinfo -*- 
## @deftypefn {} {@var{J} =} wann_training_of_cyclic_embedding()
##
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-10-26

function J = wann_training_of_cyclic_embedding( x )
global sample_features ; global sample_targets ;
epsilon = 1e-15 ; ## to ensure log() does not give out a nan

## get the parameters from input x
activation_funcs = floor( x( 1 : 5 ) ) ; ## get the activations, 1 == sigmoid, 2 == tanh, 3 == Lecun sigmoid
layer_size = floor( x( 6 : 10 ) ) ;

[ min_layer_size , ix_min ] = min( layer_size ) ;
  if( min_layer_size > 0 ) ## to be expected most of the time
    nn_depth = length( layer_size ) ;
  elseif( min_layer_size == 0 ) ## one layer has no nodes, hence limits depth of nn
    nn_depth = ix_min - 1 ;
  endif 

length_jj_loop = 25 ;
all_aic_values = zeros( length_jj_loop , 1 ) ;  
  
for jj = 1 : length_jj_loop 
  
  previous_layer_out = sample_features ;
  sum_of_k = 0 ;
    
  for ii = 1 : nn_depth
    
    new_weight_matrix = ones( size( previous_layer_out , 2 ) , layer_size( ii ) ) ./ sqrt( size( previous_layer_out , 2 ) ) ;
    sum_of_k = sum_of_k + numel( new_weight_matrix ) ;
    prior_to_activation_input = previous_layer_out * new_weight_matrix ;

    ## select the activation function 
    if( activation_funcs( ii ) == 1 ) ## sigmoid activation
      previous_layer_out = 1.0 ./ ( 1.0 .+ exp( -prior_to_activation_input ) ) ;
    elseif( activation_funcs( ii ) == 2 ) ## tanh activation
      previous_layer_out = tanh( prior_to_activation_input ) ;
    elseif( activation_funcs( ii ) == 3 ) ## lecun sigmoid activation
      previous_layer_out = sigmoid_lecun_m( prior_to_activation_input ) ;
    endif 
    
  endfor

  ## the final logistic output
  new_weight_matrix = ones( size( previous_layer_out , 2 ) , 1 ) ./ sqrt( size( previous_layer_out , 2 ) ) ;
  sum_of_k = sum_of_k + numel( new_weight_matrix ) ;
  final_output = previous_layer_out * new_weight_matrix ;
  final_output = 1.0 ./ ( 1.0 .+ exp( -final_output ) ) ;

  max_likelihood = sum( log( final_output .+ epsilon ) .* sample_targets + log( 1 .- final_output .+ epsilon ) .* ( 1 .- sample_targets ) ) ;
  
  ## get Akaike Information criteria
  all_aic_values( jj ) = 2 * sum_of_k - 2 * max_likelihood ;

endfor ## end of jj loop

J = mean( all_aic_values ) ;

endfunction

and the Octave interface of the Bayesopt Library I am currently iterating over different architectures ( up to 5 hidden layers deep with a max of 100 nodes per layer and using a choice of 3 hidden activations ) for a simple Logistic Regression model to predict turning points in different sets of statistical mechanics synthetic data using just features based on Taken's embedding.

More in due course.

Tuesday, 1 October 2019

Ideal Cyclic Tau Embedding as Times Series Features

Continuing on from my Ideal Tau for Time Series Embedding post, I have now written an Octave function based on these ideas to produce features for time series modelling. The function outputs are two slightly different versions of features, examples of which are shown in the following two plots, which show up and down trends in black, following a sinusoidal sideways market partially visible to the left:

and

The function outputs are normalised price and price delayed by a quarter and a half of the cycle period (in this case 20.) The trend slopes have been chosen to exemplify the difference in the features between up and down trends. The function outputs are identical for the unseen cyclic prices to the left.

When the sets of function outputs are plotted as 3D phase space plots typical results are

with the green, blue and red phase trajectories corresponding to the cyclic, up trending and down trending portions of the above synthetic price series. The markers in this plot correspond to points in the phase trajectories at which turns in the underlying price series occur. The following plot is the above plot rotated in phase space such that the green cyclic price phase trajectory is horizontally orientated.

It is clearer in this second phase space plot that there is a qualitative difference in the phase space trajectories of the different, underlying market types.

As a final check of feature relevance I used the Boruta R package, with the above turning point markers as classification targets (a turning point vs. not a turning point,) to assess the utility of this approach in general. These tests were conducted on real price series and also on indices created by my currency strength methodology. In all such Boruta tests the features are deemed "relevant" up to a delay of five bars on daily data (I ceased the tests at the five bar mark because it was becoming tedious to carry on.)

In summary, therefore, it can be concluded that features derived using Taken's theorem are useful on financial time series provided that:

the underlying time series are normalised (detrended) and,
the embedding delay (Tau) is set to the theoretical optimum of a quarter (and multiples thereof) of the measured cyclic period of the time series.

Monday, 16 September 2019

The Ideal Tau for Time Series Embedding?

In my Preliminary Test Results of Time Series Embedding post I got a bit ahead of myself and mistakenly quoted the ideal embedding length (Tau) to be half a period for cyclic prices. This should actually have been a quarter of a period and I have now corrected my earlier post.

This post is about my further investigations, which uncovered the above, and is written in a logical progression rather than the order in which I actually did things.

It was important to look at the Tau value for ideal sinusoidal data to confirm the quarter period, which checked out because I was able to easily recreate the 2D Lissajous curves in the Novel Method for Topological Embedding of Time-Series Data paper. I then extended this approach to 3D to produce plots such as this one, which may require some explaining.

This is a 3D Phase space plot of adaptive Max Min Normalised "sinusoidal prices" normalised to fall in the range [ -1 +1 ] and adaptive to the known, underlying period of the sine wave. This was done for prices with and without a "trend." The circular orbits are for the pure sine wave, representing cyclic price action plotted in 3D as current price, price delayed by a quarter of a cycle (Tau 1) and by half a cycle (Tau 2). The mini, blue and red "constellations" outside the circular orbits are the same pure sine wave(s) with an uptrend and downtrend added. The utility of this representation for market classification etc. is obvious, but is not the focus of this post.

Looking at Tau values for ideal prices with various trends taken from real market data, normalised as described above, across a range of known periods (because they were synthetically generated) the following results were obtained.

This is a plot of periods 10 to 47 (left to right along the x-axis) vs the global tau value (y-axis) measured using my Octave version of the mdDelay.m function from the mdembedding github. Each individual, coloured line plot is based on the underlying trend of one of 66 different, real time series. Each point per period per tradeable is the median value of 20 Monte Carlo runs over that tradeable's trend + synthetic, ideal sine wave price. The thick, black line is the median of the median values per period. Although noisy, it can be seen that the median line(s) straddle the theoretical, quarter cycle Tau (0.25) quite nicely. It should be remembered that this is on normalised prices. For comparison, the below plot shows the same on the unnormalised prices.

where the Tau value starts at over 0.5 at a cyclic period of 10 at the left and then slowly decreases to just over 0.1 at period 47.

More in due course.

Pages