Showing posts with label Neural nets. Show all posts
Showing posts with label Neural nets. Show all posts

Wednesday, 3 September 2025

Expressing an Indicator in Neural Net Form

Recently I started investigating relative rotation graphs with a view to perhaps implementing a version of this for use on forex currency pairs. The underlying idea of a relative rotation graph is to plot an asset's relative strength compared to a benchmark and the momentum of this relative strength and to plot this in a form similar to a Polar coordinate system plot, which rotates around a zero point representing zero relative strength and zero relative strength momentum. After some thought it occurred to me that rather than using relative strength against a benchmark I could use the underlying relative strengths of the individual currencies, as calculated by my currency strength indicator, against each other. Furthermore, these underlying log changes in the individual currencies can be normalised using the ideas of brownian motion and then averaged together over different look back periods to create a unique indicator.

This indicator was coded up in the "traditional" way that will be familiar to anybody who has ever tried coding a trading indicator in any coding language. However, I thought that if the calculation of this indicator could be expressed in the form of a Feedforward neural network then the optimisation opportunities available using backpropagation and regularization could be used to tweek the indicator in more useful ways than just varying a look back length and amount of averaging. After some work I was able to make this work in just these two lines of Octave code:

indicator_middle_layer = tanh( full_feature_matrix * input_weights ) ;
indicator_nn_output = tanh( indicator_middle_layer * output_weights ) ; 

Of course, prior to calling these two lines of code, there is some feature engineering to create the input full_feature_matrix, and the input weights and output_weights matrices taken together are mathematically equivalent to the original indicator calculations. Finally, because this is a neural net expression of the indicator, the non-linear tanh activation function is applied to the hidden middle and output layers of the net.

The following plot shows the original indicator in black and the neural net version of it in blue 

over the data shown in this plot of 10 minute bars of the EURUSD forex pair. 

The red indicator in the plot above is the 1 bar momentum of the neural net indicator plot.

To judge the quality of this indicator I used the entropy measure (measured over 14,000+ 10 minute bars of EURUSD), the results of which are shown next.

entropy_indicator_original = 0.9485
entropy_indicator_nn_output = 0.9933

An entropy reading of 0.9933 is probably as good as any trading indicator could hope to achieve (a perfect reading is 1.0) and so the next thing was to quickly back test the indicator performance. Based on the logic of the indicator the obvious long (short) signals are being above (below) the zeroline, or equivalently the sign of the indicator, and for good measure I also tested the sign of the momentum and some simple moving averages thereof.

The following plot shows the equity curves of this quick test where it is visually clear that the blue equity curves are "the best" when plotted in relation to the black "buy and hold" equivalent equity curve. These represent the equity curves of a 3 bar simple moving average of the 1 bar momentum of both the original formulation of the indicator and the neural net implementation. I would point out that these equity curves represent the theoretical equity resulting from trading the London session after 9:00am BST and stopping at 7:00am EST (usually about noon BST) and then starting trading again at 9:00am EST until 17:00pm EST. This schedule avoids the opening sessions (7:00 to 9:00am) in both London and New York because, from my observations of many OHLC charts such as shown above, there are frequently wild swings where price is being pushed to significant points such as volume profile clusters, accumulations of buy/sell orders etc. and in my opinion no indicator can be expected to perform well in that sort of environment. Also avoided are the hours prior to 7:00am BST, i.e. the Asian session or overnight session.    

Although these equity curves might not, at first glance, be that impressive, especially as they do not account for trading costs etc. my intent on doing these tests was to determine the configuration of a final "decision" output layer to be added to the neural net implementation of the indicator. A 3 bar simple moving average of the 1 bar momentum implies the necessity to include 4 consecutive readings of the indicator output as input to a final " decision" layer. The following simple, hand-drawn sketch shows what I mean:
A discussion of this will be the subject of my next post.

Wednesday, 28 February 2024

Indicator(s) Derived from PositionBook Data

Since my last post I have been trying to create new indicators from PositionBook data but unfortunately I have had no luck in doing so. I have have tried differences, ratios, cumulative sums, logs and control charts to no avail and I have decided to discontinue this line of investigation because it doesn't seem to hold much promise. The only other direct uses I can think of for this data are:

I am not yet sure which of the above I will look at next, but whichever it is will be the subject of a future post.

Friday, 8 April 2022

Simple Machine Learning Models on OrderBook/PositionBook Features

This post is about using OrderBook/PositionBook features as input to simple machine learning models after previous investigation into the relevance of such features. 

Due to the amount of training data available I decided to look only at a linear model and small neural networks (NN) with a single hidden layer with up to 6 hidden neurons. This choice was motivated by an academic paper I read online about linear models which stated that, as a lower bound, one should have at least 10 training examples for each parameter to be estimated. Other online reading about order flow imbalance (OFI) suggested there is a linear relationship between OFI and price movement. Use of limited size NNs would allow a small amount of non linearity in the relationship. For this investigation I used the Netlab toolbox and Octave. A plot of the learning curves of the classification models tested is shown below. The targets were binary 1/0 for price increases/decreases.

The blue lines show the average training error (y axis) and the red lines show the same average error metric on the held out cross validation data set for each tested model. The thickness of the lines represents the number of neurons in the single hidden layer of the NNs (the thicker the lines, the higher the number of hidden neurons). The horizontal green line shows the error of a generalized linear model (GLM) trained using iteratively reweighted least squares. It can be seen that NN models with 1 and 2 hidden neurons slightly outperform the GLM, with the 2 neuron model having the edge over the 1 neuron model. NN models with 3 or more hidden neurons over fit and underperform the GLM. The NN models were trained using Netlab's functions for Bayesian regularization over the parameters.

Looking at these results it would seem that a 2 neuron NN would be the best choice; however the error differences between the 1 and 2 neuron NNs and GLM are small enough to anticipate that the final classifications (with a basic greater/less than a 0.5 logistic threshold value for long/short) would perhaps be almost identical. 

Investigations into this will be the subject of my next post. 

The code box below gives the working Octave code for the above.

## load data
##training_data = dlmread( 'raw_netlab_training_features' ) ;
##cv_data = dlmread( 'raw_netlab_cv_features' ) ;
training_data = dlmread( 'netlab_training_features_svd' ) ;
cv_data = dlmread( 'netlab_cv_features_svd' ) ;
training_targets = dlmread( 'netlab_training_targets' ) ;
cv_targets = dlmread( 'netlab_cv_targets' ) ;

kk_loop_record = zeros( 30 , 7 ) ;

for kk = 1 : 30
  
## first train a glm model as a base comparison
input_dim = size( training_data , 2 ) ; ## Number of inputs.

net_lin = glm( input_dim , 1 , 'logistic' ) ; ## Create a generalized linear model structure.
options = foptions ; ## Sets default parameters for optimisation routines, for compatibility with MATLAB's foptions()
options(1) = 1 ;  ## change default value
##	OPTIONS(1) is set to 1 to display error values during training. If
##	OPTIONS(1) is set to 0, then only warning messages are displayed.  If
##	OPTIONS(1) is -1, then nothing is displayed.
options(14) = 5 ; ## change default value
##	OPTIONS(14) is the maximum number of iterations for the IRLS
##	algorithm;  default 100.
net_lin = glmtrain( net_lin , options , training_data , training_targets ) ;

## test on cv_data
glm_out = glmfwd( net_lin , cv_data ) ;
## cross-entrophy loss
glm_out_loss = -mean( cv_targets .* log( glm_out )  .+ ( 1 .- cv_targets ) .* log( 1 .- glm_out ) ) ;

kk_loop_record( kk , 7 ) = glm_out_loss ;

## now train an mlp
## Set up vector of options for the optimiser.
nouter = 30 ; ## Number of outer loops.
ninner = 2 ;	## Number of innter loops.
options = foptions ; ## Default options vector.
options( 1 ) = 1 ;	## This provides display of error values.
options( 2 ) = 1.0e-5 ; ## Absolute precision for weights.
options( 3 ) = 1.0e-5 ; ## Precision for objective function.
options( 14 ) = 100 ; ## Number of training cycles in inner loop.

training_learning_curve = zeros( nouter , 6 ) ; 
cv_learning_curve = zeros( nouter , 6 ) ;

for jj = 1 : 6

## Set up network parameters.
nin = size( training_data , 2 ) ; ## Number of inputs.
nhidden = jj ;	## Number of hidden units.
nout = 1 ; ## Number of outputs.
alpha = 0.01 ; ## Initial prior hyperparameter.
aw1 = 0.01 ;
ab1 = 0.01 ;
aw2 = 0.01 ;
ab2 = 0.01 ;

## Create and initialize network weight vector.
prior = mlpprior(nin , nhidden , nout , aw1 , ab1 , aw2 , ab2 ) ;
net = mlp( nin , nhidden , nout , 'logistic' , prior ) ;

## Train using scaled conjugate gradients, re-estimating alpha and beta.
for ii = 1 : nouter
  ## train net
  net = netopt( net , options , training_data , training_targets , 'scg' ) ;
  
  train_out = mlpfwd( net , training_data ) ;
  ## get train error
  ## mse
  ##training_learning_curve( ii ) = mean( ( training_targets .- train_out ).^2 ) ;
  
  ## cross entropy loss
  training_learning_curve( ii , jj ) = -mean( training_targets .* log( train_out )  .+ ( 1 .- training_targets ) .* log( 1 .- train_out ) ) ; 

  cv_out = mlpfwd( net , cv_data ) ;
  ## get cv error
  ## mse
  ##cv_learning_curve( ii ) = mean( ( cv_targets .- cv_out ).^2 ) ;
  
  ## cross entropy loss
  cv_learning_curve( ii , jj ) = -mean( cv_targets .* log( cv_out )  .+ ( 1 .- cv_targets ) .* log( 1 .- cv_out ) ) ; 
  
  ## now update hyperparameters based on evidence
  [ net , gamma ] = evidence( net , training_data , training_targets , ninner ) ;
  
##  fprintf( 1 , '\nRe-estimation cycle ##d:\n' , ii ) ;
##  disp( [ '  alpha = ' , num2str( net.alpha' ) ] ) ;
##  fprintf( 1 , '  gamma =  %8.5f\n\n' , gamma ) ;
##  disp(' ')
##  disp('Press any key to continue.')
  ##pause;
endfor ## ii loop

endfor ## jj loop

kk_loop_record( kk , 1 : 6 ) = cv_learning_curve( end , : ) ;

endfor ## kk loop

plot( training_learning_curve(:,1) , 'b' , 'linewidth' , 1 , cv_learning_curve(:,1) , 'r' , 'linewidth' , 1 , ...
training_learning_curve(:,2) , 'b' , 'linewidth' , 2 , cv_learning_curve(:,2) , 'r' , 'linewidth' , 2 , ...
training_learning_curve(:,3) , 'b' , 'linewidth' , 3 , cv_learning_curve(:,3) , 'r' , 'linewidth' , 3 , ...
training_learning_curve(:,4) , 'b' , 'linewidth' , 4 , cv_learning_curve(:,4) , 'r' , 'linewidth' , 4 , ...
training_learning_curve(:,5) , 'b' , 'linewidth' , 5 , cv_learning_curve(:,5) , 'r' , 'linewidth' , 5 , ...
training_learning_curve(:,6) , 'b' , 'linewidth' , 6 , cv_learning_curve(:,6) , 'r' , 'linewidth' , 6 , ...
ones( size( training_learning_curve , 1 ) , 1 ).*glm_out_loss , 'g' , 'linewidth', 2 ) ;

##  >> mean(kk_loop_record)
##  ans =
##
##     0.6928   0.6927   0.7261   0.7509   0.7821   0.8112   0.6990

##  >> std(kk_loop_record)
##  ans =
##
##     8.5241e-06   7.2869e-06   1.2999e-02   1.5285e-02   2.5769e-02   2.6844e-02   2.2584e-16

Thursday, 18 June 2020

More Work on RVFL Networks

Back in November last year I posted about Random Vector Functional Link (RVFL) networks here and here. Since then, along with my recent work on Oanda's API Octave functions and Market/Volume Profile visualisation, I have continued looking at RVFL networks and this post is an update on this work.

The "random" in RVFL means random initialisation of weights that are then fixed. It seems to me that it might be possible to do better than random by having some principled way of weight initialisation. To this end I have used the Penalised MATLAB Toolbox on features derived from my ideal cyclic tau embedding function to at first train a Generalized Linear Model with the Lasso penalty and then the Ridge penalty over thousands of sets of Monte Carlo generated, ideal cyclic prices and such prices with trends. The best weights for each set of prices were recorded in an array and then the mean weight (and standard deviation) taken. This set of mean weights is intended to replace the random weights in a RVFL network designed to predict the probability of "price" being at a cyclic turning point using the above cyclic tau embedding features.

Of course these weights could be considered a trained model in and of themselves, and the following screenshots show "out of sample" performance on Monte Carlo generated ideal prices that were not used in the training of the mean weights.
The black line is the underlying cyclic price and the red, blue and green lines are the mean weight model probabilities for cyclic peaks, troughs or neither respectively. Points where the peak/trough probabilities exceed the neither probabilities are marked by the red and blue vertical lines. Similarly, we have prices trending up in a cyclic fashion
and also trending down
In the cases of the last two trending markets only the swing highs and lows are indicated. The reason for this is that during training, based on my "expert knowledge" of the cyclic tau features used, it is unreasonable to expect these features to accurately capture the end of an up leg in a bull trend or the end of a down leg in a bear trend - hence these were not presented as a positive class during training.

As I said above the motivation for this is to get a more meaningful hidden layer in a RVFL network. This hidden layer will consist of seven Sigmoid functions which each give a probability of price being at or not being at a cyclic turn, conditional upon the type of market the input weights were trained on.

More in due course.

Wednesday, 13 November 2019

Random Vector Functional Link Networks

In my last post I briefly mentioned Random Vector Functional Link networks and that this post would be about them, inspired by the fact that the preliminary results of the last post suggest a shallow rather than a deep network structure.

The idea of RVFL networks is over two decades old and has perhaps been over shadowed by the more recent Extreme Learning Machine, although as mentioned in my last post there is some controversy about plagiarism with regard to ELMs. An RVFL network is basically an ELM with additional direct connections from the input layer to the output layer. The connections from the input layer to the single hidden layer are randomly generated and then fixed, the hidden layer is concatenated with the original input layer to form a new layer, H, and the connections from H to the output layer are solved in one step using the Moore Penrose inverse to get the Linear least squares solution, or alternatively using Regularised least squares. The advantage of this closed-form approach is the fast training times compared with more general optimisation routines such as gradient descent.

The above linked paper details a series of comparative tests run over various configurations of RVFL networks over a number of different data sets. Some of the main conclusions drawn are:
  1. the direct links from input to output enhance network performance
  2. whether to include output bias or not is a data dependent tunable factor
  3. the radial basis function for the hidden units always leads to better performance
  4. regularised least squares (ridge regression) performs better than the Moore Penrose inverse
Based on the code provided by the study's authors I have written the following two objective functions for use with the BayesOpt Library.
## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## .

## -*- texinfo -*- 
## @deftypefn {} {@var{J} =} rvfl_training_of_cyclic_embedding_with_cv (@var{x})
##
## Function for Bayesian training of RVFL networks with fixed parameteres of:
##
##    direct links,
##    radial basis function activation,
##    ridge regression for regularized least squares,
##
## and optimisable parameters of:
##
##    number of neurons in hidden layer,
##    lambda for the least squares regression,
##    scaling of hidden layer inputs,
##    with or without an output bias.
##
## The input X is a vector of 6 values to be optimised by the BayesOpt library
## function 'bayesoptcont.'
## The output J is the Brier Score for the test fold cross validated data.
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-11-04

function J = rvfl_training_of_cyclic_embedding_with_cv ( x )
global sample_features ; global sample_targets ;
epsilon = 1e-15 ; ## to ensure log() does not give out a nan

Nfea = size( sample_features , 2 ) ;

## check input x
if ( numel( x ) != 6 )
  error( 'The input vector x must be of length 6.' ) ;
endif

## get the parameters from input x
hidden_layer_size = floor( x( 1 ) ) ;    ## number of neurons in hidden layer
randomisation_type = floor( x( 2 ) ) ;   ## 1 == uniform, 2 == Gaussian
scale_mode = floor( x( 3 ) ) ;           ## 1 will scale the features for all neurons, 2 will scale the features for each hidden
##                                          neuron separately, 3 will scale the range of the randomization for uniform distribution
scale = x( 4 ) ;                         ## Linearly scale the random features before feeding into the nonlinear activation function. 
##                                          In this implementation, we consider the threshold which leads to 0.99 of the maximum/minimum 
##                                          value of the activation function as the saturating threshold.
##                                          scale = 0.9 means all the random features will be linearly scaled
##                                          into 0.9 * [ lower_saturating_threshold , upper_saturating_threshold ].
if_output_bias = floor( x( 5 ) + 0.5 ) ; ## Use output bias, or not? 1 == yes , 0 == no.
lambda = x( 6 ) ;                        ## the regularization coefficient lambda 

length_jj_loop = 25 ;
all_brier_values = zeros( length_jj_loop , 1 ) ;  

##rand( 'seed' , 0 ) ;
##randn( 'seed' , 0 ) ;
##U_sample_targets = unique( sample_targets ) ;
##nclass = numel( U_sample_targets ) ;

##sample_targets_temp = zeros( numel( sample_targets ) , nclass ) ; 
##
#### get the 0 - 1 one hot coding for the target,  
##for i = 1 : nclass
##  idx = sample_targets == U_sample_targets( i ) ;
##  sample_targets_temp( idx , i ) = 1 ;
##endfor

###### information for splitting into training and test sets ###############
ix_positive_targets = find( sample_targets == 1 ) ;
ix_negative_targets = ( 1 : numel( sample_targets ) )' ; 
ix_negative_targets( ix_positive_targets ) = [] ;
## split 20/80
split_no1 = round( 0.2 * numel( ix_positive_targets ) ) ;
split_no2 = round( 0.2 * numel( ix_negative_targets ) ) ;

######### get type of randomisation from input x #################
if ( randomisation_type == 1 ) ## uniform randomisation 
  
  if ( scale_mode == 3 ) ## range scaled for uniform randomisation
     Weight = scale * ( rand( Nfea , hidden_layer_size ) * 2 - 1 ) ; ## scaled uniform random input weights to hidden layer
     Bias = scale * rand( 1 , hidden_layer_size ) ;                  ## scaled random bias weights to hidden layer
  else
     Weight = rand( Nfea , hidden_layer_size ) * 2 - 1 ; ## unscaled random input weights to hidden layer
     Bias = rand( 1 , hidden_layer_size ) ;              ## unscaled random bias weights to hidden layer
  endif
    
elseif ( randomisation_type == 2 ) ## gaussian randomisation
  Weight = randn( Nfea , hidden_layer_size ) ; ## gaussian random input weights to hidden layer
  Bias = randn( 1 , hidden_layer_size ) ;      ## gaussian random bias weights to hidden layer
else
  error( 'only Gaussian and Uniform are supported' )
endif
############################################################################

## Activation Function    
Saturating_threshold = [ -2.1 , 2.1 ] ;
Saturating_threshold_activate = [ 0 , 1 ] ;

for jj = 1 : length_jj_loop

## shuffle
randperm1 = randperm( numel( ix_positive_targets) ) ;
randperm2 = randperm( numel( ix_negative_targets) ) ;
test_ix1 = ix_positive_targets( randperm1( 1 : split_no1 ) ) ;
test_ix2 = ix_negative_targets( randperm2( 1 : split_no2 ) ) ;
test_ix = [ test_ix1 ; test_ix2 ] ;
train_ix1 = ix_positive_targets( randperm1( split_no1 + 1 : end ) ) ;
train_ix2 = ix_negative_targets( randperm2( split_no2 + 1 : end ) ) ;
train_ix = [ train_ix1 ; train_ix2 ] ;

sample_targets_train = sample_targets( train_ix ) ;
sample_features_train = sample_features( train_ix , : ) ;

Nsample = size( sample_features_train , 1 ) ;

Bias_train = repmat( Bias , Nsample , 1 ) ;
H = sample_features_train * Weight + Bias_train ;

if ( scale_mode == 1 )
  ## scale the features for all neurons 
  [ H , k , b ] = Scale_feature( H , Saturating_threshold , scale ) ;
elseif ( scale_mode == 2 ) 
  ## else scale the features for each hidden neuron separately
  [ H , k , b ] = Scale_feature_separately( H , Saturating_threshold , scale ) ;
endif
  
## actual activation, the radial basis function
H = exp( -abs( H ) ) ;

if ( if_output_bias == 1 )
  ## we will use an output bias
  H = [ H , ones( Nsample , 1 ) ] ; 
endif

## the direct link scaling options, concatenate hidden layer and sample_features_train 
if ( scale_mode == 1 )
  ## scale the features for all neurons
  sample_features_train = sample_features_train .* k + b ;
  H = [ H , sample_features_train ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  [ sample_features_train , ktr , btr ] = Scale_feature_separately( sample_features_train , Saturating_threshold_activate , scale ) ;
  H = [ H , sample_features_train ] ;
else
  H = [ H , sample_features_train ] ;
endif

H( isnan( H ) ) = 0 ; ## avoids any 'blowups' due to nans in H

## do the regularized least squares for concatenated hidden layer output
## and the original, possibly scaled, input sample_features
if ( hidden_layer_size < Nsample )
 beta = ( eye( size( H , 2 ) ) / lambda + H' * H ) \ H' * sample_targets_train ;
else
 beta = H' * ( ( eye( size( H , 1 ) ) / lambda + H * H' ) \ sample_targets_train ) ; 
endif

############# now the test on test data ####################################
Bias_test = repmat( Bias , numel( sample_targets( test_ix ) ) , 1 ) ;
H_test = sample_features( test_ix , : ) * Weight + Bias_test ;

if ( scale_mode == 1 )
  ## scale the features for all neurons
  H_test = H_test .* k + b ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  nSamtest = size( H_test , 1 ) ; 
  kt = repmat( k , nSamtest , 1 ) ;
  bt = repmat( b , nSamtest , 1 ) ;
  H_test = H_test .* kt + bt ;
endif

## actual activation, the radial basis function
H_test = exp( -abs( H_test ) ) ;

if ( if_output_bias == 1 )
  ## we will use an output bias
  H_test = [ H_test , ones( numel( sample_targets( test_ix ) ) , 1 ) ] ; 
endif

## the direct link scaling options, concatenate hidden layer and sample_features_train 
if ( scale_mode == 1 )
  ## scale the features for all neurons 
  testX_temp = sample_features( test_ix , : ) .* k + b ;
  H_test = [ H_test , testX_temp ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  nSamtest = size( H_test , 1 ) ; 
  kt = repmat( ktr , nSamtest , 1 ) ;
  bt = repmat( btr , nSamtest , 1 ) ;
  testX_temp = sample_features( test_ix , : ) .* kt + bt ;   
  H_test = [ H_test , testX_temp ] ; 
else
  H_test = [ H_test , sample_features( test_ix , : ) ] ; 
endif

H_test( isnan( H_test ) ) = 0 ; ## avoids any 'blowups' due to nans in H_test

## get the test predicted target output
test_targets = H_test * beta ;

##Y_temp = zeros( Nsample , 1 ) ;
##% decode the target output
##for i = 1 : Nsample
##    [ maxvalue , idx ] = max( sample_targets_temp( i , : ) ) ;
##    Y_temp( i ) = U_sample_targets( idx ) ;
##endfor

############################################################################

## the final logistic output
final_output = 1.0 ./ ( 1.0 .+ exp( -test_targets ) ) ;

## get the Brier_score
## https://en.wikipedia.org/wiki/Brier_score
all_brier_values( jj ) = mean( ( final_output .- sample_targets( test_ix ) ) .^ 2 ) ;

rand( 'state' ) ; randn( 'state' ) ; ## reset rng

endfor ## end of jj loop

J = mean( all_brier_values ) ;

endfunction

## Various measures of goodness
## https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models
## https://www.fharrell.com/post/classification/
## https://stats.stackexchange.com/questions/433628/what-is-a-reliable-measure-of-accuracy-for-logistic-regression
## https://www.jstatsoft.org/article/view/v090i12
## https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
## https://stats.stackexchange.com/questions/319666/aic-with-test-data-is-it-possible
## https://www.learningmachines101.com/lm101-076-how-to-choose-the-best-model-using-aic-or-gaic/
## https://stackoverflow.com/questions/48185090/how-to-get-the-log-likelihood-for-a-logistic-regression-model-in-sklearn
## https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
## https://stats.stackexchange.com/questions/163221/whats-the-measure-to-assess-the-binary-classification-accuracy-for-imbalanced-d
## https://stats.stackexchange.com/questions/168929/logistic-regression-is-predicting-all-1-and-no-0
## https://stats.stackexchange.com/questions/435307/multiple-linear-regression-lse-when-one-of-parameter-is-known

These functions stand on the shoulders of the above and hard code direct links, radial basis function activation and ridge regression, with the number of neurons in the hidden layer, lambda for the ridge regression, different scaling options and inclusion of output bias or not as the optimisable parameters. The function minimisation objective is the Brier score.

This second function is slightly different in that the Akaike information criterion is the minimisation objective and there is the option to use the Netlab Generalised linear model function to solve for the hidden to output weights (comment out the relevant code as necessary.)
## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## .

## -*- texinfo -*- 
## @deftypefn {} {@var{J} =} rvfl_training_of_cyclic_embedding (@var{x})
##
## Function for Bayesian training of RVFL networks with fixed parameteres of:
##
##    direct links,
##    radial basis function activation,
##    ridge regression for regularized least squares,
##
## and optimisable parameters of:
##
##    number of neurons in hidden layer,
##    lambda for the least squares regression,
##    scaling of hidden layer inputs,
##    with or without an output bias.
##
## The input X is a vector of 6 values to be optimised by the BayesOpt library
## function 'bayesoptcont.'
## The output J is the AIC value for the tested model.
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-11-04

function J = rvfl_training_of_cyclic_embedding ( x )
global sample_features ; global sample_targets ;
epsilon = 1e-15 ; ## to ensure log() does not give out a nan

## check input x
if ( numel( x ) != 4 )
  error( 'The input vector x must be of length 6.' ) ;
endif

## get the parameters from input x
hidden_layer_size = floor( x( 1 ) ) ;    ## number of neurons in hidden layer
randomisation_type = floor( x( 2 ) ) ;   ## 1 == uniform, 2 == Gaussian
scale_mode = floor( x( 3 ) ) ;           ## 1 will scale the features for all neurons, 2 will scale the features for each hidden
##                                          neuron separately, 3 will scale the range of the randomization for uniform distribution
scale = x( 4 ) ;                         ## Linearly scale the random features before feeding into the nonlinear activation function. 
##                                          In this implementation, we consider the threshold which leads to 0.99 of the maximum/minimum 
##                                          value of the activation function as the saturating threshold.
##                                          scale = 0.9 means all the random features will be linearly scaled
##                                          into 0.9 * [ lower_saturating_threshold , upper_saturating_threshold ].
##if_output_bias = floor( x( 5 ) + 0.5 ) ; ## Use output bias, or not? 1 == yes , 0 == no.
##lambda = x( 6 ) ;                        ## the regularization coefficient lambda 

##length_jj_loop = 25 ;
##all_aic_values = zeros( length_jj_loop , 1 ) ;  

rand( 'seed' , 0 ) ;
randn( 'seed' , 0 ) ;
##U_sample_targets = unique( sample_targets ) ;
##nclass = numel( U_sample_targets ) ;

##sample_targets_temp = zeros( numel( sample_targets ) , nclass ) ; 
##
#### get the 0 - 1 one hot coding for the target,  
##for i = 1 : nclass
##  idx = sample_targets == U_sample_targets( i ) ;
##  sample_targets_temp( idx , i ) = 1 ;
##endfor

sample_targets_temp = sample_targets ;

[ Nsample , Nfea ] = size( sample_features ) ;

######### get type of randomisation from input x #################
if ( randomisation_type == 1 ) ## uniform randomisation 
  
  if ( scale_mode == 3 ) ## range scaled for uniform randomisation
     Weight = scale * ( rand( Nfea , hidden_layer_size ) * 2 - 1 ) ; ## scaled uniform random input weights to hidden layer
     Bias = scale * rand( 1 , hidden_layer_size ) ;                  ## scaled random bias weights to hidden layer
  else
     Weight = rand( Nfea , hidden_layer_size ) * 2 - 1 ; ## unscaled random input weights to hidden layer
     Bias = rand( 1 , hidden_layer_size ) ;              ## unscaled random bias weights to hidden layer
  endif
    
elseif ( randomisation_type == 2 ) ## gaussian randomisation
  Weight = randn( Nfea , hidden_layer_size ) ; ## gaussian random input weights to hidden layer
  Bias = randn( 1 , hidden_layer_size ) ;      ## gaussian random bias weights to hidden layer
else
  error( 'only Gaussian and Uniform are supported' )
endif
############################################################################

Bias_train = repmat( Bias , Nsample , 1 ) ;
H = sample_features * Weight + Bias_train ;

k_parameters = numel( Weight ) + numel( Bias_train ) ;

## Activation Function    
Saturating_threshold = [ -2.1 , 2.1 ] ;
Saturating_threshold_activate = [ 0 , 1 ] ;

if ( scale_mode == 1 )
  ## scale the features for all neurons 
  [ H , k , b ] = Scale_feature( H , Saturating_threshold , scale ) ;
elseif ( scale_mode == 2 ) 
  ## else scale the features for each hidden neuron separately
  [ H , k , b ] = Scale_feature_separately( H , Saturating_threshold , scale ) ;
endif
  
## actual activation, the radial basis function
H = exp( -abs( H ) ) ;

## glm training always applies a bias, so comment out if training with netlab glm
##if ( if_output_bias == 1 )
##  ## we will use an output bias
##  H = [ H , ones( Nsample , 1 ) ] ; 
##endif

## the direct link scaling options, concatenate hidden layer and sample_features 
if ( scale_mode == 1 )
  ## scale the features for all neurons
  sample_features_temp = sample_features .* k + b ;
  H = [ H , sample_features_temp ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  [ sample_features_temp , ktr , btr ] = Scale_feature_separately( sample_features , Saturating_threshold_activate , scale ) ;
  H = [ H , sample_features_temp ] ;
else
  H = [ H , sample_features ] ;
endif

H( isnan( H ) ) = 0 ; ## avoids any 'blowups' due to nans in H

############ THE ORIGINAL REGULARISED LEAST SQUARES CODE ###################
## do the regularized least squares for concatenated hidden layer output            
## and the original, possibly scaled, input sample_features                         
##if ( hidden_layer_size < Nsample )                                                  
## beta = ( eye( size( H , 2 ) ) / lambda + H' * H ) \ H' * sample_targets_temp ;     
##else                                                                                
## beta = H' * ( ( eye( size( H , 1 ) ) / lambda + H * H' ) \ sample_targets_temp ) ; 
##endif                                                                               
############################################################################

##k_parameters = k_parameters + numel( beta ) ;

## get the model predicted target output
##sample_targets_temp = H * beta ;

## the final logistic output
##final_output = 1.0 ./ ( 1.0 .+ exp( -sample_targets_temp ) ) ;

############ REPLACED BY GLM TRAINING USING NETLAB #########################
net = glm( size( H , 2 ) , 1 , 'logistic' ) ; ## Create a generalized linear model structure.
options = foptions ; ## Set default parameters for optimisation routines, for compatibility with MATLAB's foptions()
options( 1 ) = -1 ;   ## change default value
## OPTIONS(1) is set to 1 to display error values during training. If
## OPTIONS(1) is set to 0, then only warning messages are displayed.  If
## OPTIONS(1) is -1, then nothing is displayed.
options( 14 ) = 5 ; ## change default value
## OPTIONS(14) is the maximum number of iterations for the IRLS
## algorithm;  default 100.
net = glmtrain( net , options , H , sample_targets ) ;

k_parameters = k_parameters + net.nwts ;

## get output of trained glm model
final_output = glmfwd( net , H ) ;
############################################################################

##Y_temp = zeros( Nsample , 1 ) ;
##% decode the target output
##for i = 1 : Nsample
##    [ maxvalue , idx ] = max( sample_targets_temp( i , : ) ) ;
##    Y_temp( i ) = U_sample_targets( idx ) ;
##endfor

############################################################################

##  https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/
##
##  likelihood = yhat * y + (1 – yhat) * (1 – y)
##
##  We can update the likelihood function using the log to transform it into a log-likelihood function:
##
##      log-likelihood = log(yhat) * y + log(1 – yhat) * (1 – y)

##  Finally, we can sum the likelihood function across all examples in the dataset to maximize the likelihood:
##
##      maximize sum i to n log(yhat_i) * y_i + log(1 – yhat_i) * (1 – y_i)

log_likelihood = sum( log( final_output .+ epsilon ) .* sample_targets + log( 1 .- final_output .+ epsilon ) .* ( 1 .- sample_targets ) ) ;

## get Akaike Information criteria
J = 2 * k_parameters - 2 * log_likelihood ;

## get the Brier_score
## https://en.wikipedia.org/wiki/Brier_score
##J = mean( ( final_output .- sample_targets_temp ) .^ 2 ) ;

rand( 'state' ) ; randn( 'state' ) ; ## reset rng

endfunction

## Various measures of goodness
## https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models
## https://www.fharrell.com/post/classification/
## https://stats.stackexchange.com/questions/433628/what-is-a-reliable-measure-of-accuracy-for-logistic-regression
## https://www.jstatsoft.org/article/view/v090i12
## https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
## https://stats.stackexchange.com/questions/319666/aic-with-test-data-is-it-possible
## https://www.learningmachines101.com/lm101-076-how-to-choose-the-best-model-using-aic-or-gaic/
## https://stackoverflow.com/questions/48185090/how-to-get-the-log-likelihood-for-a-logistic-regression-model-in-sklearn
## https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
## https://stats.stackexchange.com/questions/163221/whats-the-measure-to-assess-the-binary-classification-accuracy-for-imbalanced-d
## https://stats.stackexchange.com/questions/168929/logistic-regression-is-predicting-all-1-and-no-0
## https://stats.stackexchange.com/questions/435307/multiple-linear-regression-lse-when-one-of-parameter-is-known
Both of these functions are working code and are heavily commented and perhaps not very polished.

As I write this post I have various tests running in the background and will report on the results in due course.

Friday, 1 November 2019

Preliminary Results from Weight Agnostic Training

Following on from my last post, below is a selection of the typical resultant output from the Bayesopt Library minimisation
    3    3    2    2    2    8   99   22   30    1
    3    3    2    3    2   39    9   25   25    1
    2    2    3    2    2   60   43   83   54    3
    2    1    2    2    2    2    0   90   96   43
    3    2    3    2    2    2    2   43   33    1
    2    3    2    3    2    2    0   62   98   21
    2    2    2    2    2   18   43   49    2    2
    2    3    2    4    1    2    0   23    0    0
    2    2    1    2    3    2    0   24   63   65
    3    2    2    2    3    5   92   49    1    0
    2    3    2    1    1    7   84   22   17    1
    3    2    4    1    1   46    1    0   99    7
    2    2    3    2    2    2    0   74   82   50
    3    3    2    2    2   45   14   81   23    2
    2    3    3    2    2    2    0   99   79    4
    2    2    2    2    2    2    0    0   68    0
    3    3    3    2    2   67   17   37   84    1
    3    2    3    2    2   24   39   56   55    1
    3    3    4    3    2    2   30   62   67    1
    2    2    2    2    2    2    1    0    4    0
    2    2    2    2    2    2    9    8   45    1
    2    3    3    2    2   48    1   18   28    1
    2    3    3    2    2    2    0   34   42   18
    2    2    2    3    2    2    0   70   81   10
    2    2    3    3    2    2    0   85   23   11
where the rows are separate run's results, the first five columns show the type of activation function per layer and the last five columns show the number of neurons per layer ( see function code in my last post for details. )

Some quick take aways from this are:
  1. the sigmoid activation function ( bounded 0 to 1 ) is not favoured, with either the Tanh or "Lecun" sigmoid ( see section 4.4 of this paper ) being preferred
  2. 40% of the network architectures are single hidden layer with just 2 neurons in the hidden layer 
  3. 8% have only two hidden layers with the second hidden layer having just one neuron
I would say these preliminary results suggest that a deep architecture is not necessary for the features/targets being tested and obviously the standard sigmoid/logistic function should be avoided.

As is my wont whilst waiting for lengthy computer tests to complete I have also been browsing online, motivated by the early results of the above, and discovered Random Vector Functional Link Networks, which seem to be a precursor to Extreme Learning Machines. However, there appears to be some controversy about whether or not Extreme Learning Machines are just plagiarism of earlier ideas, such as RVFL networks. 

Readers may remember that I have used ELMs before with the Bayesopt library ( see my post here ) and now that the above results point towards the use of shallow networks I intend to replicate the above, but using RVFL networks. This will be the subject of my next post.   

Wednesday, 30 October 2019

Weight Agnostic Neural Net Training

I have recently come across the idea of weight agnostic neural net training and have implemented a crude version of this combined with the recent work I have been doing on Taken's Theorem ( see my posts here, here and here ) and using the statistical mechanics approach to creating synthetic data.

Using the simple Octave function below with the Akaike Information Criterion as the minimisation objective
## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## .

## -*- texinfo -*- 
## @deftypefn {} {@var{J} =} wann_training_of_cyclic_embedding()
##
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-10-26

function J = wann_training_of_cyclic_embedding( x )
global sample_features ; global sample_targets ;
epsilon = 1e-15 ; ## to ensure log() does not give out a nan

## get the parameters from input x
activation_funcs = floor( x( 1 : 5 ) ) ; ## get the activations, 1 == sigmoid, 2 == tanh, 3 == Lecun sigmoid
layer_size = floor( x( 6 : 10 ) ) ;

[ min_layer_size , ix_min ] = min( layer_size ) ;
  if( min_layer_size > 0 ) ## to be expected most of the time
    nn_depth = length( layer_size ) ;
  elseif( min_layer_size == 0 ) ## one layer has no nodes, hence limits depth of nn
    nn_depth = ix_min - 1 ;
  endif 

length_jj_loop = 25 ;
all_aic_values = zeros( length_jj_loop , 1 ) ;  
  
for jj = 1 : length_jj_loop 
  
  previous_layer_out = sample_features ;
  sum_of_k = 0 ;
    
  for ii = 1 : nn_depth
    
    new_weight_matrix = ones( size( previous_layer_out , 2 ) , layer_size( ii ) ) ./ sqrt( size( previous_layer_out , 2 ) ) ;
    sum_of_k = sum_of_k + numel( new_weight_matrix ) ;
    prior_to_activation_input = previous_layer_out * new_weight_matrix ;

    ## select the activation function 
    if( activation_funcs( ii ) == 1 ) ## sigmoid activation
      previous_layer_out = 1.0 ./ ( 1.0 .+ exp( -prior_to_activation_input ) ) ;
    elseif( activation_funcs( ii ) == 2 ) ## tanh activation
      previous_layer_out = tanh( prior_to_activation_input ) ;
    elseif( activation_funcs( ii ) == 3 ) ## lecun sigmoid activation
      previous_layer_out = sigmoid_lecun_m( prior_to_activation_input ) ;
    endif 
    
  endfor

  ## the final logistic output
  new_weight_matrix = ones( size( previous_layer_out , 2 ) , 1 ) ./ sqrt( size( previous_layer_out , 2 ) ) ;
  sum_of_k = sum_of_k + numel( new_weight_matrix ) ;
  final_output = previous_layer_out * new_weight_matrix ;
  final_output = 1.0 ./ ( 1.0 .+ exp( -final_output ) ) ;

  max_likelihood = sum( log( final_output .+ epsilon ) .* sample_targets + log( 1 .- final_output .+ epsilon ) .* ( 1 .- sample_targets ) ) ;
  
  ## get Akaike Information criteria
  all_aic_values( jj ) = 2 * sum_of_k - 2 * max_likelihood ;

endfor ## end of jj loop

J = mean( all_aic_values ) ;

endfunction
and the Octave interface of the Bayesopt Library I am currently iterating over different architectures ( up to 5 hidden layers deep with a max of 100 nodes per layer and using a choice of 3 hidden activations ) for a simple Logistic Regression model to predict turning points in different sets of statistical mechanics synthetic data using just features based on Taken's embedding.

More in due course.

Thursday, 20 April 2017

Using the BayesOpt Library to Optimise my Planned Neural Net

Following on from my last post, I have recently been using the BayesOpt library to optimise my planned neural net, and this post is a brief outline, with code, of what I have been doing.

My intent was to design a Nonlinear autoregressive exogenous model using my currency strength indicator as the main exogenous input, along with other features derived from the use of Savitzky-Golay filter convolution to model velocity, acceleration etc. I decided that rather than model prices directly, I would model the 20 period simple moving average because it would seem reasonable to assume that modelling a smooth function would be easier, and from this average it is a trivial matter to reverse engineer to get to the underlying price.

Given that my projected feature space/lookback length/number of nodes combination is/was a triple digit, discrete dimensional problem, I used the "bayesoptdisc" function from the BayesOpt library to perform a discrete Bayesian optimisation over these parameters, the main Octave script for this being shown below.
clear all ;

% load the data
% load eurusd_daily_bin_bars ;
% load gbpusd_daily_bin_bars ;
% load usdchf_daily_bin_bars ;
load usdjpy_daily_bin_bars ;
load all_rel_strengths_non_smooth ;
% all_rel_strengths_non_smooth = [ usd_rel_strength_non_smooth eur_rel_strength_non_smooth gbp_rel_strength_non_smooth chf_rel_strength_non_smooth ...
% jpy_rel_strength_non_smooth aud_rel_strength_non_smooth cad_rel_strength_non_smooth ] ;

% extract relevant data
% price = ( eurusd_daily_bars( : , 3 ) .+ eurusd_daily_bars( : , 4 ) ) ./ 2 ; % midprice 
% price = ( gbpusd_daily_bars( : , 3 ) .+ gbpusd_daily_bars( : , 4 ) ) ./ 2 ; % midprice
% price = ( usdchf_daily_bars( : , 3 ) .+ usdchf_daily_bars( : , 4 ) ) ./ 2 ; % midprice  
price = ( usdjpy_daily_bars( : , 3 ) .+ usdjpy_daily_bars( : , 4 ) ) ./ 2 ; % midprice
base_strength = all_rel_strengths_non_smooth( : , 1 ) .- 0.5 ;
term_strength = all_rel_strengths_non_smooth( : , 5 ) .- 0.5 ; 

% clear unwanted data
% clear eurusd_daily_bars all_rel_strengths_non_smooth ;
% clear gbpusd_daily_bars all_rel_strengths_non_smooth ;
% clear usdchf_daily_bars all_rel_strengths_non_smooth ;
clear usdjpy_daily_bars all_rel_strengths_non_smooth ;

global start_opt_line_no = 200 ;
global stop_opt_line_no = 7545 ;

% get matrix coeffs
slope_coeffs = generalised_sgolay_filter_coeffs( 5 , 2 , 1 ) ;
accel_coeffs = generalised_sgolay_filter_coeffs( 5 , 2 , 2 ) ;
jerk_coeffs = generalised_sgolay_filter_coeffs( 5 , 3 , 3 ) ;

% create features
sma20 = sma( price , 20 ) ;
global targets = sma20 ;
[ sma_max , sma_min ] = adjustable_lookback_max_min( sma20 , 20 ) ;
global sma20r = zeros( size(sma20,1) , 5 ) ;
global sma20slope = zeros( size(sma20,1) , 5 ) ;
global sma20accel = zeros( size(sma20,1) , 5 ) ;
global sma20jerk = zeros( size(sma20,1) , 5 ) ;

global sma20diffs = zeros( size(sma20,1) , 5 ) ;
global sma20diffslope = zeros( size(sma20,1) , 5 ) ;
global sma20diffaccel = zeros( size(sma20,1) , 5 ) ;
global sma20diffjerk = zeros( size(sma20,1) , 5 ) ;

global base_strength_f = zeros( size(sma20,1) , 5 ) ;
global term_strength_f = zeros( size(sma20,1) , 5 ) ;

base_term_osc = base_strength .- term_strength ;
global base_term_osc_f = zeros( size(sma20,1) , 5 ) ;
slope_bt_osc = rolling_endpoint_gen_poly_output( base_term_osc , 5 , 2 , 1 ) ; % no_of_points(p),filter_order(n),derivative(s)
global slope_bt_osc_f = zeros( size(sma20,1) , 5 ) ;
accel_bt_osc = rolling_endpoint_gen_poly_output( base_term_osc , 5 , 2 , 2 ) ; % no_of_points(p),filter_order(n),derivative(s)
global accel_bt_osc_f = zeros( size(sma20,1) , 5 ) ;
jerk_bt_osc = rolling_endpoint_gen_poly_output( base_term_osc , 5 , 3 , 3 ) ; % no_of_points(p),filter_order(n),derivative(s)
global jerk_bt_osc_f = zeros( size(sma20,1) , 5 ) ;

slope_base_strength = rolling_endpoint_gen_poly_output( base_strength , 5 , 2 , 1 ) ; % no_of_points(p),filter_order(n),derivative(s)
global slope_base_strength_f = zeros( size(sma20,1) , 5 ) ;
accel_base_strength = rolling_endpoint_gen_poly_output( base_strength , 5 , 2 , 2 ) ; % no_of_points(p),filter_order(n),derivative(s)
global accel_base_strength_f = zeros( size(sma20,1) , 5 ) ;
jerk_base_strength = rolling_endpoint_gen_poly_output( base_strength , 5 , 3 , 3 ) ; % no_of_points(p),filter_order(n),derivative(s)
global jerk_base_strength_f = zeros( size(sma20,1) , 5 ) ;

slope_term_strength = rolling_endpoint_gen_poly_output( term_strength , 5 , 2 , 1 ) ; % no_of_points(p),filter_order(n),derivative(s)
global slope_term_strength_f = zeros( size(sma20,1) , 5 ) ;
accel_term_strength = rolling_endpoint_gen_poly_output( term_strength , 5 , 2 , 2 ) ; % no_of_points(p),filter_order(n),derivative(s)
global accel_term_strength_f = zeros( size(sma20,1) , 5 ) ;
jerk_term_strength = rolling_endpoint_gen_poly_output( term_strength , 5 , 3 , 3 ) ; % no_of_points(p),filter_order(n),derivative(s)
global jerk_term_strength_f = zeros( size(sma20,1) , 5 ) ;

min_max_range = sma_max .- sma_min ;

for ii = 51 : size( sma20 , 1 ) - 1 % one step ahead is target
  
  targets(ii) = 2 * ( ( sma20(ii+1) - sma_min(ii) ) / min_max_range(ii) - 0.5 ) ;
  
  % scaled sma20
  sma20r(ii,:) = 2 .* ( ( flipud( sma20(ii-4:ii,1) )' .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) ;
  sma20slope(ii,:) = fliplr( ( 2 .* ( ( sma20(ii-4:ii,1)' .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) ) * slope_coeffs ) ;
  sma20accel(ii,:) = fliplr( ( 2 .* ( ( sma20(ii-4:ii,1)' .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) ) * accel_coeffs ) ;
  sma20jerk(ii,:) = fliplr( ( 2 .* ( ( sma20(ii-4:ii,1)' .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) ) * jerk_coeffs ) ;
  
  % scaled diffs of sma20
  sma20diffs(ii,:) = fliplr( diff( 2.* ( ( sma20(ii-5:ii,1) .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) )' ) ;
  sma20diffslope(ii,:) = fliplr( diff( 2.* ( ( sma20(ii-5:ii,1) .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) )' * slope_coeffs ) ;
  sma20diffaccel(ii,:) = fliplr( diff( 2.* ( ( sma20(ii-5:ii,1) .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) )' * accel_coeffs ) ;
  sma20diffjerk(ii,:) = fliplr( diff( 2.* ( ( sma20(ii-5:ii,1) .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) )' * jerk_coeffs ) ;
  
  % base strength
  base_strength_f(ii,:) = fliplr( base_strength(ii-4:ii)' ) ;
  slope_base_strength_f(ii,:) = fliplr( slope_base_strength(ii-4:ii)' ) ;
  accel_base_strength_f(ii,:) = fliplr( accel_base_strength(ii-4:ii)' ) ;
  jerk_base_strength_f(ii,:) = fliplr( jerk_base_strength(ii-4:ii)' ) ;
  
  % term strength
  term_strength_f(ii,:) = fliplr( term_strength(ii-4:ii)' ) ;
  slope_term_strength_f(ii,:) = fliplr( slope_term_strength(ii-4:ii)' ) ;
  accel_term_strength_f(ii,:) = fliplr( accel_term_strength(ii-4:ii)' ) ;
  jerk_term_strength_f(ii,:) = fliplr( jerk_term_strength(ii-4:ii)' ) ;
  
  % base term oscillator
  base_term_osc_f(ii,:) = fliplr( base_term_osc(ii-4:ii)' ) ;
  slope_bt_osc_f(ii,:) = fliplr( slope_bt_osc(ii-4:ii)' ) ;
  accel_bt_osc_f(ii,:) = fliplr( accel_bt_osc(ii-4:ii)' ) ;
  jerk_bt_osc_f(ii,:) = fliplr( jerk_bt_osc(ii-4:ii)' ) ;
  
endfor

% create xset for bayes routine
% raw indicator
xset = zeros( 4 , 5 ) ; xset( 1 , : ) = 1 : 5 ;
% add the slopes
to_add = zeros( 4 , 15 ) ; 
to_add( 1 , : ) = [ 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 ] ; 
to_add( 2 , : ) = [ 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5 ] ; 
xset = [ xset to_add ] ; 
% add accels
to_add = zeros( 4 , 21 ) ; 
to_add( 1 , : ) = [ 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 ] ; 
to_add( 2 , : ) = [ 1 1 2 2 1 2 2 3 3 3 1 2 3 4 2 3 3 4 4 4 4 ] ; 
to_add( 3 , : ) = [ 1 1 1 2 1 1 2 1 2 3 1 1 1 1 2 2 3 1 2 3 4 ] ;
xset = [ xset to_add ] ; 
% add jerks
to_add = zeros( 4 , 70 ) ; 
to_add( 1 , : ) = [ 1 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ] ; 
to_add( 2 , : ) = [ 1 1 2 2 2 1 2 2 2 3 3 3 3 3 3 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ] ; 
to_add( 3 , : ) = [ 1 1 1 2 2 1 1 2 2 1 2 2 3 3 3 1 1 2 2 1 2 2 3 3 3 1 2 2 3 3 3 4 4 4 4 1 1 2 2 1 2 2 3 3 3 1 2 2 3 3 3 4 4 4 4 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 ] ;
to_add( 4 , : ) = [ 1 1 1 1 2 1 1 1 2 1 1 2 1 2 3 1 1 1 2 1 1 2 1 2 3 1 1 2 1 2 3 1 2 3 4 1 1 1 2 1 1 2 1 2 3 1 1 2 1 2 3 1 2 3 4 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5 ] ;
xset = [ xset to_add ] ;
% construct all_xset for combinations of indicators and look back lengths
all_zeros = zeros( size( xset ) ) ;
all_xset = [ xset ; repmat( all_zeros , 3 , 1 ) ] ;
all_xset = [ all_xset [ xset ; xset ; all_zeros ; all_zeros ] ] ;
all_xset = [ all_xset [ xset ; all_zeros ; xset ; all_zeros ] ] ;
all_xset = [ all_xset [ xset ; all_zeros ; all_zeros ; xset ] ] ;
all_xset = [ all_xset [ xset ; xset ; xset ; all_zeros ] ] ;
all_xset = [ all_xset [ xset ; xset ; all_zeros ; xset ] ] ;
all_xset = [ all_xset [ xset ; all_zeros ; xset ; xset ] ] ;
all_xset = [ all_xset repmat( xset , 4 , 1 ) ] ;

ones_all_xset = ones( 1 , size( all_xset , 2 ) ) ;

% now add layer for number of neurons and extend as necessary
max_number_of_neurons_in_layer = 20 ;

parameter_matrix = [] ;

for ii = 2 : max_number_of_neurons_in_layer % min no. of neurons is 2, max = max_number_of_neurons_in_layer  
  parameter_matrix = [ parameter_matrix [ ii .* ones_all_xset ; all_xset ] ] ;
endfor

% now the actual bayes optimisation routine
% set the parameters
params.n_iterations = 190; % bayesopt library default is 190
params.n_init_samples = 10;
params.crit_name = 'cEIa'; % cEI is default. cEIa is an annealed version
params.surr_name = 'sStudentTProcessNIG';
params.noise = 1e-6;
params.kernel_name = 'kMaternARD5';
params.kernel_hp_mean = [1];
params.kernel_hp_std = [10];
params.verbose_level = 1; % 3 to path below
params.log_filename = '/home/dekalog/Documents/octave/cplusplus.oct_functions/nn_functions/optimise_narx_ind_lookback_nodes_log';
params.l_type = 'L_MCMC' ; % L_EMPIRICAL is default
params.epsilon = 0.5 ; % probability of performing a random (blind) evaluation of the target function. 
% Higher values implies forced exploration while lower values relies more on the exploration/exploitation policy of the criterion. 0 is default

% the function to optimise
fun = 'optimise_narx_ind_lookback_nodes_rolling' ;

% the call to the Bayesopt library function
bayesoptdisc( fun , parameter_matrix , params ) ; 
% result is the minimum as a vector (x_out) and the value of the function at the minimum (y_out)
What this script basically does is:
  1. load all the relevant data ( in this case a forex pair )
  2. creates a set of scaled features
  3. creates a necessary parameter matrix for the discrete optimisation function
  4. sets the parameters for the optimisation routine
  5. and finally calls the "bayesoptdisc" function
Note that in step 2 all the features are declared as global variables, this being necessary because the "bayesoptdisc" function of the BayesOpt library does not appear to admit passing these variables as inputs to the function.

The actual function to be optimised is given in the following code box, and is basically a looped neural net training routine.
## Copyright (C) 2017 dekalog
## 
## This program is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see .

## -*- texinfo -*- 
## @deftypefn {} {@var{retval} =} optimise_narx_ind_lookback_nodes_rolling (@var{input1})
##
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2017-03-21

function [ retval ] = optimise_narx_ind_lookback_nodes_rolling( input1 )
 
% declare all the global variables so the function can "see" them
global start_opt_line_no ;
global stop_opt_line_no ; 
global targets ;
global sma20r ; % 2
global sma20slope ;
global sma20accel ;
global sma20jerk ;
global sma20diffs ;
global sma20diffslope ;
global sma20diffaccel ;
global sma20diffjerk ;
global base_strength_f ;
global slope_base_strength_f ;
global accel_base_strength_f ;
global jerk_base_strength_f ;
global term_strength_f ;
global slope_term_strength_f ;
global accel_term_strength_f ;
global jerk_term_strength_f ;
global base_term_osc_f ;
global slope_bt_osc_f ;
global accel_bt_osc_f ;
global jerk_bt_osc_f ;

% build feature matrix from the above global variable according to parameters in input1
hidden_layer_size = input1(1) ;

% training targets
Y = targets( start_opt_line_no:stop_opt_line_no , 1 ) ;

% create empty feature matrix
X = [] ;

% which will always have at least one element of the main price series for the NARX
X = [ X sma20r( start_opt_line_no:stop_opt_line_no , 1:input1(2) ) ] ;

% go through input1 values in turn and add to X if necessary
if input1(3) > 0
X = [ X sma20slope( start_opt_line_no:stop_opt_line_no , 1:input1(3) ) ] ; 
endif

if input1(4) > 0
X = [ X sma20accel( start_opt_line_no:stop_opt_line_no , 1:input1(4) ) ] ; 
endif

if input1(5) > 0
X = [ X sma20jerk( start_opt_line_no:stop_opt_line_no , 1:input1(5) ) ] ; 
endif

if input1(6) > 0
X = [ X sma20diffs( start_opt_line_no:stop_opt_line_no , 1:input1(6) ) ] ; 
endif

if input1(7) > 0
X = [ X sma20diffslope( start_opt_line_no:stop_opt_line_no , 1:input1(7) ) ] ; 
endif

if input1(8) > 0
X = [ X sma20diffaccel( start_opt_line_no:stop_opt_line_no , 1:input1(8) ) ] ; 
endif

if input1(9) > 0
X = [ X sma20diffjerk( start_opt_line_no:stop_opt_line_no , 1:input1(9) ) ] ; 
endif

if input1(10) > 0 % input for base and term strengths together
X = [ X base_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(10) ) ] ; 
X = [ X term_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(10) ) ] ; 
endif

if input1(11) > 0
X = [ X slope_base_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(11) ) ] ; 
X = [ X slope_term_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(11) ) ] ; 
endif

if input1(12) > 0
X = [ X accel_base_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(12) ) ] ; 
X = [ X accel_term_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(12) ) ] ; 
endif

if input1(13) > 0
X = [ X jerk_base_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(13) ) ] ; 
X = [ X jerk_term_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(13) ) ] ; 
endif

if input1(14) > 0
X = [ X base_term_osc_f( start_opt_line_no:stop_opt_line_no , 1:input1(14) ) ] ; 
endif

if input1(15) > 0
X = [ X slope_bt_osc_f( start_opt_line_no:stop_opt_line_no , 1:input1(15) ) ] ; 
endif

if input1(16) > 0
X = [ X accel_bt_osc_f( start_opt_line_no:stop_opt_line_no , 1:input1(16) ) ] ; 
endif

if input1(17) > 0
X = [ X jerk_bt_osc_f( start_opt_line_no:stop_opt_line_no , 1:input1(17) ) ] ; 
endif

% now the X features matrix has been formed, get its size
X_rows = size( X , 1 ) ; X_cols = size( X , 2 ) ;

X = [ ones( X_rows , 1 ) X ] ; % add bias unit to X

fan_in = X_cols + 1 ; % no. of inputs to a node/unit, including bias
fan_out = 1 ; % no. of outputs from node/unit
r = sqrt( 6 / ( fan_in + fan_out ) ) ; 

rolling_window_length = 100 ;
n_iters = 100 ;
n_iter_errors = zeros( n_iters , 1 ) ;
all_errors = zeros( X_rows - ( rolling_window_length - 1 ) - 1 , 1 ) ;
rolling_window_loop_iter = 0 ;

for rolling_window_loop = rolling_window_length : X_rows - 1 
 
  rolling_window_loop_iter = rolling_window_loop_iter + 1 ;
  
    % train n_iters no. of nets and put the error stats in n_iter_errors
    for ii = 1 : n_iters
      
      % initialise weights
      % see https://stats.stackexchange.com/questions/47590/what-are-good-initial-weights-in-a-neural-network
      
      % One option is Orthogonal random matrix initialization for input_to_hidden weights
      % w_i = rand( X_cols + 1 , hidden_layer_size ) ;
      % [ u , s , v ] = svd( w_i ) ;
      % input_to_hidden = [ ones( X_rows , 1 ) X ] * u ; % adding bias unit to X
      
      % using fan_in and fan_out for tanh
      w_i = ( rand( X_cols + 1 , hidden_layer_size ) .* ( 2 * r ) ) .- r ;
      input_to_hidden = X( rolling_window_loop - ( rolling_window_length - 1 ) : rolling_window_loop , : ) * w_i ; 
      
      % push the input_to_hidden through the chosen sigmoid function
      hidden_layer_output = sigmoid_lecun_m( input_to_hidden ) ;
      
      % add bias unit for the output from hidden
      hidden_layer_output = [ ones( rolling_window_length , 1 ) hidden_layer_output ] ;
      
      % use hidden_layer_output as the input to a linear regression fit to targets Y
      % a la Extreme Learning Machine
      % w = ( inv( X' * X ) * X' ) * y ; the "classic" way for linear regression, where
      % X = hidden_layer_output, but
      w = ( ( hidden_layer_output' * hidden_layer_output ) \ hidden_layer_output' ) * Y( rolling_window_loop - ( rolling_window_length - 1 ) : rolling_window_loop , 1 ) ;
      % is quicker and recommended
      
      % use these current values of w_i and w for out of sample test
      os_input_to_hidden = X( rolling_window_loop + 1 , : ) * w_i ;
      os_hidden_layer_output = sigmoid_lecun_m( os_input_to_hidden ) ;
      os_hidden_layer_output = [ 1 os_hidden_layer_output ] ; % add bias
      os_output = os_hidden_layer_output * w ;
      n_iter_errors( n_iters ) = abs( Y( rolling_window_loop + 1 , 1 ) - os_output ) ;
      
    endfor
    
    all_errors( rolling_window_loop_iter ) = mean( n_iter_errors ) ;
    
endfor % rolling_window_loop

retval = mean( all_errors ) ;

clear X w_i ; 

endfunction
However, to speed things up for some rapid prototyping, rather than use backpropagation training this function uses the principles of an extreme learning machine and loops over 100 such trained ELMs per set of features contained in a rolling window of length 100 across the entire training data set. Walk forward cross validation is performed for each of the 100 ELMs, an average of the out of sample error obtained, and these averages across the whole data set are then averaged to provide the function return. The code was run on daily bars of the four major forex pairs; EURUSD, GBPUSD, USDCHF and USDYPY.

The results of running the above are quite interesting. The first surprise is that the currency strength indicator and features derived from it were not included in the optimal model for any of the four tested pairs. Secondly, for all pairs, a scaled version of a 20 bar price momentum function, and derived features, was included in the optimal model. Finally, again for all pairs, there was a symmetrically decreasing lookback period across the selected features, and when averaged across all pairs the following pattern results: 10 3 3 2 1 3 3 2 1, which is to be read as:
  • 10 nodes (plus a bias node) in the hidden layer
  • lookback length of 3 for the scaled values of the SMA20 and the 20 bar scaled momentum function
  • lookback length of 3 for the slopes/rates of change of the above
  • lookback length of 2 for the "accelerations" of the above
  • lookback length of 1 for the "jerks" of the above 
So it would seem that the 20 bar momentum function is a better exogenous input than the currency strength indicator. The symmetry across features is quite pleasing, and the selection of these "physical motion" features across all the tested pairs tends to confirm their validity. The fact that the currency strength indicator was not selected does not mean that this indicator is of no value, but perhaps it should not be used for regression purposes, but rather as a filter. More in due course.

Tuesday, 7 February 2017

Update on Currency Strength Smoothing, and a new direction?

Since my last two posts ( currency strength indicator and preliminary tests thereof ) I have been experimenting with different ways of smoothing the indicators without introducing lag, mostly along the lines of using an oscillator leading signal plus various schemes to smooth and compensate for introduced attenuation and making heavy use of my particle swarm optimisation code. Unfortunately I haven't found anything that really works to my satisfaction and so I have decided to forgo any further attempts at this and just use the indicator in its unsmoothed form as neural net input.

In the process of doing the above work I decided that my particle swarm routine wasn't fast enough and I started using the BayesOpt optimisation library, which is written in C++ and has an interface to Octave. Doing this has greatly decreased the time I've had to spend in my various optimisation routines and the framework provided by the BayesOpt library will enable more ambitious optimisations in the future.

Another discovery for me was this Predicting Stock Market Prices with Physical Laws paper, which has some really useful ideas for neural net input features. In particular I think the idea of combining position, velocity and acceleration with the ideas contained in an earlier post of mine on Savitzky Golay filter convolution and using the currency strength indicators as proxies for the arbitrary sine and cosine waves function posited in the paper hold some promise. More in due course.    

Saturday, 3 September 2016

Possible Addition of NARX Network to Conditional Restricted Boltzmann Machine

It has been over three months since my last post, due to working away from home for some of the summer, a summer holiday and moving home. However, during this time I have continued with my online reading and some new thinking about my conditional restricted boltzmann machine based trading system has developed, namely the use of a nonlinear autoregressive exogenous model in the bottom layer gaussian units of the CRBM. Some links to reading on the same are shown below.
The exogenous time series I am thinking of using, at least for the major forex pairs and perhaps precious metals, oil and US treasuries, is a currency strength indicator based on the US dollar. In order to create the currency strength indicator I will have to delve into some data wrangling with the historical forex data I have, and this will be the subject of my next post.

    Wednesday, 3 February 2016

    Refactored Denoising Autoencoder Update #2

    Below is this second code update.
    %  select rolling window length to use - an optimisable parameter via pso?
    rolling_window_length = 50 ;
    batchsize = 5 ;
    
    %  how-many timesteps do we look back for directed connections - this is what we call the "order" of the model 
    n1 = 3 ; % first "gaussian" layer order, a best guess just for batchdata creation purposes
    n2 = 3 ; % second "binary" layer order, a best guess just for batchdata creation purposes
    
    %  taking into account rolling_window_length, n1, n2 and batchsize, get total lookback length
    remainder = rem( ( rolling_window_length + n1 + n2 ) , batchsize ) ;
    
    if ( remainder > 0 ) % number of training examples with lookback and orders n1 and n2 not exactly divisable by batchsize
    lookback_length = ( rolling_window_length + n1 + n2 + ( batchsize - remainder ) ) ; % increase the lookback_length
    else                 % number of training examples with lookback and orders n1 and n2 exactly divisable by batchsize
    lookback_length = ( rolling_window_length + n1 + n2 ) ;
    end
    
    %  create batchdataindex using lookback_length to index bars in the features matrix
    batchdataindex = ( ( training_point_index - ( lookback_length - 1 ) ) : 1 : training_point_index )' ;
    batchdata = features( batchdataindex , : ) ;
    
    %  now that the batchdata has been created, check it for autocorrelation in the features
    all_ar_coeff = zeros( size( batchdata , 2 ) , 1 ) ;
    
      for ii = 1 : size( batchdata , 2 )
      ar_coeffs = arburg( batchdata( : , ii ) , 10 , 'FPE' ) ;
      all_ar_coeff( ii ) = length( ar_coeffs ) - 1 ;
      end
      
    %  set order of gaussian_crbm, n1, to be equal to the average length of any autocorrelation in the data
    n1 = round( mean( all_ar_coeff ) ) ;  
    
    %  z-normalise the batchdata matrix with the mean and std of columns 
    data_mean = mean( batchdata , 1 ) ;
    data_std = std( batchdata , 1 ) ;
    batchdata = ( batchdata .- repmat( data_mean , size( batchdata , 1 ) , 1 ) ) ./ repmat( data_std , size( batchdata , 1 ) , 1 ) ; % batchdata is now z-normalised by data_mean & data_std
    
    %  create the minibatch index matrix for gaussian rbm pre-training of directed weights w
    minibatch = ( 1 : 1 : size( batchdata , 1 ) ) ; minibatch( 1 : ( size( batchdata , 1 ) - rolling_window_length ) ) = [] ;
    minibatch = minibatch( randperm( size( minibatch , 2 ) ) ) ; minibatch = reshape( minibatch , batchsize , size( minibatch , 2 ) / batchsize ) ; 
    
    % PRE-TRAINING FOR THE VISABLE TO HIDDEN AND THE VISIBLE TO VISIBLE WEIGHTS %%%%
    % First create a training set and target set for the pre-training of gaussian layer
    dAuto_Encode_targets = batchdata ; dAuto_Encode_training_data = [] ;
    % dAuto_Encode_targets = batchdata( : , 2 : end ) ; dAuto_Encode_training_data = [] ; % if bias added to raw data
      
      % loop to create the dAuto_Encode_training_data ( n1 == "order" of the gaussian layer of crbm )
      for ii = 1 : n1
      dAuto_Encode_training_data = [ dAuto_Encode_training_data shift( batchdata , ii ) ] ;
      end
    
    % now delete the first n1 rows due to circular shift induced mismatch of data and targets
    dAuto_Encode_targets( 1 : n1 , : ) = [] ; dAuto_Encode_training_data( 1 : n1 , : ) = [] ;
    
    % DO RBM PRE-TRAINING FOR THE BOTTOM UP DIRECTED WEIGHTS W %%%%%%%%%%%%%%%%%%%%%
    % use rbm trained initial weights instead of using random initialisation for weights
    % Doing this because we are not using regularisation in the autoencoder pre-training
    epochs = 10000 ;
    hidden_layer_size = 4 * size( dAuto_Encode_targets , 2 ) ;
    [ w_weights , w_weights_hid_bias , w_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size , 0.05 ) ;
    % keep a copy of these original w_weights
    w1 = w_weights ;
    [ A_weights , A_weights_hid_bias , A_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , size( dAuto_Encode_targets , 2 ) , 0.05 ) ;
    [ B_weights , B_weights_hid_bias , B_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , hidden_layer_size , 0.05 ) ;
    
    % END OF RBM PRE-TRAINING OF AUTOENCODER WEIGHTS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
    figure(1) ; surf( A_weights ) ; title( 'A Weights after RBM training' ) ;
    figure(2) ; surf( B_weights ) ; title( 'B Weights after RBM training' ) ;
    figure(3) ; surf( w_weights ) ; title( 'w Weights after RBM training' ) ;
    figure(4) ; plot( A_weights_hid_bias , 'b' , B_weights_hid_bias , 'r' , w_weights_vis_bias , 'g' ) ; title( 'Biases after RBM training' ) ; legend( 'A' , 'B' , 'w' ) ;
    
    % DO THE AUTOENCODER TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % create weight update matrices
    A_weights_update = zeros( size( A_weights ) ) ;
    A_weights_hid_bias_update = zeros( size( A_weights_hid_bias ) ) ;
    B_weights_update = zeros( size( B_weights ) ) ;
    B_weights_hid_bias_update = zeros( size( B_weights_hid_bias ) ) ;
    w_weights_update = zeros( size( w_weights ) ) ;
    w_weights_vis_bias_update = zeros( size( w_weights_vis_bias ) ) ;
    
    % for adagrad
    historical_A = zeros( size( A_weights ) ) ;
    historical_A_hid_bias = zeros( size( A_weights_hid_bias ) ) ;
    historical_B = zeros( size( B_weights ) ) ;
    historical_B_hid_bias = zeros( size( B_weights_hid_bias ) ) ;
    historical_w = zeros( size( w_weights ) ) ;
    historical_w_vis_bias = zeros( size( w_weights_vis_bias ) ) ;
    
    % set some training parameters
    n = size( dAuto_Encode_training_data , 1 ) ; % number of training examples in dAuto_Encode_training_data
    input_layer_size = size( dAuto_Encode_training_data , 2 ) ;
    fudge_factor = 1e-6 ; % for numerical stability for adagrad
    learning_rate = 0.01 ; % will be changed to 0.001 after 50 iters through epoch loop
    mom = 0 ;            % will be changed to 0.9 after 50 iters through epoch loop
    noise = 0.5 ;
    epochs = 1000 ;
    cost = zeros( epochs , 1 ) ;
    lowest_cost = inf ;
    
      % Stochastic Gradient Descent training over dAuto_Encode_training_data 
      for iter = 1 : epochs
       
          % change momentum and learning_rate after 50 iters
          if iter == 50
          mom = 0.9 ;
          learning_rate = 0.001 ;
          end
      
          index = randperm( n ) ; % randomise the order of training examples
         
          for training_example = 1 : n
          
          % Select data for this training batch
          tmp_X = dAuto_Encode_training_data( index( training_example ) , : ) ;
          tmp_T = dAuto_Encode_targets( index( training_example ) , : ) ;
          
          % Randomly black out some of the input training data
          tmp_X( rand( size( tmp_X ) ) < noise ) = 0 ;
          
          % feedforward tmp_X through B_weights and get sigmoid e.g ret = 1.0 ./ ( 1.0 + exp(-input) )
          tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( - ( tmp_X * B_weights .+ B_weights_hid_bias ) ) ) ;
          
          % Randomly black out some of tmp_X_through_sigmoid for dropout training
          tmp_X_through_sigmoid( rand( size( tmp_X_through_sigmoid ) ) < noise ) = 0 ;
        
          % feedforward tmp_X through A_weights and add to tmp_X_through_sigmoid * w_weights for linear output layer
          final_output_layer = ( tmp_X * A_weights .+ A_weights_hid_bias ) .+ ( tmp_X_through_sigmoid * w_weights' .+ w_weights_vis_bias ) ;
        
          % now do backpropagation
          % this is the derivative of weights for the linear final_output_layer
          delta_out = ( tmp_T - final_output_layer ) ;
          
          % NOTE! gradient of sigmoid function g = sigmoid(z) .* ( 1.0 .- sigmoid(z) )
          sig_grad = tmp_X_through_sigmoid .* ( 1 .- tmp_X_through_sigmoid ) ; 
          
          % backpropagation only through the w_weights that are connected to tmp_X_through_sigmoid
          delta_hidden = ( delta_out * w_weights ) .* sig_grad ;
          
          % apply deltas from backpropagation with adagrad for the weight updates
          historical_A = historical_A .+ ( tmp_X' * delta_out ).^2 ;    
          A_weights_update = mom .* A_weights_update .+ ( learning_rate .* ( tmp_X' * delta_out ) ) ./ ( fudge_factor .+ sqrt( historical_A ) ) ;
          
          historical_A_hid_bias = historical_A_hid_bias .+ delta_out.^2 ;
          A_weights_hid_bias_update = mom .* A_weights_hid_bias_update .+ ( learning_rate .* delta_out ) ./ ( fudge_factor .+ sqrt( historical_A_hid_bias ) ) ;
          
          historical_w = historical_w .+ ( delta_out' * tmp_X_through_sigmoid ).^2 ;
          w_weights_update = mom .* w_weights_update .+ ( learning_rate .* ( delta_out' * tmp_X_through_sigmoid ) ) ./ ( fudge_factor .+ sqrt( historical_w ) ) ;
          
          historical_w_vis_bias = historical_w_vis_bias .+ delta_out.^2 ;
          w_weights_vis_bias_update = mom .* w_weights_vis_bias_update .+ ( learning_rate .* delta_out ) ./ ( fudge_factor .+ sqrt( historical_w_vis_bias ) ) ;
          
          historical_B = historical_B .+ ( tmp_X' * delta_hidden ).^2 ;
          B_weights_update = mom .* B_weights_update .+ ( learning_rate .* ( tmp_X' * delta_hidden ) ) ./ ( fudge_factor .+ sqrt( historical_B ) ) ;
          
          historical_B_hid_bias = historical_B_hid_bias .+ delta_hidden.^2 ;
          B_weights_hid_bias_update = mom .* B_weights_hid_bias_update .+ ( learning_rate .* delta_hidden ) ./ ( fudge_factor .+ sqrt( historical_B_hid_bias ) ) ;
          
          % update the weight matrices with weight_updates
          A_weights = A_weights + A_weights_update ;
          A_weights_hid_bias = A_weights_hid_bias + A_weights_hid_bias_update ;
          B_weights = B_weights + B_weights_update ;
          B_weights_hid_bias = B_weights_hid_bias + B_weights_hid_bias_update ;
          w_weights = w_weights + w_weights_update ;
          w_weights_vis_bias = w_weights_vis_bias + w_weights_vis_bias_update ;
          
          end % end of training_example loop
      
      % feedforward with this epoch's updated weights
      epoch_trained_tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( -( dAuto_Encode_training_data * B_weights .+ repmat( B_weights_hid_bias , size( dAuto_Encode_training_data , 1 ) , 1 ) ) ) ) ;
      epoch_trained_output = ( dAuto_Encode_training_data * A_weights .+ repmat( A_weights_hid_bias , size( dAuto_Encode_training_data , 1 ) , 1 ) )...
                              .+ ( epoch_trained_tmp_X_through_sigmoid * w_weights' .+ repmat( w_weights_vis_bias , size( epoch_trained_tmp_X_through_sigmoid , 1 ) , 1 ) ) ;
     
      % get sum squared error cost
      cost( iter , 1 ) = sum( sum( ( dAuto_Encode_targets .- epoch_trained_output ) .^ 2 ) ) ;
      
        % record best so far
        if cost( iter , 1 ) <= lowest_cost
           lowest_cost = cost( iter , 1 ) ;
           iter_min = iter ;
           best_A = A_weights ;
           best_B = B_weights ;
           best_w = w_weights ;
        end
      
      end % end of backpropagation epoch loop
    
    % plot weights
    figure(5) ; surf( best_A ) ; title( 'Best A Weights' ) ;
    figure(6) ; surf( best_B ) ; title( 'Best B Weights' ) ;
    figure(7) ; surf( best_w ) ; title( 'Best w Weights' ) ;
    figure(8) ; plot( A_weights_hid_bias , 'b' , B_weights_hid_bias , 'r' , w_weights_vis_bias , 'g' ) ; title( 'Biases after Autoencoder training' ) ; legend( 'A' , 'B' , 'w' ) ;
    figure(9) ; plot( cost ) ; title( 'Evolution of Autoencoder cost' ) ;
    
    % END OF CRBM WEIGHT PRE-TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
    The changes from the previous code update are a slightly different way to handle the bias units, the introduction of hidden and visible bias units from Restricted Boltzmann machine (RBM) pre-training and the introduction of an automated way to select the "order" of the Conditional Restricted Boltzmann machine (CRBM).

    The order of a CRBM is how many time steps we look back in order to model the autoregressive components. This could be decided heuristically or through cross validation but I have decided to use the Octave "arburg" function to "auto-magically" select this look back length, the idea being that the data itself informs this decision and makes the whole CRBM training algorithm adaptive to current conditions. Since the ultimate point of the CRBM will be to make predictions of future OHLC values I have chosen to use the final prediction error model selection criteria for the arburg function.

    Now that the bulk of this coding has been completed I think it would be useful to describe the proposed work flow of the various components.
    • the data and its derived inputs, such as indicators etc, are input to a Gaussian RBM as a weight initialisation step for the denoising autoencoder training. A Gaussian RBM is used because the data are real valued and not binary. This step is typical of what happens in deep learning and helps to extract meaningful features from the raw data in an unsupervised manner
    • the data and RBM initialised weights are then input to the denoising autoencoder to further model the weights and to take into account the autoregressive components of the data
    • these twice modelled weights are then used as the initial weights for the CRBM training of a Gaussian-Binary CRBM layer
    • the hidden layer of the above Gaussian-Binary CRBM is then used as data for a second Binary-Binary CRBM layer which will be stacked. The training for this second layer will follow the format above, i.e. RBM and denoising autoencoder pre-training of weights
    The next step will be for me to compile the denoising autoencoder code into an Octave C++ .oct function for speed optimisation purposes.