Dekalog Blog: Feature extraction

Showing posts with label Feature extraction. Show all posts

Monday, 3 November 2025

Expressing an Indicator in Neural Net Form, Part 3.

The results of the first set of tests of optimising an indicator via the framework of training a neural net are in, and this post is a presentation of these results and a reflection on this in more general terms. I would encourage readers to look at my previous 2 posts to put this one in context.

The following chart plot shows 8 weeks of 10 minute price action in the EURUSD forex pair, with the white equity curve being the cumulative sum of open to open tick returns over this period. The green equity curve is a similar cumulative sum of the tick returns based on going long/short when the basic indicator crosses above/below its zero line. Finally, the mass of magenta coloured equity curves are those that result from various simple moving average smoothing, momentum of and smooths of momentum, and short term MACDs of the basic indicator, with the same zero line crossing positioning.

These different magenta equity curves are expressed by setting different weights for the "decision weights" shown in the neural net diagram in the previous 2 posts, e.g. a weight matrix [ 0.25 0.25 0.25 0.25 ] is obviously a 4 bar simple moving, whilst weight matrices [ 0.25 0.25 -0.5 -0.5 ] and [ -0.25 -0.25 0.5 0.5 ] are MACDs of 2 bar fast and 4 bar slow simple moving averages (respectively a fast minus slow MACD and slow minus fast MACD - short term trend following vs. mean reversion?) In this way it is possible to create all the various smooths etc. shown above.

Taking this idea a bit further, it is possible to characterise the more "traditional" way of back testing, i.e. testing over a range of possible indicator look back periods, smooths etc. as an extremely limited or constrained form of neural net training. Consider the neural net diagram of my previous posts - a "traditional" back test is functionally equivalent to:-

setting the neural net architecture as shown in the above mentioned diagram, but only allowing linear hidden activation units
not allowing the addition of any bias units
fixing the input weights and output weights matrices prior to any training and not allowing any updating of these weights to take place
and only allowing the decision weights to be updated, and limiting this updating to a choice between a limited set of fixed weights and not allowing any error based backpropagation to take place

The next chart plot shows the out of sample result of such limited training. The white and green equity curves are the same as previously described, the cumulative sum of open to open returns and zero line crossings of the basic indicator. The single magenta coloured equity curve is the result of following this limited, "traditional" walk forward optimization

choose the best set of fixed decision weights over a 2 week window. This best set is decided by choosing the equity curve with the highest Sortino ratio. For the first train/test iteration the 2 week window of training data is not shown but is that data immediately prior to the left hand edge of the plot
this best set is used out of sample over 1 week of data, the first week of the shown magenta equity curve
at the end of this first test week, roll the training window forward 1 week so that the new training data consists of the latter week of the first set of training data and the data that has just formed the test week
repeat the training over this new set of 2 week training data and test out of sample on the immediately following week's data
keep rolling the window forward, repeating the above, until the end of the test period

The red equity curve is the equity curve that comes from the decision weights that are equivalent to a 3 bar simple moving average of the 1 bar momentum of the basic indicator, shown because this was visually identified in the first post in this series as being "the best." Comparing this equity curve with the magenta ones in the first chart it is obvious that this set of decision weights, out of sample, turns out to be probably the worst of all sets of weights. This is significant because these weights were the ones that were used to initialize the decision weights prior to neural net training.

Finally, the blue equity curve is the out of sample curve for the trained neural net, trained/tested following the same rolling methodology just described above for the magenta curve. In a hand-wavy way, the improvement in performance from the red to the blue equity curves can be said to show the benefits of optimising an indicator's performance via the framework of neural net training. This then begs the question, what if the neural net was initialised with a better set of decision weights, i.e. those of the magenta curve? This will be the subject of my next post.

Wednesday, 24 September 2025

Expressing an Indicator in Neural Net form, Part 2.

Following on from my previous post, I have been experimenting with various tweaks to the basic set-up of expressing an indicator in the form of a neural net, shown again below to avoid the necessity of having readers flip between posts.

The tweaks explored relate to the weight matrices, activation functions, targets and loss functions, the end results of which are now briefly summarized in turn.

The weight matrices are sparse and with shared weights within each separate matrix because they simply calculate the 4 indicator outputs t which represent the indicator and 3 lagged outputs of it, hence sharing the same weights. The input weights matrix is a 40 by 20 matrix with only 10 unique parameters to train/update (expressed during training by averaging across the 40 non zero weights in the matrix). Similarly, the output weights matrix only has 1 unique trainable weight. Together with bias units (not shown in the diagram above) and the 4 decision weights, the whole neural net has a total of 16 trainable weights.

The activation units are exactly as discussed in my previous post and shown in the diagram above, namely tanh on the 2 hidden layers and a sigmoid on the final output layer.

For the targets I used the sign of the slope of a sliding, centred smooth of the OHLC bar opens, on the premise that the first trade opportunity would be acted upon at the opening price following a buy/sell signal calculated on the close of a bar. The smoothing eliminates the whipsaws that would result from a single adverse return in an otherwise smooth set of returns.

The loss is a combination of a weighted cross-entropy loss and a Sortino ratio loss. The weights for the weighted cross-entropy loss are simply the absolute values of the returns, the idea being that it is more important to get the direction right on the big returns. The Sortino loss is implemented by minimizing the negative of the Sortino ratio, i.e. maximizing the ratio. There are 2 versions of this Sortino loss: an analytical gradient that spreads the cost over each training sample per training epoch similar to the cross-entropy loss, and a global loss which uses finite differences over the 4 output decision weights. Both of these Sortino losses were coded with the help of deepseek-talkai. The losses are mixed via a mixing parameter, Lambda, which varies from 0 to 1, with 0 being a pure cross-entropy loss and 1 being pure Sortino loss.

The following .gif shows the training data equity curves for each type of loss (see the titles of each plot) for every hundredth epoch up to epoch 600 over a week's worth of 10 minute forex data limited to the trading hours described in my previous post. The legend in the top left corner identifies each equity curve.

Obviously these plots show that it is possible to improve results over those of the original indicator (shown in red in the above .gif), but of course these are in-sample results. My next post will be about cross validation and out-of sample test results.

Wednesday, 3 September 2025

Expressing an Indicator in Neural Net Form

Recently I started investigating relative rotation graphs with a view to perhaps implementing a version of this for use on forex currency pairs. The underlying idea of a relative rotation graph is to plot an asset's relative strength compared to a benchmark and the momentum of this relative strength and to plot this in a form similar to a Polar coordinate system plot, which rotates around a zero point representing zero relative strength and zero relative strength momentum. After some thought it occurred to me that rather than using relative strength against a benchmark I could use the underlying relative strengths of the individual currencies, as calculated by my currency strength indicator, against each other. Furthermore, these underlying log changes in the individual currencies can be normalised using the ideas of brownian motion and then averaged together over different look back periods to create a unique indicator.

This indicator was coded up in the "traditional" way that will be familiar to anybody who has ever tried coding a trading indicator in any coding language. However, I thought that if the calculation of this indicator could be expressed in the form of a Feedforward neural network then the optimisation opportunities available using backpropagation and regularization could be used to tweek the indicator in more useful ways than just varying a look back length and amount of averaging. After some work I was able to make this work in just these two lines of Octave code:

indicator_middle_layer = tanh( full_feature_matrix * input_weights ) ;
indicator_nn_output = tanh( indicator_middle_layer * output_weights ) ;

Of course, prior to calling these two lines of code, there is some feature engineering to create the input full_feature_matrix, and the input weights and output_weights matrices taken together are mathematically equivalent to the original indicator calculations. Finally, because this is a neural net expression of the indicator, the non-linear tanh activation function is applied to the hidden middle and output layers of the net.

The following plot shows the original indicator in black and the neural net version of it in blue

over the data shown in this plot of 10 minute bars of the EURUSD forex pair.

The red indicator in the plot above is the 1 bar momentum of the neural net indicator plot.

To judge the quality of this indicator I used the entropy measure (measured over 14,000+ 10 minute bars of EURUSD), the results of which are shown next.

entropy_indicator_original = 0.9485
entropy_indicator_nn_output = 0.9933

An entropy reading of 0.9933 is probably as good as any trading indicator could hope to achieve (a perfect reading is 1.0) and so the next thing was to quickly back test the indicator performance. Based on the logic of the indicator the obvious long (short) signals are being above (below) the zeroline, or equivalently the sign of the indicator, and for good measure I also tested the sign of the momentum and some simple moving averages thereof.

The following plot shows the equity curves of this quick test where it is visually clear that the blue equity curves are "the best" when plotted in relation to the black "buy and hold" equivalent equity curve. These represent the equity curves of a 3 bar simple moving average of the 1 bar momentum of both the original formulation of the indicator and the neural net implementation. I would point out that these equity curves represent the theoretical equity resulting from trading the London session after 9:00am BST and stopping at 7:00am EST (usually about noon BST) and then starting trading again at 9:00am EST until 17:00pm EST. This schedule avoids the opening sessions (7:00 to 9:00am) in both London and New York because, from my observations of many OHLC charts such as shown above, there are frequently wild swings where price is being pushed to significant points such as volume profile clusters, accumulations of buy/sell orders etc. and in my opinion no indicator can be expected to perform well in that sort of environment. Also avoided are the hours prior to 7:00am BST, i.e. the Asian session or overnight session.

Although these equity curves might not, at first glance, be that impressive, especially as they do not account for trading costs etc. my intent on doing these tests was to determine the configuration of a final "decision" output layer to be added to the neural net implementation of the indicator. A 3 bar simple moving average of the 1 bar momentum implies the necessity to include 4 consecutive readings of the indicator output as input to a final " decision" layer. The following simple, hand-drawn sketch shows what I mean:

A discussion of this will be the subject of my next post.

Friday, 27 September 2024

Discontinuation of Oanda's OrderBook and PositionBook Endpoints via the V20 Framework

Longtime readers of this blog are almost certainly aware that over the last few years I have posted several times about Oanda's OrderBook and PositionBook data and what can be done with it. My first post was back in February 2022 where I posited the idea of using this data as a sentiment indicator, whilst my most recent post, March 2024, talked about substituting the data into standard, volume based indicators. In between these two dates I blogged about using the data as features for machine learning (here and here), different methods of plotting it (here with example trade and here) and an improved, associated optimisation method here.

Researching and posting about this has been interesting and I was quietly confident that there was some real value to be found doing this. However, I have recently been unpleasantly surprised and disappointed to learn (by way of my API cronjob downloading routines suddenly failing) that Oanda has decided to no longer make available the ability to download this data via their V20 API Framework. So, at a stroke, all of the above work has suddenly become redundant and effectively useless for back testing purposes or for future trading purposes.

Did I say I was disappointed? Well, that understates it somewhat! I have written to Oanda to express my displeasure at this recent change and perhaps, fingers crossed, they will reinstate this V20 functionality.

Friday, 8 April 2022

Simple Machine Learning Models on OrderBook/PositionBook Features

This post is about using OrderBook/PositionBook features as input to simple machine learning models after previous investigation into the relevance of such features.

Due to the amount of training data available I decided to look only at a linear model and small neural networks (NN) with a single hidden layer with up to 6 hidden neurons. This choice was motivated by an academic paper I read online about linear models which stated that, as a lower bound, one should have at least 10 training examples for each parameter to be estimated. Other online reading about order flow imbalance (OFI) suggested there is a linear relationship between OFI and price movement. Use of limited size NNs would allow a small amount of non linearity in the relationship. For this investigation I used the Netlab toolbox and Octave. A plot of the learning curves of the classification models tested is shown below. The targets were binary 1/0 for price increases/decreases.

The blue lines show the average training error (y axis) and the red lines show the same average error metric on the held out cross validation data set for each tested model. The thickness of the lines represents the number of neurons in the single hidden layer of the NNs (the thicker the lines, the higher the number of hidden neurons). The horizontal green line shows the error of a generalized linear model (GLM) trained using iteratively reweighted least squares. It can be seen that NN models with 1 and 2 hidden neurons slightly outperform the GLM, with the 2 neuron model having the edge over the 1 neuron model. NN models with 3 or more hidden neurons over fit and underperform the GLM. The NN models were trained using Netlab's functions for Bayesian regularization over the parameters.

Looking at these results it would seem that a 2 neuron NN would be the best choice; however the error differences between the 1 and 2 neuron NNs and GLM are small enough to anticipate that the final classifications (with a basic greater/less than a 0.5 logistic threshold value for long/short) would perhaps be almost identical.

Investigations into this will be the subject of my next post.

The code box below gives the working Octave code for the above.

## load data
##training_data = dlmread( 'raw_netlab_training_features' ) ;
##cv_data = dlmread( 'raw_netlab_cv_features' ) ;
training_data = dlmread( 'netlab_training_features_svd' ) ;
cv_data = dlmread( 'netlab_cv_features_svd' ) ;
training_targets = dlmread( 'netlab_training_targets' ) ;
cv_targets = dlmread( 'netlab_cv_targets' ) ;

kk_loop_record = zeros( 30 , 7 ) ;

for kk = 1 : 30
  
## first train a glm model as a base comparison
input_dim = size( training_data , 2 ) ; ## Number of inputs.

net_lin = glm( input_dim , 1 , 'logistic' ) ; ## Create a generalized linear model structure.
options = foptions ; ## Sets default parameters for optimisation routines, for compatibility with MATLAB's foptions()
options(1) = 1 ;  ## change default value
##	OPTIONS(1) is set to 1 to display error values during training. If
##	OPTIONS(1) is set to 0, then only warning messages are displayed.  If
##	OPTIONS(1) is -1, then nothing is displayed.
options(14) = 5 ; ## change default value
##	OPTIONS(14) is the maximum number of iterations for the IRLS
##	algorithm;  default 100.
net_lin = glmtrain( net_lin , options , training_data , training_targets ) ;

## test on cv_data
glm_out = glmfwd( net_lin , cv_data ) ;
## cross-entrophy loss
glm_out_loss = -mean( cv_targets .* log( glm_out )  .+ ( 1 .- cv_targets ) .* log( 1 .- glm_out ) ) ;

kk_loop_record( kk , 7 ) = glm_out_loss ;

## now train an mlp
## Set up vector of options for the optimiser.
nouter = 30 ; ## Number of outer loops.
ninner = 2 ;	## Number of innter loops.
options = foptions ; ## Default options vector.
options( 1 ) = 1 ;	## This provides display of error values.
options( 2 ) = 1.0e-5 ; ## Absolute precision for weights.
options( 3 ) = 1.0e-5 ; ## Precision for objective function.
options( 14 ) = 100 ; ## Number of training cycles in inner loop.

training_learning_curve = zeros( nouter , 6 ) ; 
cv_learning_curve = zeros( nouter , 6 ) ;

for jj = 1 : 6

## Set up network parameters.
nin = size( training_data , 2 ) ; ## Number of inputs.
nhidden = jj ;	## Number of hidden units.
nout = 1 ; ## Number of outputs.
alpha = 0.01 ; ## Initial prior hyperparameter.
aw1 = 0.01 ;
ab1 = 0.01 ;
aw2 = 0.01 ;
ab2 = 0.01 ;

## Create and initialize network weight vector.
prior = mlpprior(nin , nhidden , nout , aw1 , ab1 , aw2 , ab2 ) ;
net = mlp( nin , nhidden , nout , 'logistic' , prior ) ;

## Train using scaled conjugate gradients, re-estimating alpha and beta.
for ii = 1 : nouter
  ## train net
  net = netopt( net , options , training_data , training_targets , 'scg' ) ;
  
  train_out = mlpfwd( net , training_data ) ;
  ## get train error
  ## mse
  ##training_learning_curve( ii ) = mean( ( training_targets .- train_out ).^2 ) ;
  
  ## cross entropy loss
  training_learning_curve( ii , jj ) = -mean( training_targets .* log( train_out )  .+ ( 1 .- training_targets ) .* log( 1 .- train_out ) ) ; 

  cv_out = mlpfwd( net , cv_data ) ;
  ## get cv error
  ## mse
  ##cv_learning_curve( ii ) = mean( ( cv_targets .- cv_out ).^2 ) ;
  
  ## cross entropy loss
  cv_learning_curve( ii , jj ) = -mean( cv_targets .* log( cv_out )  .+ ( 1 .- cv_targets ) .* log( 1 .- cv_out ) ) ; 
  
  ## now update hyperparameters based on evidence
  [ net , gamma ] = evidence( net , training_data , training_targets , ninner ) ;
  
##  fprintf( 1 , '\nRe-estimation cycle ##d:\n' , ii ) ;
##  disp( [ '  alpha = ' , num2str( net.alpha' ) ] ) ;
##  fprintf( 1 , '  gamma =  %8.5f\n\n' , gamma ) ;
##  disp(' ')
##  disp('Press any key to continue.')
  ##pause;
endfor ## ii loop

endfor ## jj loop

kk_loop_record( kk , 1 : 6 ) = cv_learning_curve( end , : ) ;

endfor ## kk loop

plot( training_learning_curve(:,1) , 'b' , 'linewidth' , 1 , cv_learning_curve(:,1) , 'r' , 'linewidth' , 1 , ...
training_learning_curve(:,2) , 'b' , 'linewidth' , 2 , cv_learning_curve(:,2) , 'r' , 'linewidth' , 2 , ...
training_learning_curve(:,3) , 'b' , 'linewidth' , 3 , cv_learning_curve(:,3) , 'r' , 'linewidth' , 3 , ...
training_learning_curve(:,4) , 'b' , 'linewidth' , 4 , cv_learning_curve(:,4) , 'r' , 'linewidth' , 4 , ...
training_learning_curve(:,5) , 'b' , 'linewidth' , 5 , cv_learning_curve(:,5) , 'r' , 'linewidth' , 5 , ...
training_learning_curve(:,6) , 'b' , 'linewidth' , 6 , cv_learning_curve(:,6) , 'r' , 'linewidth' , 6 , ...
ones( size( training_learning_curve , 1 ) , 1 ).*glm_out_loss , 'g' , 'linewidth', 2 ) ;

##  >> mean(kk_loop_record)
##  ans =
##
##     0.6928   0.6927   0.7261   0.7509   0.7821   0.8112   0.6990

##  >> std(kk_loop_record)
##  ans =
##
##     8.5241e-06   7.2869e-06   1.2999e-02   1.5285e-02   2.5769e-02   2.6844e-02   2.2584e-16

Friday, 25 March 2022

OrderBook and PositionBook Features

In my previous post I talked about how I planned to use constrained optimization to create features from Oanda's OrderBook and PositionBook data, which can be downloaded via their API. In addition to this I have also created a set of features based on the idea of Order Flow Imbalance (OFI), a nice exposition of which is given in this blog post along with a numerical example of how to calculate OFI. Of course Oanda's OrderBook/PositionBook data is not exactly the same as a conventional limit order book, but I thought they are similar enough to investigate using OFI on them. The result of these investigations is shown in the animated GIF below.

This shows the output from using the R Boruta package to check for the feature relevance of OFI levels to a depth of 20 of both the OrderBook and PositionBook to classify the sign of the log return of price over the periods detailed below following an OrderBook/PositionBook update (the granularity at which the OrderBook/PositionBook data can be updated is 20 minutes):

20 minutes
40 minutes
60 minutes
the 20 minutes starting 20 minutes in the future
the 20 minutes starting 40 minutes in the future

for both the OrderBook and PositionBook, giving a total of 10 separate images/results in the above GIF.

Observant readers may notice that in the GIF there are 42 features being checked, but only an OFI depth of 20. The reason for this is that the data contain information about buys/sell orders and long/short positions both above and below the current price, so what I did was calculate OFI for:

buy orders above price vs sell orders below price
sell orders above price vs buy orders below price
long positions above price vs short positions below price
short positions above price vs long positions below price

As can be seen, almost all features are deemed to be relevant with the exception of 3 OFI levels rejected (red candles) and 2 deemed tentative (yellow candles).

It is my intention to use these features in a machine learning model to classify the probability of future market direction over the time frames mentioned above.

Thursday, 18 June 2020

More Work on RVFL Networks

Back in November last year I posted about Random Vector Functional Link (RVFL) networks here and here. Since then, along with my recent work on Oanda's API Octave functions and Market/Volume Profile visualisation, I have continued looking at RVFL networks and this post is an update on this work.

The "random" in RVFL means random initialisation of weights that are then fixed. It seems to me that it might be possible to do better than random by having some principled way of weight initialisation. To this end I have used the Penalised MATLAB Toolbox on features derived from my ideal cyclic tau embedding function to at first train a Generalized Linear Model with the Lasso penalty and then the Ridge penalty over thousands of sets of Monte Carlo generated, ideal cyclic prices and such prices with trends. The best weights for each set of prices were recorded in an array and then the mean weight (and standard deviation) taken. This set of mean weights is intended to replace the random weights in a RVFL network designed to predict the probability of "price" being at a cyclic turning point using the above cyclic tau embedding features.

Of course these weights could be considered a trained model in and of themselves, and the following screenshots show "out of sample" performance on Monte Carlo generated ideal prices that were not used in the training of the mean weights.

The black line is the underlying cyclic price and the red, blue and green lines are the mean weight model probabilities for cyclic peaks, troughs or neither respectively. Points where the peak/trough probabilities exceed the neither probabilities are marked by the red and blue vertical lines. Similarly, we have prices trending up in a cyclic fashion

and also trending down

In the cases of the last two trending markets only the swing highs and lows are indicated. The reason for this is that during training, based on my "expert knowledge" of the cyclic tau features used, it is unreasonable to expect these features to accurately capture the end of an up leg in a bull trend or the end of a down leg in a bear trend - hence these were not presented as a positive class during training.

As I said above the motivation for this is to get a more meaningful hidden layer in a RVFL network. This hidden layer will consist of seven Sigmoid functions which each give a probability of price being at or not being at a cyclic turn, conditional upon the type of market the input weights were trained on.

Tuesday, 1 October 2019

Ideal Cyclic Tau Embedding as Times Series Features

Continuing on from my Ideal Tau for Time Series Embedding post, I have now written an Octave function based on these ideas to produce features for time series modelling. The function outputs are two slightly different versions of features, examples of which are shown in the following two plots, which show up and down trends in black, following a sinusoidal sideways market partially visible to the left:

and

The function outputs are normalised price and price delayed by a quarter and a half of the cycle period (in this case 20.) The trend slopes have been chosen to exemplify the difference in the features between up and down trends. The function outputs are identical for the unseen cyclic prices to the left.

When the sets of function outputs are plotted as 3D phase space plots typical results are

with the green, blue and red phase trajectories corresponding to the cyclic, up trending and down trending portions of the above synthetic price series. The markers in this plot correspond to points in the phase trajectories at which turns in the underlying price series occur. The following plot is the above plot rotated in phase space such that the green cyclic price phase trajectory is horizontally orientated.

It is clearer in this second phase space plot that there is a qualitative difference in the phase space trajectories of the different, underlying market types.

As a final check of feature relevance I used the Boruta R package, with the above turning point markers as classification targets (a turning point vs. not a turning point,) to assess the utility of this approach in general. These tests were conducted on real price series and also on indices created by my currency strength methodology. In all such Boruta tests the features are deemed "relevant" up to a delay of five bars on daily data (I ceased the tests at the five bar mark because it was becoming tedious to carry on.)

In summary, therefore, it can be concluded that features derived using Taken's theorem are useful on financial time series provided that:

the underlying time series are normalised (detrended) and,
the embedding delay (Tau) is set to the theoretical optimum of a quarter (and multiples thereof) of the measured cyclic period of the time series.

Thursday, 20 April 2017

Using the BayesOpt Library to Optimise my Planned Neural Net

Following on from my last post, I have recently been using the BayesOpt library to optimise my planned neural net, and this post is a brief outline, with code, of what I have been doing.

My intent was to design a Nonlinear autoregressive exogenous model using my currency strength indicator as the main exogenous input, along with other features derived from the use of Savitzky-Golay filter convolution to model velocity, acceleration etc. I decided that rather than model prices directly, I would model the 20 period simple moving average because it would seem reasonable to assume that modelling a smooth function would be easier, and from this average it is a trivial matter to reverse engineer to get to the underlying price.

Given that my projected feature space/lookback length/number of nodes combination is/was a triple digit, discrete dimensional problem, I used the "bayesoptdisc" function from the BayesOpt library to perform a discrete Bayesian optimisation over these parameters, the main Octave script for this being shown below.

clear all ;

% load the data
% load eurusd_daily_bin_bars ;
% load gbpusd_daily_bin_bars ;
% load usdchf_daily_bin_bars ;
load usdjpy_daily_bin_bars ;
load all_rel_strengths_non_smooth ;
% all_rel_strengths_non_smooth = [ usd_rel_strength_non_smooth eur_rel_strength_non_smooth gbp_rel_strength_non_smooth chf_rel_strength_non_smooth ...
% jpy_rel_strength_non_smooth aud_rel_strength_non_smooth cad_rel_strength_non_smooth ] ;

% extract relevant data
% price = ( eurusd_daily_bars( : , 3 ) .+ eurusd_daily_bars( : , 4 ) ) ./ 2 ; % midprice 
% price = ( gbpusd_daily_bars( : , 3 ) .+ gbpusd_daily_bars( : , 4 ) ) ./ 2 ; % midprice
% price = ( usdchf_daily_bars( : , 3 ) .+ usdchf_daily_bars( : , 4 ) ) ./ 2 ; % midprice  
price = ( usdjpy_daily_bars( : , 3 ) .+ usdjpy_daily_bars( : , 4 ) ) ./ 2 ; % midprice
base_strength = all_rel_strengths_non_smooth( : , 1 ) .- 0.5 ;
term_strength = all_rel_strengths_non_smooth( : , 5 ) .- 0.5 ; 

% clear unwanted data
% clear eurusd_daily_bars all_rel_strengths_non_smooth ;
% clear gbpusd_daily_bars all_rel_strengths_non_smooth ;
% clear usdchf_daily_bars all_rel_strengths_non_smooth ;
clear usdjpy_daily_bars all_rel_strengths_non_smooth ;

global start_opt_line_no = 200 ;
global stop_opt_line_no = 7545 ;

% get matrix coeffs
slope_coeffs = generalised_sgolay_filter_coeffs( 5 , 2 , 1 ) ;
accel_coeffs = generalised_sgolay_filter_coeffs( 5 , 2 , 2 ) ;
jerk_coeffs = generalised_sgolay_filter_coeffs( 5 , 3 , 3 ) ;

% create features
sma20 = sma( price , 20 ) ;
global targets = sma20 ;
[ sma_max , sma_min ] = adjustable_lookback_max_min( sma20 , 20 ) ;
global sma20r = zeros( size(sma20,1) , 5 ) ;
global sma20slope = zeros( size(sma20,1) , 5 ) ;
global sma20accel = zeros( size(sma20,1) , 5 ) ;
global sma20jerk = zeros( size(sma20,1) , 5 ) ;

global sma20diffs = zeros( size(sma20,1) , 5 ) ;
global sma20diffslope = zeros( size(sma20,1) , 5 ) ;
global sma20diffaccel = zeros( size(sma20,1) , 5 ) ;
global sma20diffjerk = zeros( size(sma20,1) , 5 ) ;

global base_strength_f = zeros( size(sma20,1) , 5 ) ;
global term_strength_f = zeros( size(sma20,1) , 5 ) ;

base_term_osc = base_strength .- term_strength ;
global base_term_osc_f = zeros( size(sma20,1) , 5 ) ;
slope_bt_osc = rolling_endpoint_gen_poly_output( base_term_osc , 5 , 2 , 1 ) ; % no_of_points(p),filter_order(n),derivative(s)
global slope_bt_osc_f = zeros( size(sma20,1) , 5 ) ;
accel_bt_osc = rolling_endpoint_gen_poly_output( base_term_osc , 5 , 2 , 2 ) ; % no_of_points(p),filter_order(n),derivative(s)
global accel_bt_osc_f = zeros( size(sma20,1) , 5 ) ;
jerk_bt_osc = rolling_endpoint_gen_poly_output( base_term_osc , 5 , 3 , 3 ) ; % no_of_points(p),filter_order(n),derivative(s)
global jerk_bt_osc_f = zeros( size(sma20,1) , 5 ) ;

slope_base_strength = rolling_endpoint_gen_poly_output( base_strength , 5 , 2 , 1 ) ; % no_of_points(p),filter_order(n),derivative(s)
global slope_base_strength_f = zeros( size(sma20,1) , 5 ) ;
accel_base_strength = rolling_endpoint_gen_poly_output( base_strength , 5 , 2 , 2 ) ; % no_of_points(p),filter_order(n),derivative(s)
global accel_base_strength_f = zeros( size(sma20,1) , 5 ) ;
jerk_base_strength = rolling_endpoint_gen_poly_output( base_strength , 5 , 3 , 3 ) ; % no_of_points(p),filter_order(n),derivative(s)
global jerk_base_strength_f = zeros( size(sma20,1) , 5 ) ;

slope_term_strength = rolling_endpoint_gen_poly_output( term_strength , 5 , 2 , 1 ) ; % no_of_points(p),filter_order(n),derivative(s)
global slope_term_strength_f = zeros( size(sma20,1) , 5 ) ;
accel_term_strength = rolling_endpoint_gen_poly_output( term_strength , 5 , 2 , 2 ) ; % no_of_points(p),filter_order(n),derivative(s)
global accel_term_strength_f = zeros( size(sma20,1) , 5 ) ;
jerk_term_strength = rolling_endpoint_gen_poly_output( term_strength , 5 , 3 , 3 ) ; % no_of_points(p),filter_order(n),derivative(s)
global jerk_term_strength_f = zeros( size(sma20,1) , 5 ) ;

min_max_range = sma_max .- sma_min ;

for ii = 51 : size( sma20 , 1 ) - 1 % one step ahead is target
  
  targets(ii) = 2 * ( ( sma20(ii+1) - sma_min(ii) ) / min_max_range(ii) - 0.5 ) ;
  
  % scaled sma20
  sma20r(ii,:) = 2 .* ( ( flipud( sma20(ii-4:ii,1) )' .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) ;
  sma20slope(ii,:) = fliplr( ( 2 .* ( ( sma20(ii-4:ii,1)' .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) ) * slope_coeffs ) ;
  sma20accel(ii,:) = fliplr( ( 2 .* ( ( sma20(ii-4:ii,1)' .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) ) * accel_coeffs ) ;
  sma20jerk(ii,:) = fliplr( ( 2 .* ( ( sma20(ii-4:ii,1)' .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) ) * jerk_coeffs ) ;
  
  % scaled diffs of sma20
  sma20diffs(ii,:) = fliplr( diff( 2.* ( ( sma20(ii-5:ii,1) .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) )' ) ;
  sma20diffslope(ii,:) = fliplr( diff( 2.* ( ( sma20(ii-5:ii,1) .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) )' * slope_coeffs ) ;
  sma20diffaccel(ii,:) = fliplr( diff( 2.* ( ( sma20(ii-5:ii,1) .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) )' * accel_coeffs ) ;
  sma20diffjerk(ii,:) = fliplr( diff( 2.* ( ( sma20(ii-5:ii,1) .- sma_min(ii) ) ./ min_max_range(ii) .- 0.5 ) )' * jerk_coeffs ) ;
  
  % base strength
  base_strength_f(ii,:) = fliplr( base_strength(ii-4:ii)' ) ;
  slope_base_strength_f(ii,:) = fliplr( slope_base_strength(ii-4:ii)' ) ;
  accel_base_strength_f(ii,:) = fliplr( accel_base_strength(ii-4:ii)' ) ;
  jerk_base_strength_f(ii,:) = fliplr( jerk_base_strength(ii-4:ii)' ) ;
  
  % term strength
  term_strength_f(ii,:) = fliplr( term_strength(ii-4:ii)' ) ;
  slope_term_strength_f(ii,:) = fliplr( slope_term_strength(ii-4:ii)' ) ;
  accel_term_strength_f(ii,:) = fliplr( accel_term_strength(ii-4:ii)' ) ;
  jerk_term_strength_f(ii,:) = fliplr( jerk_term_strength(ii-4:ii)' ) ;
  
  % base term oscillator
  base_term_osc_f(ii,:) = fliplr( base_term_osc(ii-4:ii)' ) ;
  slope_bt_osc_f(ii,:) = fliplr( slope_bt_osc(ii-4:ii)' ) ;
  accel_bt_osc_f(ii,:) = fliplr( accel_bt_osc(ii-4:ii)' ) ;
  jerk_bt_osc_f(ii,:) = fliplr( jerk_bt_osc(ii-4:ii)' ) ;
  
endfor

% create xset for bayes routine
% raw indicator
xset = zeros( 4 , 5 ) ; xset( 1 , : ) = 1 : 5 ;
% add the slopes
to_add = zeros( 4 , 15 ) ; 
to_add( 1 , : ) = [ 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 ] ; 
to_add( 2 , : ) = [ 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5 ] ; 
xset = [ xset to_add ] ; 
% add accels
to_add = zeros( 4 , 21 ) ; 
to_add( 1 , : ) = [ 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 ] ; 
to_add( 2 , : ) = [ 1 1 2 2 1 2 2 3 3 3 1 2 3 4 2 3 3 4 4 4 4 ] ; 
to_add( 3 , : ) = [ 1 1 1 2 1 1 2 1 2 3 1 1 1 1 2 2 3 1 2 3 4 ] ;
xset = [ xset to_add ] ; 
% add jerks
to_add = zeros( 4 , 70 ) ; 
to_add( 1 , : ) = [ 1 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ] ; 
to_add( 2 , : ) = [ 1 1 2 2 2 1 2 2 2 3 3 3 3 3 3 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ] ; 
to_add( 3 , : ) = [ 1 1 1 2 2 1 1 2 2 1 2 2 3 3 3 1 1 2 2 1 2 2 3 3 3 1 2 2 3 3 3 4 4 4 4 1 1 2 2 1 2 2 3 3 3 1 2 2 3 3 3 4 4 4 4 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 ] ;
to_add( 4 , : ) = [ 1 1 1 1 2 1 1 1 2 1 1 2 1 2 3 1 1 1 2 1 1 2 1 2 3 1 1 2 1 2 3 1 2 3 4 1 1 1 2 1 1 2 1 2 3 1 1 2 1 2 3 1 2 3 4 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5 ] ;
xset = [ xset to_add ] ;
% construct all_xset for combinations of indicators and look back lengths
all_zeros = zeros( size( xset ) ) ;
all_xset = [ xset ; repmat( all_zeros , 3 , 1 ) ] ;
all_xset = [ all_xset [ xset ; xset ; all_zeros ; all_zeros ] ] ;
all_xset = [ all_xset [ xset ; all_zeros ; xset ; all_zeros ] ] ;
all_xset = [ all_xset [ xset ; all_zeros ; all_zeros ; xset ] ] ;
all_xset = [ all_xset [ xset ; xset ; xset ; all_zeros ] ] ;
all_xset = [ all_xset [ xset ; xset ; all_zeros ; xset ] ] ;
all_xset = [ all_xset [ xset ; all_zeros ; xset ; xset ] ] ;
all_xset = [ all_xset repmat( xset , 4 , 1 ) ] ;

ones_all_xset = ones( 1 , size( all_xset , 2 ) ) ;

% now add layer for number of neurons and extend as necessary
max_number_of_neurons_in_layer = 20 ;

parameter_matrix = [] ;

for ii = 2 : max_number_of_neurons_in_layer % min no. of neurons is 2, max = max_number_of_neurons_in_layer  
  parameter_matrix = [ parameter_matrix [ ii .* ones_all_xset ; all_xset ] ] ;
endfor

% now the actual bayes optimisation routine
% set the parameters
params.n_iterations = 190; % bayesopt library default is 190
params.n_init_samples = 10;
params.crit_name = 'cEIa'; % cEI is default. cEIa is an annealed version
params.surr_name = 'sStudentTProcessNIG';
params.noise = 1e-6;
params.kernel_name = 'kMaternARD5';
params.kernel_hp_mean = [1];
params.kernel_hp_std = [10];
params.verbose_level = 1; % 3 to path below
params.log_filename = '/home/dekalog/Documents/octave/cplusplus.oct_functions/nn_functions/optimise_narx_ind_lookback_nodes_log';
params.l_type = 'L_MCMC' ; % L_EMPIRICAL is default
params.epsilon = 0.5 ; % probability of performing a random (blind) evaluation of the target function. 
% Higher values implies forced exploration while lower values relies more on the exploration/exploitation policy of the criterion. 0 is default

% the function to optimise
fun = 'optimise_narx_ind_lookback_nodes_rolling' ;

% the call to the Bayesopt library function
bayesoptdisc( fun , parameter_matrix , params ) ; 
% result is the minimum as a vector (x_out) and the value of the function at the minimum (y_out)

What this script basically does is:

load all the relevant data ( in this case a forex pair )
creates a set of scaled features
creates a necessary parameter matrix for the discrete optimisation function
sets the parameters for the optimisation routine
and finally calls the "bayesoptdisc" function

Note that in step 2 all the features are declared as global variables, this being necessary because the "bayesoptdisc" function of the BayesOpt library does not appear to admit passing these variables as inputs to the function.

The actual function to be optimised is given in the following code box, and is basically a looped neural net training routine.

## Copyright (C) 2017 dekalog
## 
## This program is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see .

## -*- texinfo -*- 
## @deftypefn {} {@var{retval} =} optimise_narx_ind_lookback_nodes_rolling (@var{input1})
##
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2017-03-21

function [ retval ] = optimise_narx_ind_lookback_nodes_rolling( input1 )
 
% declare all the global variables so the function can "see" them
global start_opt_line_no ;
global stop_opt_line_no ; 
global targets ;
global sma20r ; % 2
global sma20slope ;
global sma20accel ;
global sma20jerk ;
global sma20diffs ;
global sma20diffslope ;
global sma20diffaccel ;
global sma20diffjerk ;
global base_strength_f ;
global slope_base_strength_f ;
global accel_base_strength_f ;
global jerk_base_strength_f ;
global term_strength_f ;
global slope_term_strength_f ;
global accel_term_strength_f ;
global jerk_term_strength_f ;
global base_term_osc_f ;
global slope_bt_osc_f ;
global accel_bt_osc_f ;
global jerk_bt_osc_f ;

% build feature matrix from the above global variable according to parameters in input1
hidden_layer_size = input1(1) ;

% training targets
Y = targets( start_opt_line_no:stop_opt_line_no , 1 ) ;

% create empty feature matrix
X = [] ;

% which will always have at least one element of the main price series for the NARX
X = [ X sma20r( start_opt_line_no:stop_opt_line_no , 1:input1(2) ) ] ;

% go through input1 values in turn and add to X if necessary
if input1(3) > 0
X = [ X sma20slope( start_opt_line_no:stop_opt_line_no , 1:input1(3) ) ] ; 
endif

if input1(4) > 0
X = [ X sma20accel( start_opt_line_no:stop_opt_line_no , 1:input1(4) ) ] ; 
endif

if input1(5) > 0
X = [ X sma20jerk( start_opt_line_no:stop_opt_line_no , 1:input1(5) ) ] ; 
endif

if input1(6) > 0
X = [ X sma20diffs( start_opt_line_no:stop_opt_line_no , 1:input1(6) ) ] ; 
endif

if input1(7) > 0
X = [ X sma20diffslope( start_opt_line_no:stop_opt_line_no , 1:input1(7) ) ] ; 
endif

if input1(8) > 0
X = [ X sma20diffaccel( start_opt_line_no:stop_opt_line_no , 1:input1(8) ) ] ; 
endif

if input1(9) > 0
X = [ X sma20diffjerk( start_opt_line_no:stop_opt_line_no , 1:input1(9) ) ] ; 
endif

if input1(10) > 0 % input for base and term strengths together
X = [ X base_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(10) ) ] ; 
X = [ X term_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(10) ) ] ; 
endif

if input1(11) > 0
X = [ X slope_base_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(11) ) ] ; 
X = [ X slope_term_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(11) ) ] ; 
endif

if input1(12) > 0
X = [ X accel_base_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(12) ) ] ; 
X = [ X accel_term_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(12) ) ] ; 
endif

if input1(13) > 0
X = [ X jerk_base_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(13) ) ] ; 
X = [ X jerk_term_strength_f( start_opt_line_no:stop_opt_line_no , 1:input1(13) ) ] ; 
endif

if input1(14) > 0
X = [ X base_term_osc_f( start_opt_line_no:stop_opt_line_no , 1:input1(14) ) ] ; 
endif

if input1(15) > 0
X = [ X slope_bt_osc_f( start_opt_line_no:stop_opt_line_no , 1:input1(15) ) ] ; 
endif

if input1(16) > 0
X = [ X accel_bt_osc_f( start_opt_line_no:stop_opt_line_no , 1:input1(16) ) ] ; 
endif

if input1(17) > 0
X = [ X jerk_bt_osc_f( start_opt_line_no:stop_opt_line_no , 1:input1(17) ) ] ; 
endif

% now the X features matrix has been formed, get its size
X_rows = size( X , 1 ) ; X_cols = size( X , 2 ) ;

X = [ ones( X_rows , 1 ) X ] ; % add bias unit to X

fan_in = X_cols + 1 ; % no. of inputs to a node/unit, including bias
fan_out = 1 ; % no. of outputs from node/unit
r = sqrt( 6 / ( fan_in + fan_out ) ) ; 

rolling_window_length = 100 ;
n_iters = 100 ;
n_iter_errors = zeros( n_iters , 1 ) ;
all_errors = zeros( X_rows - ( rolling_window_length - 1 ) - 1 , 1 ) ;
rolling_window_loop_iter = 0 ;

for rolling_window_loop = rolling_window_length : X_rows - 1 
 
  rolling_window_loop_iter = rolling_window_loop_iter + 1 ;
  
    % train n_iters no. of nets and put the error stats in n_iter_errors
    for ii = 1 : n_iters
      
      % initialise weights
      % see https://stats.stackexchange.com/questions/47590/what-are-good-initial-weights-in-a-neural-network
      
      % One option is Orthogonal random matrix initialization for input_to_hidden weights
      % w_i = rand( X_cols + 1 , hidden_layer_size ) ;
      % [ u , s , v ] = svd( w_i ) ;
      % input_to_hidden = [ ones( X_rows , 1 ) X ] * u ; % adding bias unit to X
      
      % using fan_in and fan_out for tanh
      w_i = ( rand( X_cols + 1 , hidden_layer_size ) .* ( 2 * r ) ) .- r ;
      input_to_hidden = X( rolling_window_loop - ( rolling_window_length - 1 ) : rolling_window_loop , : ) * w_i ; 
      
      % push the input_to_hidden through the chosen sigmoid function
      hidden_layer_output = sigmoid_lecun_m( input_to_hidden ) ;
      
      % add bias unit for the output from hidden
      hidden_layer_output = [ ones( rolling_window_length , 1 ) hidden_layer_output ] ;
      
      % use hidden_layer_output as the input to a linear regression fit to targets Y
      % a la Extreme Learning Machine
      % w = ( inv( X' * X ) * X' ) * y ; the "classic" way for linear regression, where
      % X = hidden_layer_output, but
      w = ( ( hidden_layer_output' * hidden_layer_output ) \ hidden_layer_output' ) * Y( rolling_window_loop - ( rolling_window_length - 1 ) : rolling_window_loop , 1 ) ;
      % is quicker and recommended
      
      % use these current values of w_i and w for out of sample test
      os_input_to_hidden = X( rolling_window_loop + 1 , : ) * w_i ;
      os_hidden_layer_output = sigmoid_lecun_m( os_input_to_hidden ) ;
      os_hidden_layer_output = [ 1 os_hidden_layer_output ] ; % add bias
      os_output = os_hidden_layer_output * w ;
      n_iter_errors( n_iters ) = abs( Y( rolling_window_loop + 1 , 1 ) - os_output ) ;
      
    endfor
    
    all_errors( rolling_window_loop_iter ) = mean( n_iter_errors ) ;
    
endfor % rolling_window_loop

retval = mean( all_errors ) ;

clear X w_i ; 

endfunction

However, to speed things up for some rapid prototyping, rather than use backpropagation training this function uses the principles of an extreme learning machine and loops over 100 such trained ELMs per set of features contained in a rolling window of length 100 across the entire training data set. Walk forward cross validation is performed for each of the 100 ELMs, an average of the out of sample error obtained, and these averages across the whole data set are then averaged to provide the function return. The code was run on daily bars of the four major forex pairs; EURUSD, GBPUSD, USDCHF and USDYPY.

The results of running the above are quite interesting. The first surprise is that the currency strength indicator and features derived from it were not included in the optimal model for any of the four tested pairs. Secondly, for all pairs, a scaled version of a 20 bar price momentum function, and derived features, was included in the optimal model. Finally, again for all pairs, there was a symmetrically decreasing lookback period across the selected features, and when averaged across all pairs the following pattern results: 10 3 3 2 1 3 3 2 1, which is to be read as:

10 nodes (plus a bias node) in the hidden layer
lookback length of 3 for the scaled values of the SMA20 and the 20 bar scaled momentum function
lookback length of 3 for the slopes/rates of change of the above
lookback length of 2 for the "accelerations" of the above
lookback length of 1 for the "jerks" of the above

So it would seem that the 20 bar momentum function is a better exogenous input than the currency strength indicator. The symmetry across features is quite pleasing, and the selection of these "physical motion" features across all the tested pairs tends to confirm their validity. The fact that the currency strength indicator was not selected does not mean that this indicator is of no value, but perhaps it should not be used for regression purposes, but rather as a filter. More in due course.

Wednesday, 3 February 2016

Refactored Denoising Autoencoder Update #2

Below is this second code update.

%  select rolling window length to use - an optimisable parameter via pso?
rolling_window_length = 50 ;
batchsize = 5 ;

%  how-many timesteps do we look back for directed connections - this is what we call the "order" of the model 
n1 = 3 ; % first "gaussian" layer order, a best guess just for batchdata creation purposes
n2 = 3 ; % second "binary" layer order, a best guess just for batchdata creation purposes

%  taking into account rolling_window_length, n1, n2 and batchsize, get total lookback length
remainder = rem( ( rolling_window_length + n1 + n2 ) , batchsize ) ;

if ( remainder > 0 ) % number of training examples with lookback and orders n1 and n2 not exactly divisable by batchsize
lookback_length = ( rolling_window_length + n1 + n2 + ( batchsize - remainder ) ) ; % increase the lookback_length
else                 % number of training examples with lookback and orders n1 and n2 exactly divisable by batchsize
lookback_length = ( rolling_window_length + n1 + n2 ) ;
end

%  create batchdataindex using lookback_length to index bars in the features matrix
batchdataindex = ( ( training_point_index - ( lookback_length - 1 ) ) : 1 : training_point_index )' ;
batchdata = features( batchdataindex , : ) ;

%  now that the batchdata has been created, check it for autocorrelation in the features
all_ar_coeff = zeros( size( batchdata , 2 ) , 1 ) ;

  for ii = 1 : size( batchdata , 2 )
  ar_coeffs = arburg( batchdata( : , ii ) , 10 , 'FPE' ) ;
  all_ar_coeff( ii ) = length( ar_coeffs ) - 1 ;
  end
  
%  set order of gaussian_crbm, n1, to be equal to the average length of any autocorrelation in the data
n1 = round( mean( all_ar_coeff ) ) ;  

%  z-normalise the batchdata matrix with the mean and std of columns 
data_mean = mean( batchdata , 1 ) ;
data_std = std( batchdata , 1 ) ;
batchdata = ( batchdata .- repmat( data_mean , size( batchdata , 1 ) , 1 ) ) ./ repmat( data_std , size( batchdata , 1 ) , 1 ) ; % batchdata is now z-normalised by data_mean & data_std

%  create the minibatch index matrix for gaussian rbm pre-training of directed weights w
minibatch = ( 1 : 1 : size( batchdata , 1 ) ) ; minibatch( 1 : ( size( batchdata , 1 ) - rolling_window_length ) ) = [] ;
minibatch = minibatch( randperm( size( minibatch , 2 ) ) ) ; minibatch = reshape( minibatch , batchsize , size( minibatch , 2 ) / batchsize ) ; 

% PRE-TRAINING FOR THE VISABLE TO HIDDEN AND THE VISIBLE TO VISIBLE WEIGHTS %%%%
% First create a training set and target set for the pre-training of gaussian layer
dAuto_Encode_targets = batchdata ; dAuto_Encode_training_data = [] ;
% dAuto_Encode_targets = batchdata( : , 2 : end ) ; dAuto_Encode_training_data = [] ; % if bias added to raw data
  
  % loop to create the dAuto_Encode_training_data ( n1 == "order" of the gaussian layer of crbm )
  for ii = 1 : n1
  dAuto_Encode_training_data = [ dAuto_Encode_training_data shift( batchdata , ii ) ] ;
  end

% now delete the first n1 rows due to circular shift induced mismatch of data and targets
dAuto_Encode_targets( 1 : n1 , : ) = [] ; dAuto_Encode_training_data( 1 : n1 , : ) = [] ;

% DO RBM PRE-TRAINING FOR THE BOTTOM UP DIRECTED WEIGHTS W %%%%%%%%%%%%%%%%%%%%%
% use rbm trained initial weights instead of using random initialisation for weights
% Doing this because we are not using regularisation in the autoencoder pre-training
epochs = 10000 ;
hidden_layer_size = 4 * size( dAuto_Encode_targets , 2 ) ;
[ w_weights , w_weights_hid_bias , w_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size , 0.05 ) ;
% keep a copy of these original w_weights
w1 = w_weights ;
[ A_weights , A_weights_hid_bias , A_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , size( dAuto_Encode_targets , 2 ) , 0.05 ) ;
[ B_weights , B_weights_hid_bias , B_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , hidden_layer_size , 0.05 ) ;

% END OF RBM PRE-TRAINING OF AUTOENCODER WEIGHTS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

figure(1) ; surf( A_weights ) ; title( 'A Weights after RBM training' ) ;
figure(2) ; surf( B_weights ) ; title( 'B Weights after RBM training' ) ;
figure(3) ; surf( w_weights ) ; title( 'w Weights after RBM training' ) ;
figure(4) ; plot( A_weights_hid_bias , 'b' , B_weights_hid_bias , 'r' , w_weights_vis_bias , 'g' ) ; title( 'Biases after RBM training' ) ; legend( 'A' , 'B' , 'w' ) ;

% DO THE AUTOENCODER TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% create weight update matrices
A_weights_update = zeros( size( A_weights ) ) ;
A_weights_hid_bias_update = zeros( size( A_weights_hid_bias ) ) ;
B_weights_update = zeros( size( B_weights ) ) ;
B_weights_hid_bias_update = zeros( size( B_weights_hid_bias ) ) ;
w_weights_update = zeros( size( w_weights ) ) ;
w_weights_vis_bias_update = zeros( size( w_weights_vis_bias ) ) ;

% for adagrad
historical_A = zeros( size( A_weights ) ) ;
historical_A_hid_bias = zeros( size( A_weights_hid_bias ) ) ;
historical_B = zeros( size( B_weights ) ) ;
historical_B_hid_bias = zeros( size( B_weights_hid_bias ) ) ;
historical_w = zeros( size( w_weights ) ) ;
historical_w_vis_bias = zeros( size( w_weights_vis_bias ) ) ;

% set some training parameters
n = size( dAuto_Encode_training_data , 1 ) ; % number of training examples in dAuto_Encode_training_data
input_layer_size = size( dAuto_Encode_training_data , 2 ) ;
fudge_factor = 1e-6 ; % for numerical stability for adagrad
learning_rate = 0.01 ; % will be changed to 0.001 after 50 iters through epoch loop
mom = 0 ;            % will be changed to 0.9 after 50 iters through epoch loop
noise = 0.5 ;
epochs = 1000 ;
cost = zeros( epochs , 1 ) ;
lowest_cost = inf ;

  % Stochastic Gradient Descent training over dAuto_Encode_training_data 
  for iter = 1 : epochs
   
      % change momentum and learning_rate after 50 iters
      if iter == 50
      mom = 0.9 ;
      learning_rate = 0.001 ;
      end
  
      index = randperm( n ) ; % randomise the order of training examples
     
      for training_example = 1 : n
      
      % Select data for this training batch
      tmp_X = dAuto_Encode_training_data( index( training_example ) , : ) ;
      tmp_T = dAuto_Encode_targets( index( training_example ) , : ) ;
      
      % Randomly black out some of the input training data
      tmp_X( rand( size( tmp_X ) ) < noise ) = 0 ;
      
      % feedforward tmp_X through B_weights and get sigmoid e.g ret = 1.0 ./ ( 1.0 + exp(-input) )
      tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( - ( tmp_X * B_weights .+ B_weights_hid_bias ) ) ) ;
      
      % Randomly black out some of tmp_X_through_sigmoid for dropout training
      tmp_X_through_sigmoid( rand( size( tmp_X_through_sigmoid ) ) < noise ) = 0 ;
    
      % feedforward tmp_X through A_weights and add to tmp_X_through_sigmoid * w_weights for linear output layer
      final_output_layer = ( tmp_X * A_weights .+ A_weights_hid_bias ) .+ ( tmp_X_through_sigmoid * w_weights' .+ w_weights_vis_bias ) ;
    
      % now do backpropagation
      % this is the derivative of weights for the linear final_output_layer
      delta_out = ( tmp_T - final_output_layer ) ;
      
      % NOTE! gradient of sigmoid function g = sigmoid(z) .* ( 1.0 .- sigmoid(z) )
      sig_grad = tmp_X_through_sigmoid .* ( 1 .- tmp_X_through_sigmoid ) ; 
      
      % backpropagation only through the w_weights that are connected to tmp_X_through_sigmoid
      delta_hidden = ( delta_out * w_weights ) .* sig_grad ;
      
      % apply deltas from backpropagation with adagrad for the weight updates
      historical_A = historical_A .+ ( tmp_X' * delta_out ).^2 ;    
      A_weights_update = mom .* A_weights_update .+ ( learning_rate .* ( tmp_X' * delta_out ) ) ./ ( fudge_factor .+ sqrt( historical_A ) ) ;
      
      historical_A_hid_bias = historical_A_hid_bias .+ delta_out.^2 ;
      A_weights_hid_bias_update = mom .* A_weights_hid_bias_update .+ ( learning_rate .* delta_out ) ./ ( fudge_factor .+ sqrt( historical_A_hid_bias ) ) ;
      
      historical_w = historical_w .+ ( delta_out' * tmp_X_through_sigmoid ).^2 ;
      w_weights_update = mom .* w_weights_update .+ ( learning_rate .* ( delta_out' * tmp_X_through_sigmoid ) ) ./ ( fudge_factor .+ sqrt( historical_w ) ) ;
      
      historical_w_vis_bias = historical_w_vis_bias .+ delta_out.^2 ;
      w_weights_vis_bias_update = mom .* w_weights_vis_bias_update .+ ( learning_rate .* delta_out ) ./ ( fudge_factor .+ sqrt( historical_w_vis_bias ) ) ;
      
      historical_B = historical_B .+ ( tmp_X' * delta_hidden ).^2 ;
      B_weights_update = mom .* B_weights_update .+ ( learning_rate .* ( tmp_X' * delta_hidden ) ) ./ ( fudge_factor .+ sqrt( historical_B ) ) ;
      
      historical_B_hid_bias = historical_B_hid_bias .+ delta_hidden.^2 ;
      B_weights_hid_bias_update = mom .* B_weights_hid_bias_update .+ ( learning_rate .* delta_hidden ) ./ ( fudge_factor .+ sqrt( historical_B_hid_bias ) ) ;
      
      % update the weight matrices with weight_updates
      A_weights = A_weights + A_weights_update ;
      A_weights_hid_bias = A_weights_hid_bias + A_weights_hid_bias_update ;
      B_weights = B_weights + B_weights_update ;
      B_weights_hid_bias = B_weights_hid_bias + B_weights_hid_bias_update ;
      w_weights = w_weights + w_weights_update ;
      w_weights_vis_bias = w_weights_vis_bias + w_weights_vis_bias_update ;
      
      end % end of training_example loop
  
  % feedforward with this epoch's updated weights
  epoch_trained_tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( -( dAuto_Encode_training_data * B_weights .+ repmat( B_weights_hid_bias , size( dAuto_Encode_training_data , 1 ) , 1 ) ) ) ) ;
  epoch_trained_output = ( dAuto_Encode_training_data * A_weights .+ repmat( A_weights_hid_bias , size( dAuto_Encode_training_data , 1 ) , 1 ) )...
                          .+ ( epoch_trained_tmp_X_through_sigmoid * w_weights' .+ repmat( w_weights_vis_bias , size( epoch_trained_tmp_X_through_sigmoid , 1 ) , 1 ) ) ;
 
  % get sum squared error cost
  cost( iter , 1 ) = sum( sum( ( dAuto_Encode_targets .- epoch_trained_output ) .^ 2 ) ) ;
  
    % record best so far
    if cost( iter , 1 ) <= lowest_cost
       lowest_cost = cost( iter , 1 ) ;
       iter_min = iter ;
       best_A = A_weights ;
       best_B = B_weights ;
       best_w = w_weights ;
    end
  
  end % end of backpropagation epoch loop

% plot weights
figure(5) ; surf( best_A ) ; title( 'Best A Weights' ) ;
figure(6) ; surf( best_B ) ; title( 'Best B Weights' ) ;
figure(7) ; surf( best_w ) ; title( 'Best w Weights' ) ;
figure(8) ; plot( A_weights_hid_bias , 'b' , B_weights_hid_bias , 'r' , w_weights_vis_bias , 'g' ) ; title( 'Biases after Autoencoder training' ) ; legend( 'A' , 'B' , 'w' ) ;
figure(9) ; plot( cost ) ; title( 'Evolution of Autoencoder cost' ) ;

% END OF CRBM WEIGHT PRE-TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The changes from the previous code update are a slightly different way to handle the bias units, the introduction of hidden and visible bias units from Restricted Boltzmann machine (RBM) pre-training and the introduction of an automated way to select the "order" of the Conditional Restricted Boltzmann machine (CRBM).

The order of a CRBM is how many time steps we look back in order to model the autoregressive components. This could be decided heuristically or through cross validation but I have decided to use the Octave "arburg" function to "auto-magically" select this look back length, the idea being that the data itself informs this decision and makes the whole CRBM training algorithm adaptive to current conditions. Since the ultimate point of the CRBM will be to make predictions of future OHLC values I have chosen to use the final prediction error model selection criteria for the arburg function.

Now that the bulk of this coding has been completed I think it would be useful to describe the proposed work flow of the various components.

the data and its derived inputs, such as indicators etc, are input to a Gaussian RBM as a weight initialisation step for the denoising autoencoder training. A Gaussian RBM is used because the data are real valued and not binary. This step is typical of what happens in deep learning and helps to extract meaningful features from the raw data in an unsupervised manner
the data and RBM initialised weights are then input to the denoising autoencoder to further model the weights and to take into account the autoregressive components of the data
these twice modelled weights are then used as the initial weights for the CRBM training of a Gaussian-Binary CRBM layer
the hidden layer of the above Gaussian-Binary CRBM is then used as data for a second Binary-Binary CRBM layer which will be stacked. The training for this second layer will follow the format above, i.e. RBM and denoising autoencoder pre-training of weights

The next step will be for me to compile the denoising autoencoder code into an Octave C++ .oct function for speed optimisation purposes.

Tuesday, 13 October 2015

Giving up on Runge-Kutta Methods (for now?)

Over the last few weeks I have been looking at using Runge-Kutta methods for the creation of features, but I have decided to give up on this for now simply because I think I have found a better way to accomplish what I want. I was alerted to this possible approach by this post over at http://dsp.stackexchange.com/ and following up on this I remembered that a few years ago I coded up John Ehler's Linear Prediction Filter (the white paper for which might still be available here) and my crudely transcribed code given below:

 DEFUN_DLD (linearpredict, args, , "Help String")
{
octave_value retval;

 ColumnVector a = args(0).column_vector_value ();                                   // The input price                          
 ColumnVector b(a);                                                                 // This will be the output column vector returned to Octave by "retval"
 const int n = args(1).int_value();
 int lngth = 10;                                                                 
 
 double g[30];
 int zz;
 for (zz = 0; zz < 30; zz++)
     g[zz] = 0.0;

 double sigPredict[30];
 for (zz = 0; zz < 30; zz++)
     sigPredict[zz] = 0.0;

 double sigPower = 0.0;
 double mu = 0.0;
 double xBar = 0.0;
 int jj = 0;                                                          
 for (octave_idx_type ii (10); ii < a.length="" 10="" a="" average="" factor="" for="" href="https://draft.blogger.com/null" if="" ii="" jj="" lngth="" loop="" normalization="" ompute="" onvergence="" power="" sigpower="" start="" the=""> 0)
         mu = 0.25 / (sigPower * 10);

      //Compute signal estimate
      xBar = 0;
      for (jj = 1; jj <= lngth; jj++)
          xBar = xBar + a(ii - jj) * g[jj];

     //Compute gain coefficients
     for (jj = 1; jj <= lngth; jj++)
         g[jj] = g[jj] + (mu * (a(ii) - xBar) * a(ii - jj));

     //Compute signal prediction waveform
     for (jj = 0; jj <= lngth; jj++)
         sigPredict[jj] = a(ii - (10 - jj));

     //Extend signal prediction into the future
     int kk = 0;
     for (jj = lngth + 1; jj <= lngth + 5; jj++)
     {
         sigPredict[jj] = 0;

         for (kk = 1; kk <= lngth; kk++)
	     sigPredict[jj] = sigPredict[jj] + sigPredict[jj - kk] * g[kk];
     }

     b(ii) = sigPredict[lngth + n];

     }
 
retval = b;                                                                         // Assign the output column vector to the return value

return retval;                                                                       // Return the output to Octave
}

which is very similar in concept to Burg's method. I think some application of these methods show more promise than concentrating on my naive implementation of Runge-Kutta.

Monday, 28 September 2015

Runge-Kutta Example and Code

Following on from my last post I thought I would, as a first step, code up a "straightforward" Runge-Kutta function and show how to deal with the fact that there is no "magic mathematical formula" to calculate the slopes that are an integral part of Runge-Kutta.

My approach is to fit a quadratic function to the last n_bars of price and take the slope of this via my Savitzky-Golay filter convolution code, and in doing so the Runge-Kutta value k1 can easily be obtained. The extrapolation beyond the k1 point to the points k2 and k3 is, in effect, trying to fit to points that have a half bar lead over the last available price. To accommodate this I use the last n_bar - 1 points of a 2 bar simple moving average plus the "position" of points k2 and k3 to calculate the slopes at k2 and k3. A 2 bar simple moving average is used because this has a half bar lag and is effectively an interpolation of the known prices at the "half bar intervals," and therefore points k2 and k3 are one h step ahead of the last half bar interval. The k4 point is again simply calculated directly from prices plus the half bar interval projection from point k3. If all this seems confusing, hopefully the Octave code below will clear things up.

clear all

%  create the raw price series
period = input( 'Period? ' ) ;
sideways = zeros( 1 , 2*period ) ;
uwr = ( 1 : 1 : 2*period ) .* 1.333 / period ; uwr_end = uwr(end) ;
unr = ( 1 : 1 : 2*period ) .* 4 / period ; unr_end = unr(end) ;
dwr = ( 1 : 1 : 2*period ) .* -1.333 / period ; dwr_end = dwr(end) ;
dnr = ( 1 : 1 : 2*period ) .* -4 / period ; dnr_end = dnr(end) ;

trends = [ sideways , uwr , unr.+uwr_end , sideways.+uwr_end.+unr_end , dnr.+uwr_end.+unr_end , dwr.+uwr_end.+unr_end.+dnr_end , sideways ] .+ 2 ;
noise = randn( 1 , length(trends) ) .* 0.0 ;
price = sinewave( length( trends ) , period ) .+ trends .+ noise ;
ma_2 = sma( price , 2 ) ;
 
%  regress over 'n_bar' bars
n_bar = 9 ;
%  and a 'p' order fit
p = 2 ;

%  get the relevant coefficients
slope_coeffs = generalised_sgolay_filter_coeffs( n_bar , p , 1 ) ; 

%  container for 1 bar ahead projection
projection_1_bar = zeros( 1 , length( price ) ) ;

for ii = n_bar : length( price )

%  calculate k1 value i.e. values at price(ii), the most recent price
k1 = price( ii-(n_bar-1) : ii ) * slope_coeffs( : , end ) ;
projection_of_point_k2 = price(ii) + k1 / 2 ;

%  calculate k2 value
k2 = [ ma_2( ii-(n_bar-2) : ii ) ; projection_of_point_k2 ]' * slope_coeffs( : , end ) ;
projection_of_point_k3 = price(ii) + k2 / 2 ;

%  calculate k3 value
k3 = [ ma_2( ii-(n_bar-2) : ii ) ; projection_of_point_k3 ]' * slope_coeffs( : , end ) ;
projection_of_point_k4 = price(ii) + k3 / 2 ;

%  calculate k4 value
k4 = [ price( ii-(n_bar-2) : ii ) , projection_of_point_k4 ] * slope_coeffs( : , end ) ;

%  the runge-kutta weighted moving average
projection_1_bar(ii) = price(ii) + ( k1 + 2 * ( k2 + k3 ) + k4 ) / 6 ;

end

%  shift for plotting
projection_1_bar = shift( projection_1_bar , 1 ) ;
projection_1_bar( : , 1:n_bar ) = price( : , 1:n_bar ) ;

plot( price , 'c' , projection_1_bar , 'r' ) ;

This code produces a plot like this, without noise,

and this with noise ( line 12 of the code ).

The cyan line is the underlying price and the red line is the Runge-Kutta 1 bar ahead projection. As can be seen, when the price is moving in rather straight lines the projection is quite accurate, however, at turnings there is some overshoot, which is to be expected. I'm not unduly concerned about this overshoot as my intention is simply to get the various k values as features, but this overshoot does have some implications which I will discuss in a future post.

Friday, 25 September 2015

Runge-Kutta Methods

As stated in my previous post I have been focusing on getting some meaningful features as possible inputs to my machine learning based trading system, and one of the possible ideas that has caught my attention is using Runge-Kutta methods to project ( otherwise known as "guessing" ) future price evolution. I have used this sort of approach before in the construction of my perfect oscillator ( links here, here, here and here ) but I deem this approach unsuitable for the type of "guessing" I want to do for my mfe-mae indicator as I need to "guess" rolling maximum highs and minimum lows, which quite often are actually flat for consecutive price bars. In fact, what I want to do is predict/guess upper and lower bounds for future price evolution, which might actually turn out to be a more tractable problem than predicting price itself.

The above wiki link to Runge-Kutta methods is a pretty dense mathematical read and readers may be wondering how approximation of solutions to ordinary differential equations can possibly relate to my stated aim, however the following links visualise Runge-Kutta in an accessible way:

Readers should hopefully see that what Runge-Kutta basically does is compute a "future" position given a starting value and a function to calculate slopes. When it comes to prices our starting value can be taken to be the most recent price/OHLC/bid/offer/tick in the time series, and whilst I don't possess a mathematical function to exactly calculate future evolution of price slopes, I can use my Savitzky Golay filter convolution code to approximate these slopes. The final icing on this cake is the weighting scheme for the Runge-Kutta method, as shown in the buttersblog link above, i.e.
$y_{n+1} = y_n + {\color{magenta}b_1}k_1 + {\color{magenta}b_2}k_2 + {\color{magenta}b_3}k_3 + {\color{magenta}b_4}k_4 + \cdots + {\color{magenta}b_s}k_s$
which is just linear regression on the k slope values with the most recent price set as the intercept term! Hopefully this will be a useful way to generate features for my conditional restricted boltzmann machine, and if I use regularized linear regression I might finally be able to use my particle swarm optimisation code.

More in due course.

Thursday, 10 September 2015

Recent reading

In my last post I mentioned that I was going away for the summer, but now I'm back. During the summer I didn't get to do as much reading etc. as I had hoped, but I did manage to play around with the Rssa package for singular spectrum analysis and this is still an ongoing investigation. I also briefly looked at independent component analysis and the FastICA package.

The common theme of the above is the extraction of meaningful time series features, and this general area is what I will be looking into for my next set of posts.

Thursday, 11 December 2014

MFE/MAE Indicator Test Results

Following on from the previous post, the test I outlined in that post wasn't very satisfactory, which I put down to the fact that the Sigmoid transformation of the raw MFE/MAE indicator values is not amenable to the application of standard deviation as a meaningful measure. Instead, I changed the test to one based on the standard error of the mean, an example screen shot of which is shown below:-

The top pane shows the the long version of the indicator and the bottom pane the short version. In each there are upper and lower limits of the sample standard error of the mean above and below the population mean (mean of all values of the indicator) along with the cumulative mean value of the top N matches as shown on the x-axis. In this particular example it can be seen that around the 170-180 samples mark the cumulative mean moves inside the standard error limits, never to leave them again. The meaning I ascribe to this is that there is no value to be gained from using more than approximately 180 samples for machine learning purposes, for this example, as to use more samples would be akin to training on all available data, which makes the use of my Cauchy Schwarz matching algo superfluous. I repeated the above on all instances of sigmoid transformed and untransformed MFE/MAE indicator values to get an average of 325 samples for transformed, and an average of 446 samples for the untransformed indicator values across the 4 major forex pairs. Based on this, I have decided to use the top 450 Cauchy Schwarz matches for training purposes, which has ramifications for model complexity will be discussed shortly.

Returning to the above screen shot, the figure 2 inset shows the price bars that immediately follow the price bar for which the main screen shows the top N matches. Looking at the extreme left of the main screen it can be seen that the lower pane, short indicator has an almost maximum reading of 1 whilst the upper pane, long indicator shows a value of approx. 2.7, which is not much above the global minimum for this indicator and well below the 0.5 neutral level. This strongly suggests a short position, and looking at the inset figure it can be seen that over the 3 days following the extreme left matched bar a short position was indeed the best position to hold. This is a pattern that seems to frequently present itself during visual inspection of charts, although I am unable to quantify this in any way.

On the matter of model complexity alluded to above, I found the Learning From Data course I have recently completed on the edX platform to be very enlightening, particularly the concept of the VC dimension, which is nicely explained in the Learning From Data Video library. I'll leave it to interested readers to follow the links, but the big take away for me is that using 450 samples as described above implies that my final machine learning model must have an upper bound of approximately 45 on the VC dimension, which in turn implies a maximum of 45 weights in the neural net. This is a design constraint that I will discuss in a future post.

Pages