Thursday, 31 March 2016

It has been a while since my last post; in the intervening time I have been busy working on the code from my previous few posts.
In the course of this I have noticed that there are some further improvements to be made, in terms of robustness etc., inspired by the Master's thesis Improved Learning Algorithms for Restricted Boltzmann Machines, by KyungHyun Cho. Using the Deepmat Toolbox code available here as a guide, I now intend to further improve my code by incorporating Parallel Tempering and adaptive learning rates into both the RBM and CRBM training.
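For flavour, here is a minimal sketch of the replica-swap step at the heart of Parallel Tempering. All variable names are hypothetical and this is not code from Cho's thesis or the Deepmat Toolbox: E(k) is the energy of the current state of the k-th tempered Gibbs chain and invT(k) its inverse temperature.

% Metropolis swap test between adjacent tempered chains k and k+1
swap_prob = min( 1 , exp( ( invT(k) - invT(k+1) ) * ( E(k) - E(k+1) ) ) ) ;
if ( rand() < swap_prob ) % accept the swap of chain states
  tmp = chain_state{ k } ;
  chain_state{ k } = chain_state{ k + 1 } ;
  chain_state{ k + 1 } = tmp ;
end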
More in due course.
Wednesday, 3 February 2016
Refactored Denoising Autoencoder Update #2
Below is this second code update.
% select rolling window length to use - an optimisable parameter via pso?
rolling_window_length = 50 ;
batchsize = 5 ;
% how many timesteps do we look back for directed connections - this is what we call the "order" of the model
n1 = 3 ; % first "gaussian" layer order, a best guess just for batchdata creation purposes
n2 = 3 ; % second "binary" layer order, a best guess just for batchdata creation purposes
% taking into account rolling_window_length, n1, n2 and batchsize, get total lookback length
remainder = rem( ( rolling_window_length + n1 + n2 ) , batchsize ) ;
if ( remainder > 0 ) % number of training examples with lookback and orders n1 and n2 not exactly divisible by batchsize
lookback_length = ( rolling_window_length + n1 + n2 + ( batchsize - remainder ) ) ; % increase the lookback_length
else % number of training examples with lookback and orders n1 and n2 exactly divisible by batchsize
lookback_length = ( rolling_window_length + n1 + n2 ) ;
end
% create batchdataindex using lookback_length to index bars in the features matrix
batchdataindex = ( ( training_point_index - ( lookback_length - 1 ) ) : 1 : training_point_index )' ;
batchdata = features( batchdataindex , : ) ;
% now that the batchdata has been created, check it for autocorrelation in the features
all_ar_coeff = zeros( size( batchdata , 2 ) , 1 ) ;
for ii = 1 : size( batchdata , 2 )
ar_coeffs = arburg( batchdata( : , ii ) , 10 , 'FPE' ) ;
all_ar_coeff( ii ) = length( ar_coeffs ) - 1 ;
end
% set order of gaussian_crbm, n1, to be equal to the average length of any autocorrelation in the data
n1 = round( mean( all_ar_coeff ) ) ;
% z-normalise the batchdata matrix with the mean and std of columns
data_mean = mean( batchdata , 1 ) ;
data_std = std( batchdata , 1 ) ;
batchdata = ( batchdata .- repmat( data_mean , size( batchdata , 1 ) , 1 ) ) ./ repmat( data_std , size( batchdata , 1 ) , 1 ) ; % batchdata is now z-normalised by data_mean & data_std
% create the minibatch index matrix for gaussian rbm pre-training of directed weights w
minibatch = ( 1 : 1 : size( batchdata , 1 ) ) ; minibatch( 1 : ( size( batchdata , 1 ) - rolling_window_length ) ) = [] ;
minibatch = minibatch( randperm( size( minibatch , 2 ) ) ) ; minibatch = reshape( minibatch , batchsize , size( minibatch , 2 ) / batchsize ) ;
% PRE-TRAINING FOR THE VISIBLE TO HIDDEN AND THE VISIBLE TO VISIBLE WEIGHTS %%%%
% First create a training set and target set for the pre-training of gaussian layer
dAuto_Encode_targets = batchdata ; dAuto_Encode_training_data = [] ;
% dAuto_Encode_targets = batchdata( : , 2 : end ) ; dAuto_Encode_training_data = [] ; % if bias added to raw data
% loop to create the dAuto_Encode_training_data ( n1 == "order" of the gaussian layer of crbm )
for ii = 1 : n1
dAuto_Encode_training_data = [ dAuto_Encode_training_data shift( batchdata , ii ) ] ;
end
% now delete the first n1 rows due to circular shift induced mismatch of data and targets
dAuto_Encode_targets( 1 : n1 , : ) = [] ; dAuto_Encode_training_data( 1 : n1 , : ) = [] ;
% DO RBM PRE-TRAINING FOR THE BOTTOM UP DIRECTED WEIGHTS W %%%%%%%%%%%%%%%%%%%%%
% use rbm trained initial weights instead of using random initialisation for weights
% Doing this because we are not using regularisation in the autoencoder pre-training
epochs = 10000 ;
hidden_layer_size = 4 * size( dAuto_Encode_targets , 2 ) ;
[ w_weights , w_weights_hid_bias , w_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size , 0.05 ) ;
% keep a copy of these original w_weights
w1 = w_weights ;
[ A_weights , A_weights_hid_bias , A_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , size( dAuto_Encode_targets , 2 ) , 0.05 ) ;
[ B_weights , B_weights_hid_bias , B_weights_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , hidden_layer_size , 0.05 ) ;
% END OF RBM PRE-TRAINING OF AUTOENCODER WEIGHTS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
figure(1) ; surf( A_weights ) ; title( 'A Weights after RBM training' ) ;
figure(2) ; surf( B_weights ) ; title( 'B Weights after RBM training' ) ;
figure(3) ; surf( w_weights ) ; title( 'w Weights after RBM training' ) ;
figure(4) ; plot( A_weights_hid_bias , 'b' , B_weights_hid_bias , 'r' , w_weights_vis_bias , 'g' ) ; title( 'Biases after RBM training' ) ; legend( 'A' , 'B' , 'w' ) ;
% DO THE AUTOENCODER TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% create weight update matrices
A_weights_update = zeros( size( A_weights ) ) ;
A_weights_hid_bias_update = zeros( size( A_weights_hid_bias ) ) ;
B_weights_update = zeros( size( B_weights ) ) ;
B_weights_hid_bias_update = zeros( size( B_weights_hid_bias ) ) ;
w_weights_update = zeros( size( w_weights ) ) ;
w_weights_vis_bias_update = zeros( size( w_weights_vis_bias ) ) ;
% for adagrad
historical_A = zeros( size( A_weights ) ) ;
historical_A_hid_bias = zeros( size( A_weights_hid_bias ) ) ;
historical_B = zeros( size( B_weights ) ) ;
historical_B_hid_bias = zeros( size( B_weights_hid_bias ) ) ;
historical_w = zeros( size( w_weights ) ) ;
historical_w_vis_bias = zeros( size( w_weights_vis_bias ) ) ;
% set some training parameters
n = size( dAuto_Encode_training_data , 1 ) ; % number of training examples in dAuto_Encode_training_data
input_layer_size = size( dAuto_Encode_training_data , 2 ) ;
fudge_factor = 1e-6 ; % for numerical stability for adagrad
learning_rate = 0.01 ; % will be changed to 0.001 after 50 iters through epoch loop
mom = 0 ; % will be changed to 0.9 after 50 iters through epoch loop
noise = 0.5 ;
epochs = 1000 ;
cost = zeros( epochs , 1 ) ;
lowest_cost = inf ;
% Stochastic Gradient Descent training over dAuto_Encode_training_data
for iter = 1 : epochs
% change momentum and learning_rate after 50 iters
if iter == 50
mom = 0.9 ;
learning_rate = 0.001 ;
end
index = randperm( n ) ; % randomise the order of training examples
for training_example = 1 : n
% Select data for this training batch
tmp_X = dAuto_Encode_training_data( index( training_example ) , : ) ;
tmp_T = dAuto_Encode_targets( index( training_example ) , : ) ;
% Randomly black out some of the input training data
tmp_X( rand( size( tmp_X ) ) < noise ) = 0 ;
% feedforward tmp_X through B_weights and get sigmoid e.g. ret = 1.0 ./ ( 1.0 + exp(-input) )
tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( - ( tmp_X * B_weights .+ B_weights_hid_bias ) ) ) ;
% Randomly black out some of tmp_X_through_sigmoid for dropout training
tmp_X_through_sigmoid( rand( size( tmp_X_through_sigmoid ) ) < noise ) = 0 ;
% feedforward tmp_X through A_weights and add to tmp_X_through_sigmoid * w_weights for linear output layer
final_output_layer = ( tmp_X * A_weights .+ A_weights_hid_bias ) .+ ( tmp_X_through_sigmoid * w_weights' .+ w_weights_vis_bias ) ;
% now do backpropagation
% this is the derivative of weights for the linear final_output_layer
delta_out = ( tmp_T - final_output_layer ) ;
% NOTE! gradient of sigmoid function g = sigmoid(z) .* ( 1.0 .- sigmoid(z) )
sig_grad = tmp_X_through_sigmoid .* ( 1 .- tmp_X_through_sigmoid ) ;
% backpropagation only through the w_weights that are connected to tmp_X_through_sigmoid
delta_hidden = ( delta_out * w_weights ) .* sig_grad ;
% apply deltas from backpropagation with adagrad for the weight updates
historical_A = historical_A .+ ( tmp_X' * delta_out ).^2 ;
A_weights_update = mom .* A_weights_update .+ ( learning_rate .* ( tmp_X' * delta_out ) ) ./ ( fudge_factor .+ sqrt( historical_A ) ) ;
historical_A_hid_bias = historical_A_hid_bias .+ delta_out.^2 ;
A_weights_hid_bias_update = mom .* A_weights_hid_bias_update .+ ( learning_rate .* delta_out ) ./ ( fudge_factor .+ sqrt( historical_A_hid_bias ) ) ;
historical_w = historical_w .+ ( delta_out' * tmp_X_through_sigmoid ).^2 ;
w_weights_update = mom .* w_weights_update .+ ( learning_rate .* ( delta_out' * tmp_X_through_sigmoid ) ) ./ ( fudge_factor .+ sqrt( historical_w ) ) ;
historical_w_vis_bias = historical_w_vis_bias .+ delta_out.^2 ;
w_weights_vis_bias_update = mom .* w_weights_vis_bias_update .+ ( learning_rate .* delta_out ) ./ ( fudge_factor .+ sqrt( historical_w_vis_bias ) ) ;
historical_B = historical_B .+ ( tmp_X' * delta_hidden ).^2 ;
B_weights_update = mom .* B_weights_update .+ ( learning_rate .* ( tmp_X' * delta_hidden ) ) ./ ( fudge_factor .+ sqrt( historical_B ) ) ;
historical_B_hid_bias = historical_B_hid_bias .+ delta_hidden.^2 ;
B_weights_hid_bias_update = mom .* B_weights_hid_bias_update .+ ( learning_rate .* delta_hidden ) ./ ( fudge_factor .+ sqrt( historical_B_hid_bias ) ) ;
% update the weight matrices with weight_updates
A_weights = A_weights + A_weights_update ;
A_weights_hid_bias = A_weights_hid_bias + A_weights_hid_bias_update ;
B_weights = B_weights + B_weights_update ;
B_weights_hid_bias = B_weights_hid_bias + B_weights_hid_bias_update ;
w_weights = w_weights + w_weights_update ;
w_weights_vis_bias = w_weights_vis_bias + w_weights_vis_bias_update ;
end % end of training_example loop
% feedforward with this epoch's updated weights
epoch_trained_tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( -( dAuto_Encode_training_data * B_weights .+ repmat( B_weights_hid_bias , size( dAuto_Encode_training_data , 1 ) , 1 ) ) ) ) ;
epoch_trained_output = ( dAuto_Encode_training_data * A_weights .+ repmat( A_weights_hid_bias , size( dAuto_Encode_training_data , 1 ) , 1 ) )...
.+ ( epoch_trained_tmp_X_through_sigmoid * w_weights' .+ repmat( w_weights_vis_bias , size( epoch_trained_tmp_X_through_sigmoid , 1 ) , 1 ) ) ;
% get sum squared error cost
cost( iter , 1 ) = sum( sum( ( dAuto_Encode_targets .- epoch_trained_output ) .^ 2 ) ) ;
% record best so far
if cost( iter , 1 ) <= lowest_cost
lowest_cost = cost( iter , 1 ) ;
iter_min = iter ;
best_A = A_weights ;
best_B = B_weights ;
best_w = w_weights ;
end
end % end of backpropagation epoch loop
% plot weights
figure(5) ; surf( best_A ) ; title( 'Best A Weights' ) ;
figure(6) ; surf( best_B ) ; title( 'Best B Weights' ) ;
figure(7) ; surf( best_w ) ; title( 'Best w Weights' ) ;
figure(8) ; plot( A_weights_hid_bias , 'b' , B_weights_hid_bias , 'r' , w_weights_vis_bias , 'g' ) ; title( 'Biases after Autoencoder training' ) ; legend( 'A' , 'B' , 'w' ) ;
figure(9) ; plot( cost ) ; title( 'Evolution of Autoencoder cost' ) ;
% END OF CRBM WEIGHT PRE-TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The changes from the previous code update are a slightly different way of handling the bias units, the introduction of hidden and visible bias units from the Restricted Boltzmann Machine (RBM) pre-training, and an automated way of selecting the "order" of the Conditional Restricted Boltzmann Machine (CRBM).

The order of a CRBM is how many time steps we look back in order to model the autoregressive components. This could be decided heuristically or through cross validation, but I have decided to use the Octave "arburg" function to "auto-magically" select this look-back length, the idea being that the data itself informs this decision and makes the whole CRBM training algorithm adaptive to current conditions. Since the ultimate point of the CRBM will be to make predictions of future OHLC values, I have chosen to use the final prediction error (FPE) model selection criterion for the arburg function.
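As a minimal, self-contained illustration of this order selection, using a synthetic AR(2) series rather than real market data (arburg comes from the Octave-Forge signal package):

pkg load signal ; % arburg is provided by the signal package
x = filter( 1 , [ 1 -0.6 0.3 ] , randn( 500 , 1 ) ) ; % synthetic AR(2) series
ar_coeffs = arburg( x , 10 , 'FPE' ) ; % fit up to order 10, FPE selects the order
selected_order = length( ar_coeffs ) - 1 % typically close to 2 for this series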
Now that the bulk of this coding has been completed, I think it would be useful to describe the proposed work flow of the various components (a rough sketch in code follows the list).
- the data and its derived inputs, such as indicators, etc., are input to a Gaussian RBM as a weight initialisation step for the denoising autoencoder training. A Gaussian RBM is used because the data are real valued and not binary. This step is typical of what happens in deep learning and helps to extract meaningful features from the raw data in an unsupervised manner
- the data and RBM initialised weights are then input to the denoising autoencoder to further model the weights and to take into account the autoregressive components of the data
- these twice modelled weights are then used as the initial weights for the CRBM training of a Gaussian-Binary CRBM layer
- the hidden layer of the above Gaussian-Binary CRBM is then used as data for a second Binary-Binary CRBM layer which will be stacked. The training for this second layer will follow the format above, i.e. RBM and denoising autoencoder pre-training of weights
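Purely as an illustrative sketch of this work flow, using the function names from the code boxes on this page (how the autoencoder-refined weights are injected into the CRBM training is an implementation detail not shown here):

% step 1: Gaussian RBM weight initialisation (unsupervised feature extraction)
[ w_weights , w_hid_bias , w_vis_bias ] = cc_gaussian_rbm( dAuto_Encode_targets , minibatch , 10000 , hidden_layer_size , 0.05 ) ;
% step 2: the denoising autoencoder training above refines w, A and B
% step 3: the refined weights initialise a Gaussian-Binary CRBM layer
[ w , bj , bi , A , B ] = cc_gaussian_crbm_mersenne( batchdata , minibatch , n1 , num_epochs , num_hid ) ;
% step 4: the hidden layer activations of this CRBM become the training data
% for a stacked Binary-Binary CRBM layer, trained in the same way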
Thursday, 21 January 2016
Refactored Denoising Autoencoder Code Update
This code box contains updated code from my previous post. The main change is the inclusion of bias units for the directed auto-regressive weights and the visible to hidden weights. In addition there is code showing how data is pre-processed into batches/targets for the pre-training and code showing how the weight matrices are manipulated into a form suitable for my existing optimised crbm code for gaussian units.
% select rolling window length to use - an optimisable parameter via pso?
rolling_window_length = 50 ;
% how many timesteps do we look back for directed connections - this is what we call the "order" of the model
n1 = 3 ; % first "gaussian" layer order
n2 = 3 ; % second "binary" layer order
batchsize = 5 ;
% taking into account rolling_window_length, n1, n2 and batchsize, get total lookback length
remainder = rem( ( rolling_window_length + n1 + n2 ) , batchsize ) ;
if ( remainder > 0 ) % number of training examples with lookback and orders n1 and n2 not exactly divisible by batchsize
lookback_length = ( rolling_window_length + n1 + n2 + ( batchsize - remainder ) ) ; % increase the lookback_length
else % number of training examples with lookback and orders n1 and n2 exactly divisible by batchsize
lookback_length = ( rolling_window_length + n1 + n2 ) ;
end
% create batchdataindex using lookback_length to index bars in the features matrix
batchdataindex = ( ( training_point_index - ( lookback_length - 1 ) ) : 1 : training_point_index )' ;
batchdata = features( batchdataindex , : ) ;
% z-normalise the batchdata matrix with the mean and std of columns
data_mean = mean( batchdata , 1 ) ;
data_std = std( batchdata , 1 ) ;
batchdata = ( batchdata .- repmat( data_mean , size( batchdata , 1 ) , 1 ) ) ./ repmat( data_std , size( batchdata , 1 ) , 1 ) ; % batchdata is now z-normalised by data_mean & data_std
% add bias neurons
batchdata = [ ones( size( batchdata , 1 ) , 1 ) batchdata ] ;
% create the minibatch index matrix for gaussian rbm pre-training of directed weights w
minibatch = ( 1 : 1 : size( batchdata , 1 ) ) ; minibatch( 1 : ( size( batchdata , 1 ) - rolling_window_length ) ) = [] ;
minibatch = minibatch( randperm( size( minibatch , 2 ) ) ) ; minibatch = reshape( minibatch , batchsize , size( minibatch , 2 ) / batchsize ) ;
% PRE-TRAINING FOR THE VISIBLE TO HIDDEN AND THE VISIBLE TO VISIBLE WEIGHTS %%%%
% First create a training set and target set for the pre-training
dAuto_Encode_targets = batchdata( : , 2 : end ) ; dAuto_Encode_training_data = [] ;
% loop to create the dAuto_Encode_training_data ( n1 == "order" of the gaussian layer of crbm )
for ii = 1 : n1
dAuto_Encode_training_data = [ dAuto_Encode_training_data shift( batchdata , ii ) ] ;
end
% now delete the first n1 rows due to circular shift induced mismatch of data and targets
dAuto_Encode_targets( 1 : n1 , : ) = [] ; dAuto_Encode_training_data( 1 : n1 , : ) = [] ;
% add bias
%dAuto_Encode_training_data = [ ones( size( dAuto_Encode_training_data , 1 ) , 1 ) dAuto_Encode_training_data ] ;
% bias units idx
bias_idx = ( 1 : size( batchdata , 2 ) : size( dAuto_Encode_training_data , 2 ) ) ;
% DO RBM PRE-TRAINING FOR THE BOTTOM UP DIRECTED WEIGHTS W %%%%%%%%%%%%%%%%%%%%%
% use rbm trained initial weights instead of using random initialisation for weights
% Doing this because we are not using regularisation in the autoencoder pre-training
epochs = 10000 ;
hidden_layer_size = 2 * size( dAuto_Encode_targets , 2 ) ;
w_weights = gaussian_rbm( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size ) ;
% keep a copy of these original w_weights
w1 = w_weights ;
A_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , size( dAuto_Encode_targets , 2 ) ) ;
B_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , hidden_layer_size ) ;
% create weight update matrices
A_weights_update = zeros( size( A_weights ) ) ;
B_weights_update = zeros( size( B_weights ) ) ;
w_weights_update = zeros( size( w_weights ) ) ;
% for adagrad
historical_A = zeros( size( A_weights ) ) ;
historical_B = zeros( size( B_weights ) ) ;
historical_w = zeros( size( w_weights ) ) ;
% set some training parameters
n = size( dAuto_Encode_training_data , 1 ) ; % number of training examples in dAuto_Encode_training_data
input_layer_size = size( dAuto_Encode_training_data , 2 ) ;
fudge_factor = 1e-6 ; % for numerical stability for adagrad
learning_rate = 0.1 ; % will be changed to 0.01 after 50 iters through epoch loop
mom = 0 ; % will be changed to 0.9 after 50 iters through epoch loop
noise = 0.5 ;
epochs = 1000 ;
cost = zeros( epochs , 1 ) ;
lowest_cost = inf ;
% Stochastic Gradient Descent training over dAuto_Encode_training_data
for iter = 1 : epochs
% change momentum and learning_rate after 50 iters
if iter == 50
mom = 0.9 ;
learning_rate = 0.01 ;
end
index = randperm( n ) ; % randomise the order of training examples
for training_example = 1 : n
% Select data for this training batch
tmp_X = dAuto_Encode_training_data( index( training_example ) , : ) ;
tmp_T = dAuto_Encode_targets( index( training_example ) , : ) ;
% Randomly black out some of the input training data
tmp_X( rand( size( tmp_X ) ) < noise ) = 0 ;
% but keep bias units
tmp_X( bias_idx ) = 1 ;
% feedforward tmp_X through B_weights and get sigmoid e.g. ret = 1.0 ./ ( 1.0 + exp(-input) )
tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( - B_weights * tmp_X' ) ) ;
% Randomly black out some of tmp_X_through_sigmoid for dropout training
tmp_X_through_sigmoid( rand( size( tmp_X_through_sigmoid ) ) < noise ) = 0 ;
% feedforward tmp_X through A_weights and add to tmp_X_through_sigmoid * w_weights for linear output layer
final_output_layer = ( tmp_X * A_weights' ) .+ ( tmp_X_through_sigmoid' * w_weights ) ;
% now do backpropagation
% this is the derivative of weights for the linear final_output_layer
delta_out = ( tmp_T - final_output_layer ) ;
% NOTE! gradient of sigmoid function g = sigmoid(z) .* ( 1.0 .- sigmoid(z) )
sig_grad = tmp_X_through_sigmoid .* ( 1 .- tmp_X_through_sigmoid ) ;
% backpropagation only through the w_weights that are connected to tmp_X_through_sigmoid
delta_hidden = ( w_weights * delta_out' ) .* sig_grad ;
% apply deltas from backpropagation with adagrad for the weight updates
historical_A = historical_A .+ ( delta_out' * tmp_X ).^2 ;
A_weights_update = mom .* A_weights_update .+ ( learning_rate .* ( delta_out' * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_A ) ) ;
historical_w = historical_w .+ ( tmp_X_through_sigmoid * delta_out ).^2 ;
w_weights_update = mom .* w_weights_update .+ ( learning_rate .* ( tmp_X_through_sigmoid * delta_out ) ) ./ ( fudge_factor .+ sqrt( historical_w ) ) ;
historical_B = historical_B .+ ( delta_hidden * tmp_X ).^2 ;
B_weights_update = mom .* B_weights_update .+ ( learning_rate .* ( delta_hidden * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_B ) ) ;
% update the weight matrices with weight_updates
A_weights = A_weights + A_weights_update ;
B_weights = B_weights + B_weights_update ;
w_weights = w_weights + w_weights_update ;
end % end of training_example loop
% feedforward with this epoch's updated weights
epoch_trained_tmp_X_through_sigmoid = ( 1.0 ./ ( 1.0 .+ exp( -( ( B_weights ) * dAuto_Encode_training_data' ) ) ) ) ;
epoch_trained_output = ( dAuto_Encode_training_data * ( A_weights )' ) .+ ( epoch_trained_tmp_X_through_sigmoid' * ( w_weights ) ) ;
% get sum squared error cost
cost( iter , 1 ) = sum( sum( ( dAuto_Encode_targets .- epoch_trained_output ) .^ 2 ) ) ;
% record best so far
if cost( iter , 1 ) <= lowest_cost
lowest_cost = cost( iter , 1 ) ;
iter_min = iter ;
best_A = A_weights ;
best_B = B_weights ;
best_w = w_weights ;
end
end % end of backpropagation loop
lowest_cost % print final cost to terminal
iter_min % and the iter it occurred on
graphics_toolkit( "qt" ) ;
figure(1) ; plot( cost , 'r' , 'linewidth' , 3 ) ; % and plot the cost curve
% plot weights
graphics_toolkit( "gnuplot" ) ;
figure(2) ; surf( best_A ) ; title( 'Best A Weights' ) ;
figure(3) ; surf( best_B ) ; title( 'Best B Weights' ) ;
figure(4) ; surf( best_w ) ; title( 'Best w Weights' ) ;
% END OF CRBM WEIGHT PRE-TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% extract bias weights from best_A and best_B
A_bias = best_A( : , bias_idx ) ; best_A( : , bias_idx ) = [] ; A_bias = sum( A_bias , 2 ) ;
B_bias = best_B( : , bias_idx ) ; best_B( : , bias_idx ) = [] ; B_bias = sum( B_bias , 2 ) ;
% now delete bias units from batchdata
batchdata( : , 1 ) = [] ;
% create reshaped structures to hold A_weights and B_weights
A1 = reshape( best_A , size( best_A , 1 ) , size( best_A , 2 ) / n1 , n1 ) ;
B1 = reshape( best_B , size( best_B , 1 ) , size( best_B , 2 ) / n1 , n1 ) ;
The following video shows the evolution of the weights whilst training over 20 consecutive price bars. The top three panes are the weights after the denoising autoencoder training and the bottom three are the same weights after being used as initialisation weights for the CRBM training and then being modified by this CRBM training. It is this final set of weights that would typically be used for CRBM generation.

Monday, 18 January 2016
Refactored Denoising Autoencoder Code
It has been quite a challenge, but I now have a working first attempt at Octave code for autoencoder pre-training of weights for a Conditional Restricted Boltzmann Machine, which is shown in the code box below.
% PRE-TRAINING FOR THE VISIBLE TO HIDDEN AND THE VISIBLE TO VISIBLE WEIGHTS %%%%
% First create a training set and target set for the pre-training
dAuto_Encode_targets = batchdata ; dAuto_Encode_training_data = [] ;
% loop to create the dAuto_Encode_training_data ( n1 == "order" of the gaussian layer of crbm )
for ii = 1 : n1
dAuto_Encode_training_data = [ dAuto_Encode_training_data shift( batchdata , ii ) ] ;
end
% now delete the first n1 rows due to circular shift induced mismatch of data and targets
dAuto_Encode_targets( 1 : n1 , : ) = [] ; dAuto_Encode_training_data( 1 : n1 , : ) = [] ;
% DO RBM PRE-TRAINING FOR THE BOTTOM UP DIRECTED WEIGHTS W %%%%%%%%%%%%%%%%%%%%%
% use rbm trained initial weights instead of using random initialisation for weights
% Doing this because we are not using regularisation in the autoencoder pre-training
epochs = 5000 ;
hidden_layer_size = 2 * size( dAuto_Encode_targets , 2 ) ;
w_weights = gaussian_rbm( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size ) ;
A_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , size( dAuto_Encode_targets , 2 ) ) ;
B_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , hidden_layer_size ) ;
% create weight update matrices
A_weights_update = zeros( size( A_weights ) ) ;
B_weights_update = zeros( size( B_weights ) ) ;
w_weights_update = zeros( size( w_weights ) ) ;
% for adagrad
historical_A = zeros( size( A_weights ) ) ;
historical_B = zeros( size( B_weights ) ) ;
historical_w = zeros( size( w_weights ) ) ;
% set some training parameters
n = size( dAuto_Encode_training_data , 1 ) ; % number of training examples in dAuto_Encode_training_data
input_layer_size = size( dAuto_Encode_training_data , 2 ) ;
fudge_factor = 1e-6 ; % for numerical stability for adagrad
learning_rate = 0.1 ; % will be changed to 0.01 after 50 iters through epoch loop
mom = 0 ; % will be changed to 0.9 after 50 iters through epoch loop
noise = 0.5 ;
epochs = 1000 ;
cost = zeros( epochs , 1 ) ;
lowest_cost = inf ;
% Stochastic Gradient Descent training over dAuto_Encode_training_data
for iter = 1 : epochs
% change momentum and learning_rate after 50 iters
if iter == 50
mom = 0.9 ;
learning_rate = 0.01 ;
end
index = randperm( n ) ; % randomise the order of training examples
for training_example = 1 : n
% Select data for this training batch
tmp_X = dAuto_Encode_training_data( index( training_example ) , : ) ;
tmp_T = dAuto_Encode_targets( index( training_example ) , : ) ;
% Randomly black out some of the input training data
tmp_X( rand( size( tmp_X ) ) < noise ) = 0 ;
% feedforward tmp_X through B_weights and get sigmoid e.g. ret = 1.0 ./ ( 1.0 + exp(-input) )
tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( - B_weights * tmp_X' ) ) ;
% Randomly black out some of tmp_X_through_sigmoid for dropout training
tmp_X_through_sigmoid( rand( size( tmp_X_through_sigmoid ) ) < noise ) = 0 ;
% feedforward tmp_X through A_weights and add to tmp_X_through_sigmoid * w_weights for linear output layer
final_output_layer = ( tmp_X * A_weights' ) .+ ( tmp_X_through_sigmoid' * w_weights ) ;
% now do backpropagation
% this is the derivative of weights for the linear final_output_layer
delta_out = ( tmp_T - final_output_layer ) ;
% NOTE! gradient of sigmoid function g = sigmoid(z) .* ( 1.0 .- sigmoid(z) )
sig_grad = tmp_X_through_sigmoid .* ( 1 .- tmp_X_through_sigmoid ) ;
% backpropagation only through the w_weights that are connected to tmp_X_through_sigmoid
delta_hidden = ( w_weights * delta_out' ) .* sig_grad ;
% apply deltas from backpropagation with adagrad for the weight updates
historical_A = historical_A .+ ( delta_out' * tmp_X ).^2 ;
A_weights_update = mom .* A_weights_update .+ ( learning_rate .* ( delta_out' * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_A ) ) ;
historical_w = historical_w .+ ( tmp_X_through_sigmoid * delta_out ).^2 ;
w_weights_update = mom .* w_weights_update .+ ( learning_rate .* ( tmp_X_through_sigmoid * delta_out ) ) ./ ( fudge_factor .+ sqrt( historical_w ) ) ;
historical_B = historical_B .+ ( delta_hidden * tmp_X ).^2 ;
B_weights_update = mom .* B_weights_update .+ ( learning_rate .* ( delta_hidden * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_B ) ) ;
% update the weight matrices with weight_updates
A_weights = A_weights + A_weights_update ;
B_weights = B_weights + B_weights_update ;
w_weights = w_weights + w_weights_update ;
end % end of training_example loop
% feedforward with this epoch's updated weights
epoch_trained_tmp_X_through_sigmoid = ( 1.0 ./ ( 1.0 .+ exp( -( ( B_weights./2 ) * dAuto_Encode_training_data' ) ) ) ) ;
epoch_trained_output = ( dAuto_Encode_training_data * ( A_weights./2 )' ) .+ ( epoch_trained_tmp_X_through_sigmoid' * ( w_weights./2 ) ) ;
% get sum squared error cost
cost( iter , 1 ) = sum( sum( ( dAuto_Encode_targets .- epoch_trained_output ) .^ 2 ) ) ;
% record best so far
if cost( iter , 1 ) <= lowest_cost
lowest_cost = cost( iter , 1 ) ;
best_A = A_weights ./ 2 ;
best_B = B_weights ./ 2 ;
best_w = w_weights ./ 2;
end
end % end of backpropagation loop
lowest_cost % print final cost to terminal
graphics_toolkit( "qt" ) ;
figure(1) ; plot( cost , 'r' , 'linewidth' , 3 ) ; % and plot the cost curve
% plot weights
graphics_toolkit( "gnuplot" ) ;
figure(2) ; surf( best_A ) ; title( 'Best A Weights' ) ;
figure(3) ; surf( best_B ) ; title( 'Best B Weights' ) ;
figure(4) ; surf( best_w ) ; title( 'Best w Weights' ) ;
% END OF CRBM WEIGHT PRE-TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Prior to this code being run there is a bit of data pre-processing to create data batches and their indexes, and of course afterwards the generated weights will be used as initial starting weights for the CRBM training.

I will not discuss the code very much in this post as it will probably be heavily revised in the near future. However, I will point out a few things about it, namely
- I have used Restricted Boltzmann machine pre-training as initial weights for the autoencoder training
- I have applied dropout to the single hidden layer in the autoencoder
- early stopping, momentum and AdaGrad are employed
- bias units are not included
- the code plots some charts for visual inspection (see below), which probably will not be present in the code's final version
My reason for using RBM pre-training instead of random initialisation of weights is simple - RBM training generally seems to result in a lower average cost when the trained autoencoder itself is applied to the training data.
Below is a short video of a desktop slide show which shows typical, final, pre-trained weights from a sequence of 20 training sessions. The first 10 slides are random initialisation and the last 10 are initialisation with RBM training. Note the smoother cost curve for the RBM initialisation. Apart from this initialisation all other training parameters were exactly the same for all training sessions.
The lower pane shows the evolution of cost on the Y axis vs. epoch on the X axis. The upper three panes show what are called the A, B and w weights respectively. A are the direct auto-regressive weights from training data to target data without any intervening hidden layers, B are the weights to the hidden sigmoid layer and w are the weights from the hidden layer to the targets. It is interesting to see how the A weights remain relatively consistent over the different sessions.
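In equation form (my own summary of the code above, rather than a formula taken from any reference), a single forward pass of this autoencoder computes

$$\hat{v} = A\,x + w^{\top}\sigma(B\,x)$$

where $x$ is the concatenated vector of the n1 lagged inputs, $\sigma$ is the logistic sigmoid, $A$ holds the direct auto-regressive weights, $B$ the input-to-hidden weights and $w$ the hidden-to-output weights (bias terms are omitted here, as they are in this version of the code).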
Monday, 20 April 2015
Optimised CRBM Code for Binary Units
Following on from my previous post, in the code box below there is the C++ .oct code for training the binary units of the CRBM and, as with the code for training the Gaussian units, the code is liberally commented. There is only a slight difference between the two code files.
DEFUN_DLD ( cc_binary_crbm_mersenne , args , nargout ,
"-*- texinfo -*-\n\
@deftypefn {Function File} {} cc_binary_crbm_mersenne (@var{ batchdata , minibatch , nt , num_epochs , num_hid })\n\n\
This function trains a binary crbm. The value nt is the order of the model, i.e. how many previous values should be included,\n\
num_epochs is the number of training epochs, and num_hid is the number of nodes in the hidden layer.\n\
@end deftypefn" )
{
octave_value_list retval_list ;
int nargin = args.length () ;
// check the input arguments
if ( nargin != 5 )
{
error ( "Input error: there should be exactly 5 input arguments. Type help for more details." ) ;
return retval_list ;
}
if ( !args(0).is_real_matrix () ) // check batchdata
{
error ( "Input error: the 1st argument, batchdata, is not a matrix type. Type help for more details." ) ;
return retval_list ;
}
if ( !args(1).is_real_matrix () ) // check minibatch
{
error ( "Input error: the 2nd argument, minibatch, is not a matrix type. Type help for more details." ) ;
return retval_list ;
}
if ( args(2).length () != 1 ) // check nt
{
error ( "Input error: nt should be an integer value for the 'order' of the model. Type help for more details." ) ;
return retval_list ;
}
if ( args(3).length () != 1 ) // num_epochs
{
error ( "Input error: num_epochs should be an integer value for the number of training epochs. Type help for more details." ) ;
return retval_list ;
}
if ( args(4).length () != 1 ) // check num_hid
{
error ( "Input error: num_hid should be an integer for the number of nodes in hidden layer. Type help for more details." ) ;
return retval_list ;
}
if ( error_state )
{
error ( "Input error: type help for details." ) ;
return retval_list ;
}
// end of input checking
// inputs
Matrix batchdata = args(0).matrix_value () ;
Matrix minibatch = args(1).matrix_value () ;
int nt = args(2).int_value () ; // the "order" of the model
int num_epochs = args(3).int_value () ;
int num_hid = args(4).int_value () ;
// variables
// batchdata is a big matrix of all the data and we index it with "minibatch", a matrix of mini-batch indices in the columns
int num_cases = minibatch.rows () ; // Octave code ---> num_cases = length( minibatch{ batch } ) ;
int num_dims = batchdata.cols () ; // visible dimension
Matrix bi_star ( num_dims , num_cases ) ; bi_star.fill( 0.0 ) ; // Octave code ---> bi_star = zeros( num_dims , num_cases ) ;
Matrix bj_star ( num_hid , num_cases ) ; bj_star.fill( 0.0 ) ; // Octave code ---> bj_star = zeros( num_hid , num_cases ) ;
Matrix repmat_bj ( num_hid , num_cases ) ; repmat_bj.fill( 0.0 ) ; // for Octave code ---> repmat( bj , 1 , num_cases )
Matrix repmat_bi ( num_dims , num_cases ) ; repmat_bi.fill( 0.0 ) ; // for Octave code ---> repmat( bi , 1 , num_cases )
Matrix eta ( num_hid , num_cases ) ; eta.fill( 0.0 ) ;
Matrix h_posteriors ( num_hid , num_cases ) ; h_posteriors.fill( 0.0 ) ; // for the logistic function
Matrix ones ( num_hid , num_cases ) ; ones.fill( 1.0 ) ; // for the logistic function
Matrix ones_2 ( num_cases , num_hid ) ; ones_2.fill( 1.0 ) ; // for the logistic function of negdata
Matrix hid_states ( num_cases , num_hid ) ; hid_states.fill( 0.0 ) ; // for hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ;
Matrix w_grad ( num_hid , num_dims ) ; w_grad.fill( 0.0 ) ; // for w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
Matrix bi_grad ( num_dims , 1 ) ; bi_grad.fill( 0.0 ) ; // for bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
Matrix bj_grad ( num_hid , 1 ) ; bj_grad.fill( 0.0 ) ; // for bj_grad = sum( hid_states , 1 )' ;
Matrix topdown ( num_cases , num_dims ) ; topdown.fill( 0.0 ) ; // for topdown = gsd .* ( hid_states * w ) ;
Matrix negdata ( num_cases , num_dims ) ; negdata.fill( 0.0 ) ;
Matrix negdata_transpose ( num_dims , num_cases ) ; negdata_transpose.fill( 0.0 ) ;
Matrix bi_transpose ( 1 , num_dims ) ; bi_transpose.fill( 0.0 ) ;
Matrix repmat_bi_transpose ( num_cases , num_dims ) ; repmat_bi_transpose.fill( 0.0 ) ;
Matrix neg_w_grad ( num_hid , num_dims ) ; neg_w_grad.fill( 0.0 ) ;
Matrix neg_bi_grad ( num_dims , 1 ) ; neg_bi_grad.fill( 0.0 ) ; // for neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
Matrix neg_bj_grad ( num_hid , 1 ) ; neg_bj_grad.fill( 0.0 ) ; // for neg_bj_grad = sum( h_posteriors , 2 ) ;
// Setting learning rates and create some utility matrices
Matrix epsilon_w ( num_hid , num_dims ) ; epsilon_w.fill( 0.001 ) ; // undirected
Matrix epsilon_bi ( num_dims , 1 ) ; epsilon_bi.fill( 0.001 ) ; // visibles
Matrix epsilon_bj ( num_hid , 1 ) ; epsilon_bj.fill( 0.001 ) ; // hidden units
Matrix epsilon_A ( num_dims , num_dims ) ; epsilon_A.fill( 0.001 ) ; // autoregressive
Matrix epsilon_B ( num_hid , num_dims ) ; epsilon_B.fill( 0.001 ) ; // prev visibles to hidden
Matrix w_decay ( num_hid , num_dims ) ; w_decay.fill( 0.0002 ) ; // currently we use the same weight decay for w, A, B
Matrix w_decay_A ( num_dims , num_dims ) ; w_decay_A.fill( 0.0002 ) ; // currently we use the same weight decay for w, A, B
Matrix w_decay_B ( num_hid , num_dims ) ; w_decay_B.fill( 0.0002 ) ; // currently we use the same weight decay for w, A, B
Matrix momentum_w ( num_hid , num_dims ) ; momentum_w.fill( 0.0 ) ; // momentum used only after 5 epochs of training, when it will be set to 0.9
Matrix num_cases_matrices_w_and_B ( num_hid , num_dims ) ; num_cases_matrices_w_and_B.fill( num_cases ) ;
Matrix momentum_bi ( num_dims , 1 ) ; momentum_bi.fill( 0.0 ) ;
Matrix num_cases_matrix_bi ( num_dims , 1 ) ; num_cases_matrix_bi.fill( num_cases ) ;
Matrix momentum_bj ( num_hid , 1 ) ; momentum_bj.fill( 0.0 ) ;
Matrix num_cases_matrix_bj ( num_hid , 1 ) ; num_cases_matrix_bj.fill( num_cases ) ;
Matrix momentum_A ( num_dims , num_dims ) ; momentum_A.fill( 0.0 ) ;
Matrix num_cases_matrix_A ( num_dims , num_dims ) ; num_cases_matrix_A.fill( num_cases ) ;
Matrix momentum_B ( num_hid , num_dims ) ; momentum_B.fill( 0.0 ) ;
// initialization of output matrices
Matrix w ( num_hid , num_dims ) ; w.fill( 0.0 ) ; // Octave code ---> w = 0.01 * randn( num_hid , num_dims ) ;
Matrix bi ( num_dims , 1 ) ; bi.fill( 0.0 ) ; // Octave code ---> bi = 0.01 * randn( num_dims , 1 ) ;
Matrix bj( num_hid , 1 ) ; bj.fill( 0.0 ) ; // Octave code ---> bj = -1 + 0.01 * randn( num_hid , 1 ) ; // set to favour units being "off"
// The autoregressive weights; A( : , : , j ) is the weight from t-j to the visible
NDArray A ( dim_vector( num_dims , num_dims , nt ) ) ; A.fill( 0.0 ) ; // Octave code ---> A = 0.01 * randn( num_dims ,num_dims , nt ) ;
// The weights from previous time-steps to the hiddens; B( : , : , j ) is the weight from t-j to the hidden layer
NDArray B ( dim_vector( num_hid , num_dims , nt ) ) ; B.fill( 0.0 ) ; // Octave code ---> B = 0.01 * randn( num_hid , num_dims , nt ) ;
// Declare MersenneTwister random values
MTRand mtrand1 ;
double rand_norm_value ;
double rand_uniform_value ;
// nested loops to fill w, bi, bj, A and B with initial random values
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_hid ; ii_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
bj ( ii_nest_loop , 0 ) = -1.0 + rand_norm_value ; // set to favour units being "off"
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
w ( ii_nest_loop , jj_nest_loop ) = rand_norm_value ;
}
}
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_dims ; ii_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
bi ( ii_nest_loop , 0 ) = rand_norm_value ;
}
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_dims ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
A ( ii_nest_loop , jj_nest_loop , hh ) = rand_norm_value ;
} // end of jj_nest_loop loop
} // end of ii_nest_loop loop
} // end of hh loop
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_hid ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
B ( ii_nest_loop , jj_nest_loop , hh ) = rand_norm_value ;
} // end of jj_nest_loop loop
} // end of ii_nest_loop loop
} // end of hh loop
// keep previous updates around for momentum
Matrix w_update ( num_hid , num_dims ) ; w_update.fill( 0.0 ) ; // Octave code ---> w_update = zeros( size( w ) ) ;
Matrix bi_update ( num_dims , 1 ) ; bi_update.fill( 0.0 ) ; // Octave code ---> bi_update = zeros( size( bi ) ) ;
Matrix bj_update ( num_hid , 1 ) ; bj_update.fill( 0.0 ) ; // Octave code ---> bj_update = zeros( size( bj ) ) ;
NDArray A_update ( dim_vector( num_dims , num_dims , nt ) ) ; A_update.fill( 0.0 ) ; // Octave code ---> A_update = zeros( size( A ) ) ;
Matrix A_extract ( num_dims , num_dims ) ; A_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix A_update_extract ( num_dims , num_dims ) ; A_update_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray B_update ( dim_vector( num_hid , num_dims , nt ) ) ; B_update.fill( 0.0 ) ; // Octave code ---> B_update = zeros( size( B ) ) ;
Matrix B_extract ( num_hid , num_dims ) ; B_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix B_update_extract ( num_hid , num_dims ) ; B_update_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
// data is a nt+1 dimensional array with current and delayed data corresponding to mini-batches
// num_cases = minibatch( batch ).length () ;
num_cases = minibatch.rows () ;
NDArray data ( dim_vector( num_cases , num_dims , nt + 1 ) ) ; data.fill( 0.0 ) ; // Octave code ---> data = zeros( num_cases , num_dims , nt + 1 ) ;
Matrix data_extract ( num_cases , num_dims ) ; data_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_transpose ( num_dims , num_cases ) ; data_transpose.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_0 ( num_cases , num_dims ) ; data_0.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_0_transpose ( num_dims , num_cases ) ; data_0_transpose.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray A_grad ( dim_vector( num_dims , num_dims , nt ) ) ; A_grad.fill( 0.0 ) ; // for A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;
Matrix A_grad_extract ( num_dims , num_dims ) ; A_grad_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray neg_A_grad ( dim_vector( num_dims , num_dims , nt ) ) ; neg_A_grad.fill( 0.0 ) ; // for neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
Matrix neg_A_grad_extract ( num_dims , num_dims ) ; neg_A_grad_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray B_grad ( dim_vector( num_hid , num_dims , nt ) ) ; B_grad.fill( 0.0 ) ; // for B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
Matrix B_grad_extract ( num_hid , num_dims ) ; B_grad_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray neg_B_grad ( dim_vector( num_hid , num_dims , nt ) ) ; neg_B_grad.fill( 0.0 ) ; // for neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;
Matrix neg_B_grad_extract ( num_hid , num_dims ) ; neg_B_grad_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
Array<octave_idx_type> p ( dim_vector( 3 , 1 ) , 0 ) ; // index vector for writing to pages of A_grad and B_grad ( 3 indices, one per dimension of the 3-D NDArrays )
// end of initialization of matrices
// %%%%%%%%% THE MAIN CODE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for ( octave_idx_type epoch ( 0 ) ; epoch < num_epochs ; epoch++ ) // main epoch loop
{
// // errsum = 0 ; % keep a running total of the difference between data and recon
//
for ( octave_idx_type batch ( 0 ) ; batch < minibatch.cols () ; batch++ ) // Octave code ---> int num_batches = minibatch.length () ;
{
// %%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// // These next nested loops fill the data NDArray with the values from batchdata indexed
// // by the column values of minibatch. Octave code equivalent given below:-
// // Octave code ---> mb = minibatch{ batch } ; % caches the indices
// // Octave code ---> data( : , : , 1 ) = batchdata( mb , : ) ;
// // Octave code ---> for hh = 1 : nt
// // Octave code ---> data( : , : , hh + 1 ) = batchdata( mb - hh , : ) ;
// // Octave code ---> end
for ( octave_idx_type hh ( 0 ) ; hh < nt + 1 ; hh++ )
{
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_cases ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
{
data( ii_nest_loop , jj_nest_loop , hh ) = batchdata( minibatch( ii_nest_loop , batch ) - 1 - hh , jj_nest_loop ) ; // -1 for .oct zero based indexing vs Octave's 1 based
} // end of jj_nest_loop loop
} // end of ii_nest_loop loop
} // end of hh loop
// The above data filling loop could perhaps be implemented more quickly using the fortran_vec method as below
// http://stackoverflow.com/questions/28900153/create-a-ndarray-in-an-oct-file-from-a-double-pointer
// NDArray a (dim_vector(dim[0], dim[1], dim[2]));
//
// Then loop over (i, j, k) indices to copy the cube to the octave array
//
// double* a_vec = a.fortran_vec ();
// for (int i = 0; i < dim[0]; i++) {
// for (int j = 0; j < dim[1]; j++) {
// for (int k = 0; k < dim[2]; k++) {
// *a_vec++ = armadillo_cube(i, j, k);
// }
// }
// }
// calculate contributions from directed autoregressive connections and contributions from directed visible-to-hidden connections
bi_star.fill( 0.0 ) ; // Octave code ---> bi_star = zeros( num_dims , num_cases ) ; ( matrix declared earlier in code above )
bj_star.fill( 0.0 ) ; // Octave code ---> bj_star = zeros( num_hid , num_cases ) ; ( matrix declared earlier in code above )
// The code below calculates two separate Octave code loops in one nested C++ loop structure, namely
// Octave code ---> for hh = 1 : nt
// bi_star = bi_star + A( : , : , hh ) * data( : , : , hh + 1 )' ;
// end
// and
// Octave code ---> for hh = 1:nt
// bj_star = bj_star + B( : , : , hh ) * data( : , : , hh + 1 )' ;
// end
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
// fill the intermediate calculation matrices
A_extract = A.page ( hh ) ;
B_extract = B.page ( hh ) ;
data_transpose = ( data.page ( hh + 1 ) ).transpose () ;
// add up the hh different matrix multiplications
bi_star += A_extract * data_transpose ;
bj_star += B_extract * data_transpose ;
} // end of hh loop
// extract and pre-calculate to save time in later computations
data_0 = data.page ( 0 ) ;
data_0_transpose = data_0.transpose () ;
// Calculate "posterior" probability -- hidden state being on ( Note that it isn't a true posterior )
// Octave code ---> eta = w * ( data( : , : , 1 ) ./ gsd )' + ... % bottom-up connections
// repmat( bj , 1 , num_cases ) + ... % static biases on unit
// bj_star ; % dynamic biases
// get repmat( bj , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_cases ; jj_nest_loop++ ) // loop over the columns
{
repmat_bj.insert( bj , 0 , jj_nest_loop ) ;
}
// bottom_up = w * data( : , : , 1 )' ;
eta = w * data_0_transpose + repmat_bj + bj_star ;
// h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
// -exponate the eta matrix
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
{
eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
}
}
// element division A./B == quotient(A,B)
h_posteriors = quotient( ones , ( ones + eta ) ) ;
// Activate the hidden units
// Octave code ---> hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ;
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < hid_states.rows () ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < hid_states.cols () ; jj_nest_loop++ )
{
rand_uniform_value = mtrand1.randDblExc() ; // a real number in the range 0 to 1, excluding both 0 and 1
hid_states( ii_nest_loop , jj_nest_loop ) = h_posteriors( jj_nest_loop , ii_nest_loop ) > rand_uniform_value ? 1.0 : 0.0 ;
}
} // end of hid_states loop
// Calculate positive gradients ( note w.r.t. neg energy )
// Octave code ---> w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
// bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
// bj_grad = sum( hid_states , 1 )' ;
w_grad = hid_states.transpose () * data_0 ;
// get repmat( bi , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_cases ; jj_nest_loop++ ) // loop over the columns
{
repmat_bi.insert( bi , 0 , jj_nest_loop ) ;
}
bi_grad = ( data_0_transpose - repmat_bi - bi_star ).sum ( 1 ) ;
bj_grad = ( hid_states.sum ( 0 ) ).transpose () ;
// Octave code ---> for hh = 1 : nt
// A_grad( : , : , hh ) = ( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
// B_grad( : , : , hh ) = hid_states' * data( : , : , hh + 1 ) ;
// end
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
p( 2 ) = hh ; // set Array p to write to page hh of A_grad and B_grad NDArrays
data_extract = data.page ( hh + 1 ) ; // get the equivalent of data( : , : , hh + 1 )
A_grad.insert( ( data_0_transpose - repmat_bi - bi_star ) * data_extract , p ) ;
B_grad.insert( hid_states.transpose () * data_extract , p ) ;
}
// the above code comes from http://stackoverflow.com/questions/29572075/how-do-you-create-an-arrayoctave-idx-type-in-an-oct-file
// e.g.
//
// Array<octave_idx_type> p (dim_vector (3, 1));
// int n = 2;
// dim_vector dim(n, n, 3);
// NDArray a_matrix(dim);
//
// for (octave_idx_type i = 0; i < n; i++)
// for (octave_idx_type j = 0; j < n; j++)
// a_matrix(i,j, 1) = (i + 1) * 10 + (j + 1);
//
// std::cout << a_matrix;
//
// Matrix b_matrix = Matrix (n, n);
// b_matrix(0, 0) = 1;
// b_matrix(0, 1) = 2;
// b_matrix(1, 0) = 3;
// b_matrix(1, 1) = 4;
// std::cout << b_matrix;
//
// Array<octave_idx_type> p (dim_vector (3, 1), 0);
// p(2) = 2;
// a_matrix.insert (b_matrix, p);
//
// std::cout << a_matrix;
// %%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Activate the visible units
// Octave code ---> topdown = hid_states * w ;
topdown = hid_states * w ;
// get repmat( bi' , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
bi_transpose = bi.transpose () ;
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_cases ; ii_nest_loop++ ) // loop over the rows
{
repmat_bi_transpose.insert( bi_transpose , ii_nest_loop , 0 ) ;
}
eta = topdown + repmat_bi_transpose + bi_star.transpose () ;
// Octave code ---> negdata = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
// -exponate the eta matrix
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
{
eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
}
}
// element division A./B == quotient(A,B)
negdata = quotient( ones_2 , ( ones_2 + eta ) ) ;
// Now conditional on negdata, calculate "posterior" probability for hiddens
// Octave code ---> eta = w * ( negdata ./ gsd )' + ... % bottom-up connections
// repmat( bj , 1 , num_cases ) + ... % static biases on unit (no change)
// bj_star ; % dynamic biases (no change)
negdata_transpose = negdata.transpose () ; // to save repetition of transpose
eta = w * negdata_transpose + repmat_bj + bj_star ;
// h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
// -exponate the eta matrix
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
{
Thursday, 16 April 2015
Optimised CRBM Code for Gaussian Units
Over the last few weeks I have been working on optimising the conditional restricted Boltzmann machine code, with a view to speeding it up via a C++ .oct file, and the code box below contains this .oct code for the gaussian_crbm.m function in my previous post. This gaussian_crbm.m function, plus the binary_crbm.m one, are the speed bottlenecks when training the CRBM. In the code below I have made some adjustments for code simplification purposes, the most important of which are:
- the minibatch index is a matrix rather than a cell type index, with the individual batch indices in its columns, which means that the batches are all equal in length (a sketch of how such an index might be built follows this list) ;
- as a result, all data held in cell arrays in the .m function is held in NDArrays in the .oct function ;
- there is no input for the variable "gsd" because I have assumed a value of 1 applies. This means the input data must be z-score normalised prior to the function call.
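To make the first of these points concrete, below is a minimal Octave sketch of how such a minibatch index matrix might be built and how the .oct function is then called. This is an illustration only, under assumptions: the batches here are simply consecutive blocks of bars, batchdata is taken to be the z-score normalised feature matrix, and the 100 epochs and 20 hidden units are arbitrary example values.
% a minimal sketch, assuming batchdata is the z-score normalised feature matrix
nt = 3 ; % order of the model
batchsize = 5 ;
% the first nt bars have no full history to look back on, so start indexing at nt + 1
usable = ( nt + 1 : rows( batchdata ) )' ;
% truncate so that the number of indices divides exactly by batchsize
usable = usable( 1 : batchsize * floor( length( usable ) / batchsize ) ) ;
% the columns of the minibatch matrix are the individual, equal sized batches
minibatch = reshape( usable , batchsize , length( usable ) / batchsize ) ;
% train the gaussian crbm via the compiled .oct function
[ w , bj , bi , A , B ] = cc_gaussian_crbm_mersenne( batchdata , minibatch , nt , 100 , 20 ) ;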
DEFUN_DLD ( cc_gaussian_crbm_mersenne , args , nargout ,
"-*- texinfo -*-\n\
@deftypefn {Function File} {} cc_gaussian_crbm_mersenne (@var{ batchdata , minibatch , nt , num_epochs , num_hid })\n\n\
This function trains a real valued, gaussian crbm where the gsd is assumed to be 1, so the batchdata input must be z-score normalised.\n\
The value nt is the order of the model, i.e. how many previous values should be included, num_epochs is the number of training epochs,\n\
and num_hid is the number of nodes in the hidden layer.\n\
@end deftypefn" )
{
octave_value_list retval_list ;
int nargin = args.length () ;
// check the input arguments
if ( nargin != 5 )
{
error ( "Input error: there are insufficient input arguments. Type help for more details." ) ;
return retval_list ;
}
if ( !args(0).is_real_matrix () ) // check batchdata
{
error ( "Input error: the 1st argument, batchdata, is not a matrix type. Type help for more details." ) ;
return retval_list ;
}
if ( !args(1).is_real_matrix () ) // check minibatch
{
error ( "Input error: the 2nd argument, minibatch, is not a matrix type. Type help for more details." ) ;
return retval_list ;
}
if ( args(2).length () != 1 ) // check nt
{
error ( "Input error: nt should be an interger value for the 'order' of the model. Type help for more details." ) ;
return retval_list ;
}
if ( args(3).length () != 1 ) // num_epochs
{
error ( "Input error: num_epochs should be an integer value for the number of training epochs. Type help for more details." ) ;
return retval_list ;
}
if ( args(4).length () != 1 ) // check num_hid
{
error ( "Input error: num_hid should be an integer for the number of nodes in hidden layer. Type help for more details." ) ;
return retval_list ;
}
if ( error_state )
{
error ( "Input error: type help for details." ) ;
return retval_list ;
}
// end of input checking
// inputs
Matrix batchdata = args(0).matrix_value () ;
Matrix minibatch = args(1).matrix_value () ;
int nt = args(2).int_value () ; // the "order" of the model
int num_epochs = args(3).int_value () ;
int num_hid = args(4).int_value () ;
// variables
// batchdata is a big matrix of all the data and we index it with "minibatch", a matrix of mini-batch indices in the columns
int num_cases = minibatch.rows () ; // Octave code ---> num_cases = length( minibatch{ batch } ) ;
int num_dims = batchdata.cols () ; // visible dimension
Matrix bi_star ( num_dims , num_cases ) ; bi_star.fill( 0.0 ) ; // Octave code ---> bi_star = zeros( num_dims , num_cases ) ;
Matrix bj_star ( num_hid , num_cases ) ; bj_star.fill( 0.0 ) ; // Octave code ---> bj_star = zeros( num_hid , num_cases ) ;
Matrix repmat_bj ( num_hid , num_cases ) ; repmat_bj.fill( 0.0 ) ; // for Octave code ---> repmat( bj , 1 , num_cases )
Matrix repmat_bi ( num_dims , num_cases ) ; repmat_bi.fill( 0.0 ) ; // for Octave code ---> repmat( bi , 1 , num_cases )
Matrix eta ( num_hid , num_cases ) ; eta.fill( 0.0 ) ;
Matrix h_posteriors ( num_hid , num_cases ) ; h_posteriors.fill( 0.0 ) ; // for the logistic function
Matrix ones ( num_hid , num_cases ) ; ones.fill( 1.0 ) ; // for the logistic function
Matrix hid_states ( num_cases , num_hid ) ; hid_states.fill( 0.0 ) ; // for hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ;
Matrix w_grad ( num_hid , num_dims ) ; w_grad.fill( 0.0 ) ; // for w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
Matrix bi_grad ( num_dims , 1 ) ; bi_grad.fill( 0.0 ) ; // for bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
Matrix bj_grad ( num_hid , 1 ) ; bj_grad.fill( 0.0 ) ; // for bj_grad = sum( hid_states , 1 )' ;
Matrix topdown ( num_cases , num_dims ) ; topdown.fill( 0.0 ) ; // for topdown = gsd .* ( hid_states * w ) ;
Matrix negdata ( num_cases , num_dims ) ; negdata.fill( 0.0 ) ;
Matrix negdata_transpose ( num_dims , num_cases ) ; negdata_transpose.fill( 0.0 ) ;
Matrix bi_transpose ( 1 , num_dims ) ; bi_transpose.fill( 0.0 ) ;
Matrix repmat_bi_transpose ( num_cases , num_dims ) ; repmat_bi_transpose.fill( 0.0 ) ;
Matrix neg_w_grad ( num_hid , num_dims ) ; neg_w_grad.fill( 0.0 ) ;
Matrix neg_bi_grad ( num_dims , 1 ) ; neg_bi_grad.fill( 0.0 ) ; // for neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
Matrix neg_bj_grad ( num_hid , 1 ) ; neg_bj_grad.fill( 0.0 ) ; // for neg_bj_grad = sum( h_posteriors , 2 ) ;
// Setting learning rates and create some utility matrices
Matrix epsilon_w ( num_hid , num_dims ) ; epsilon_w.fill( 0.001 ) ; // undirected
Matrix epsilon_bi ( num_dims , 1 ) ; epsilon_bi.fill( 0.001 ) ; // visibles
Matrix epsilon_bj ( num_hid , 1 ) ; epsilon_bj.fill( 0.001 ) ; // hidden units
Matrix epsilon_A ( num_dims , num_dims ) ; epsilon_A.fill( 0.001 ) ; // autoregressive
Matrix epsilon_B ( num_hid , num_dims ) ; epsilon_B.fill( 0.001 ) ; // prev visibles to hidden
Matrix w_decay ( num_hid , num_dims ) ; w_decay.fill( 0.0002 ) ; // currently we use the same weight decay for w, A, B
Matrix w_decay_A ( num_dims , num_dims ) ; w_decay_A.fill( 0.0002 ) ; // currently we use the same weight decay for w, A, B
Matrix w_decay_B ( num_hid , num_dims ) ; w_decay_B.fill( 0.0002 ) ; // currently we use the same weight decay for w, A, B
Matrix momentum_w ( num_hid , num_dims ) ; momentum_w.fill( 0.0 ) ; // momentum used only after 5 epochs of training, when it will be set to 0.9
Matrix num_cases_matrices_w_and_B ( num_hid , num_dims ) ; num_cases_matrices_w_and_B.fill( num_cases ) ;
Matrix momentum_bi ( num_dims , 1 ) ; momentum_bi.fill( 0.0 ) ;
Matrix num_cases_matrix_bi ( num_dims , 1 ) ; num_cases_matrix_bi.fill( num_cases ) ;
Matrix momentum_bj ( num_hid , 1 ) ; momentum_bj.fill( 0.0 ) ;
Matrix num_cases_matrix_bj ( num_hid , 1 ) ; num_cases_matrix_bj.fill( num_cases ) ;
Matrix momentum_A ( num_dims , num_dims ) ; momentum_A.fill( 0.0 ) ;
Matrix num_cases_matrix_A ( num_dims , num_dims ) ; num_cases_matrix_A.fill( num_cases ) ;
Matrix momentum_B ( num_hid , num_dims ) ; momentum_B.fill( 0.0 ) ;
// initialization of output matrices
Matrix w ( num_hid , num_dims ) ; w.fill( 0.0 ) ; // Octave code ---> w = 0.01 * randn( num_hid , num_dims ) ;
Matrix bi ( num_dims , 1 ) ; bi.fill( 0.0 ) ; // Octave code ---> bi = 0.01 * randn( num_dims , 1 ) ;
Matrix bj( num_hid , 1 ) ; bj.fill( 0.0 ) ; // Octave code ---> bj = -1 + 0.01 * randn( num_hid , 1 ) ; // set to favour units being "off"
// The autoregressive weights; A( : , : , j ) is the weight from t-j to the visible
NDArray A ( dim_vector( num_dims , num_dims , nt ) ) ; A.fill( 0.0 ) ; // Octave code ---> A = 0.01 * randn( num_dims ,num_dims , nt ) ;
// The weights from previous time-steps to the hiddens; B( : , : , j ) is the weight from t-j to the hidden layer
NDArray B ( dim_vector( num_hid , num_dims , nt ) ) ; B.fill( 0.0 ) ; // Octave code ---> B = 0.01 * randn( num_hid , num_dims , nt ) ;
// Declare MersenneTwister random values
MTRand mtrand1 ;
double rand_norm_value ;
double rand_uniform_value ;
// nested loops to fill w, bi, bj, A and B with initial random values
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_hid ; ii_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
bj ( ii_nest_loop , 0 ) = -1.0 + rand_norm_value ; // set to favour units being "off"
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
w ( ii_nest_loop , jj_nest_loop ) = rand_norm_value ;
}
}
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_dims ; ii_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
bi ( ii_nest_loop , 0 ) = rand_norm_value ;
}
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_dims ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
A ( ii_nest_loop , jj_nest_loop , hh ) = rand_norm_value ;
} // end of jj_nest_loop loop
} // end of ii_nest_loop loop
} // end of hh loop
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_hid ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
{
rand_norm_value = mtrand1.randNorm( 0.0 , 0.01 ) ; // mean of zero and std of 0.01
B ( ii_nest_loop , jj_nest_loop , hh ) = rand_norm_value ;
} // end of jj_nest_loop loop
} // end of ii_nest_loop loop
} // end of hh loop
// keep previous updates around for momentum
Matrix w_update ( num_hid , num_dims ) ; w_update.fill( 0.0 ) ; // Octave code ---> w_update = zeros( size( w ) ) ;
Matrix bi_update ( num_dims , 1 ) ; bi_update.fill( 0.0 ) ; // Octave code ---> bi_update = zeros( size( bi ) ) ;
Matrix bj_update ( num_hid , 1 ) ; bj_update.fill( 0.0 ) ; // Octave code ---> bj_update = zeros( size( bj ) ) ;
NDArray A_update ( dim_vector( num_dims , num_dims , nt ) ) ; A_update.fill( 0.0 ) ; // Octave code ---> A_update = zeros( size( A ) ) ;
Matrix A_extract ( num_dims , num_dims ) ; A_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix A_update_extract ( num_dims , num_dims ) ; A_update_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray B_update ( dim_vector( num_hid , num_dims , nt ) ) ; B_update.fill( 0.0 ) ; // Octave code ---> B_update = zeros( size( B ) ) ;
Matrix B_extract ( num_hid , num_dims ) ; B_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix B_update_extract ( num_hid , num_dims ) ; B_update_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
// data is a nt+1 dimensional array with current and delayed data corresponding to mini-batches
// num_cases = minibatch( batch ).length () ;
num_cases = minibatch.rows () ;
NDArray data ( dim_vector( num_cases , num_dims , nt + 1 ) ) ; data.fill( 0.0 ) ; // Octave code ---> data = zeros( num_cases , num_dims , nt + 1 ) ;
Matrix data_extract ( num_cases , num_dims ) ; data_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_transpose ( num_dims , num_cases ) ; data_transpose.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_0 ( num_cases , num_dims ) ; data_0.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArray
Matrix data_0_transpose ( num_dims , num_cases ) ; data_0_transpose.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray A_grad ( dim_vector( num_dims , num_dims , nt ) ) ; A_grad.fill( 0.0 ) ; // for A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;
Matrix A_grad_extract ( num_dims , num_dims ) ; A_grad_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray neg_A_grad ( dim_vector( num_dims , num_dims , nt ) ) ; neg_A_grad.fill( 0.0 ) ; // for neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
Matrix neg_A_grad_extract ( num_dims , num_dims ) ; neg_A_grad_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray B_grad ( dim_vector( num_hid , num_dims , nt ) ) ; B_grad.fill( 0.0 ) ; // for B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
Matrix B_grad_extract ( num_hid , num_dims ) ; B_grad_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
NDArray neg_B_grad ( dim_vector( num_hid , num_dims , nt ) ) ; neg_B_grad.fill( 0.0 ) ; // for neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;
Matrix neg_B_grad_extract ( num_hid , num_dims ) ; neg_B_grad_extract.fill( 0.0 ) ; // matrix for intermediate calculations because cannot directly "slice" into a NDArrays
Array<octave_idx_type> p ( dim_vector ( 3 , 1 ) , 0 ) ; // 3 element ( row , col , page ) index vector for writing to pages of the A_grad and B_grad NDArrays
// end of initialization of matrices
// %%%%%%%%% THE MAIN CODE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for ( octave_idx_type epoch ( 0 ) ; epoch < num_epochs ; epoch++ ) // main epoch loop
{
// // errsum = 0 ; % keep a running total of the difference between data and recon
//
for ( octave_idx_type batch ( 0 ) ; batch < minibatch.cols () ; batch++ ) // Octave code ---> int num_batches = minibatch.length () ;
{
// %%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// // These next nested loops fill the data NDArray with the values from batchdata indexed
// // by the column values of minibatch. Octave code equivalent given below:-
// // Octave code ---> mb = minibatch{ batch } ; % caches the indices
// // Octave code ---> data( : , : , 1 ) = batchdata( mb , : ) ;
// // Octave code ---> for hh = 1 : nt
// // Octave code ---> data( : , : , hh + 1 ) = batchdata( mb - hh , : ) ;
// // Octave code ---> end
for ( octave_idx_type hh ( 0 ) ; hh < nt + 1 ; hh++ )
{
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_cases ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_dims ; jj_nest_loop++ )
{
data( ii_nest_loop , jj_nest_loop , hh ) = batchdata( minibatch( ii_nest_loop , batch ) - 1 - hh , jj_nest_loop ) ; // -1 for .oct zero based indexing vs Octave's 1 based
} // end of jj_nest_loop loop
} // end of ii_nest_loop loop
} // end of hh loop
// The above data filling loop could perhaps be implemented more quickly using the fortran_vec method as below
// http://stackoverflow.com/questions/28900153/create-a-ndarray-in-an-oct-file-from-a-double-pointer
// NDArray a (dim_vector(dim[0], dim[1], dim[2]));
//
// Then loop over (i, j, k) indices to copy the cube to the octave array
//
// double* a_vec = a.fortran_vec ();
// for (int i = 0; i < dim[0]; i++) {
// for (int j = 0; j < dim[1]; j++) {
// for (int k = 0; k < dim[2]; k++) {
// *a_vec++ = armadillo_cube(i, j, k);
// }
// }
// }
// calculate contributions from directed autoregressive connections and contributions from directed visible-to-hidden connections
bi_star.fill( 0.0 ) ; // Octave code ---> bi_star = zeros( num_dims , num_cases ) ; ( matrix declared earlier in code above )
bj_star.fill( 0.0 ) ; // Octave code ---> bj_star = zeros( num_hid , num_cases ) ; ( matrix declared earlier in code above )
// The code below calculates two separate Octave code loops in one nested C++ loop structure, namely
// Octave code ---> for hh = 1 : nt
// bi_star = bi_star + A( : , : , hh ) * data( : , : , hh + 1 )' ;
// end
// and
// Octave code ---> for hh = 1:nt
// bj_star = bj_star + B( : , : , hh ) * data( : , : , hh + 1 )' ;
// end
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
// fill the intermediate calculation matrices
A_extract = A.page ( hh ) ;
B_extract = B.page ( hh ) ;
data_transpose = ( data.page ( hh + 1 ) ).transpose () ;
// add up the hh different matrix multiplications
bi_star += A_extract * data_transpose ;
bj_star += B_extract * data_transpose ;
} // end of hh loop
// extract and pre-calculate to save time in later computations
data_0 = data.page ( 0 ) ;
data_0_transpose = data_0.transpose () ;
// Calculate "posterior" probability -- hidden state being on ( Note that it isn't a true posterior )
// Octave code ---> eta = w * ( data( : , : , 1 ) ./ gsd )' + ... % bottom-up connections
// repmat( bj , 1 , num_cases ) + ... % static biases on unit
// bj_star ; % dynamic biases
// get repmat( bj , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_cases ; jj_nest_loop++ ) // loop over the columns
{
repmat_bj.insert( bj , 0 , jj_nest_loop ) ;
}
eta = w * data_0_transpose + repmat_bj + bj_star ;
// h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
// -exponate the eta matrix
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
{
eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
}
}
// element division A./B == quotient(A,B)
h_posteriors = quotient( ones , ( ones + eta ) ) ;
// Activate the hidden units
// Octave code ---> hid_states = double( h_posteriors' > rand( num_cases , num_hid ) ) ;
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < hid_states.rows () ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < hid_states.cols () ; jj_nest_loop++ )
{
rand_uniform_value = mtrand1.randDblExc() ; // a real number in the range 0 to 1, excluding both 0 and 1
hid_states( ii_nest_loop , jj_nest_loop ) = h_posteriors( jj_nest_loop , ii_nest_loop ) > rand_uniform_value ? 1.0 : 0.0 ;
}
} // end of hid_states loop
// Calculate positive gradients ( note w.r.t. neg energy )
// Octave code ---> w_grad = hid_states' * ( data( : , : , 1 ) ./ gsd ) ;
// bi_grad = sum( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
// bj_grad = sum( hid_states , 1 )' ;
w_grad = hid_states.transpose () * data_0 ;
// get repmat( bi , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < num_cases ; jj_nest_loop++ ) // loop over the columns
{
repmat_bi.insert( bi , 0 , jj_nest_loop ) ;
}
bi_grad = ( data_0_transpose - repmat_bi - bi_star ).sum ( 1 ) ;
bj_grad = ( hid_states.sum ( 0 ) ).transpose () ;
// Octave code ---> for hh = 1 : nt
// A_grad( : , : , hh ) = ( data( : , : , 1 )' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
// B_grad( : , : , hh ) = hid_states' * data( : , : , hh + 1 ) ;
// end
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
p( 2 ) = hh ; // set Array p to write to page hh of A_grad and B_grad NDArrays
data_extract = data.page ( hh + 1 ) ; // get the equivalent of data( : , : , hh + 1 )
A_grad.insert( ( data_0_transpose - repmat_bi - bi_star ) * data_extract , p ) ;
B_grad.insert( hid_states.transpose () * data_extract , p ) ;
}
// the above code comes from http://stackoverflow.com/questions/29572075/how-do-you-create-an-arrayoctave-idx-type-in-an-oct-file
// e.g.
//
// Array<octave_idx_type> p (dim_vector (3, 1));
// int n = 2;
// dim_vector dim(n, n, 3);
// NDArray a_matrix(dim);
//
// for (octave_idx_type i = 0; i < n; i++)
// for (octave_idx_type j = 0; j < n; j++)
// a_matrix(i,j, 1) = (i + 1) * 10 + (j + 1);
//
// std::cout << a_matrix;
//
// Matrix b_matrix = Matrix (n, n);
// b_matrix(0, 0) = 1;
// b_matrix(0, 1) = 2;
// b_matrix(1, 0) = 3;
// b_matrix(1, 1) = 4;
// std::cout << b_matrix;
//
// Array<octave_idx_type> p (dim_vector (3, 1), 0);
// p(2) = 2;
// a_matrix.insert (b_matrix, p);
//
// std::cout << a_matrix;
// %%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Activate the visible units
// Find the mean of the Gaussian
// Octave code ---> topdown = gsd .* ( hid_states * w ) ;
topdown = hid_states * w ;
// This is the mean of the Gaussian. Instead of properly sampling, negdata is just the mean
// If we want to sample from the Gaussian, we would add in gsd .* randn( num_cases , num_dims ) ;
// Octave code ---> negdata = topdown + ... % top down connections
// repmat( bi' , num_cases , 1 ) + ... % static biases
// bi_star' ; % dynamic biases
// get repmat( bi' , 1 , num_cases ) ( http://stackoverflow.com/questions/19273053/write-to-a-matrix-in-oct-file-without-looping?rq=1 )
bi_transpose = bi.transpose () ;
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < num_cases ; ii_nest_loop++ ) // loop over the rows
{
repmat_bi_transpose.insert( bi_transpose , ii_nest_loop , 0 ) ;
}
negdata = topdown + repmat_bi_transpose + bi_star.transpose () ;
// Now conditional on negdata, calculate "posterior" probability for hiddens
// Octave code ---> eta = w * ( negdata ./ gsd )' + ... % bottom-up connections
// repmat( bj , 1 , num_cases ) + ... % static biases on unit (no change)
// bj_star ; % dynamic biases (no change)
negdata_transpose = negdata.transpose () ; // to save repetition of transpose
eta = w * negdata_transpose + repmat_bj + bj_star ;
// h_posteriors = 1 ./ ( 1 + exp( -eta ) ) ; % logistic
// -exponate the eta matrix
for ( octave_idx_type ii_nest_loop ( 0 ) ; ii_nest_loop < eta.rows () ; ii_nest_loop++ )
{
for ( octave_idx_type jj_nest_loop ( 0 ) ; jj_nest_loop < eta.cols () ; jj_nest_loop++ )
{
eta ( ii_nest_loop , jj_nest_loop ) = exp( - eta ( ii_nest_loop , jj_nest_loop ) ) ;
}
}
// element division A./B == quotient(A,B)
h_posteriors = quotient( ones , ( ones + eta ) ) ;
// Calculate negative gradients
// Octave code ---> neg_w_grad = h_posteriors * ( negdata ./ gsd ) ; % not using activations
neg_w_grad = h_posteriors * negdata ; // not using activations
// Octave code ---> neg_bi_grad = sum( negdata' - repmat( bi , 1 , num_cases ) - bi_star , 2 ) ./ gsd^2 ;
neg_bi_grad = ( negdata_transpose - repmat_bi - bi_star ).sum ( 1 ) ;
// Octave code ---> neg_bj_grad = sum( h_posteriors , 2 ) ;
neg_bj_grad = h_posteriors.sum ( 1 ) ;
// Octave code ---> for hh = 1 : nt
// neg_A_grad( : , : , hh ) = ( negdata' - repmat( bi , 1 , num_cases ) - bi_star ) ./ gsd^2 * data( : , : , hh + 1 ) ;
// neg_B_grad( : , : , hh ) = h_posteriors * data( : , : , hh + 1 ) ;
// end
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
p( 2 ) = hh ; // set Array p to write to page hh of A_grad and B_grad NDArrays
data_extract = data.page ( hh + 1 ) ; // get the equivalent of data( : , : , hh + 1 )
neg_A_grad.insert( ( negdata_transpose - repmat_bi - bi_star ) * data_extract , p ) ;
neg_B_grad.insert( h_posteriors * data_extract , p ) ;
} // end of hh loop
// %%%%%%%%% END NEGATIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Octave code ---> err = sum( sum( ( data( : , : , 1 ) - negdata ) .^2 ) ) ;
// Not used errsum = err + errsum ;
// Octave code ---> if ( epoch > 5 ) % use momentum
// momentum = mom ;
// else % no momentum
// momentum = 0 ;
// end
// momentum was initialised to 0.0, but on the 6th iteration of epoch, set momentum to 0.9
if ( epoch == 5 ) // will only be true once, after which momentum will == 0.9
{
momentum_w.fill( 0.9 ) ;
momentum_bi.fill( 0.9 ) ;
momentum_bj.fill( 0.9 ) ;
momentum_A.fill( 0.9 ) ;
momentum_B.fill( 0.9 ) ;
}
// %%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Octave code ---> w_update = momentum * w_update + epsilon_w * ( ( w_grad - neg_w_grad ) / num_cases - w_decay * w ) ;
w_update = product( momentum_w , w_update ) + product( epsilon_w , quotient( ( w_grad - neg_w_grad ) , num_cases_matrices_w_and_B ) - product( w_decay , w ) ) ;
// Octave code ---> bi_update = momentum * bi_update + ( epsilon_bi / num_cases ) * ( bi_grad - neg_bi_grad ) ;
bi_update = product( momentum_bi , bi_update ) + product( quotient( epsilon_bi , num_cases_matrix_bi ) , ( bi_grad - neg_bi_grad ) ) ;
// Octave code ---> bj_update = momentum * bj_update + ( epsilon_bj / num_cases ) * ( bj_grad - neg_bj_grad ) ;
bj_update = product( momentum_bj , bj_update ) + product( quotient( epsilon_bj , num_cases_matrix_bj ) , ( bj_grad - neg_bj_grad ) ) ;
// The following two Octave code loops are combined into the single .oct loop that follows them
//
// Octave code ---> for hh = 1 : nt
// A_update( : , : , hh ) = momentum * A_update( : , : , hh ) + epsilon_A * ( ( A_grad( : , : , hh ) - neg_A_grad( : , : , hh ) ) / num_cases - w_decay * A( : , : , hh ) ) ;
// B_update( : , : , hh ) = momentum * B_update( : , : , hh ) + epsilon_B * ( ( B_grad( : , : , hh ) - neg_B_grad( : , : , hh ) ) / num_cases - w_decay * B( : , : , hh ) ) ;
// end
// Octave code ---> for hh = 1 : nt
// A( : , : , hh ) = A( : , : , hh ) + A_update( : , : , hh ) ;
// B( : , : , hh ) = B( : , : , hh ) + B_update( : , : , hh ) ;
// end
for ( octave_idx_type hh ( 0 ) ; hh < nt ; hh++ )
{
p( 2 ) = hh ;
A_update_extract = A_update.page ( hh ) ;
A_grad_extract = A_grad.page ( hh ) ;
neg_A_grad_extract = neg_A_grad.page ( hh ) ;
A_extract = A.page ( hh ) ;
A_update.insert( product( momentum_A , A_update_extract ) + product( epsilon_A , ( quotient( ( A_grad_extract - neg_A_grad_extract ) , num_cases_matrix_A ) - product( w_decay_A , A_extract ) ) ) , p ) ;
A_update_extract = A_update.page ( hh ) ;
A.insert( A_extract + A_update_extract , p ) ;
B_update_extract = B_update.page ( hh ) ;
B_grad_extract = B_grad.page ( hh ) ;
neg_B_grad_extract = neg_B_grad.page ( hh ) ;
B_extract = B.page ( hh ) ;
B_update.insert( product( momentum_B , B_update_extract ) + product( epsilon_B , ( quotient( ( B_grad_extract - neg_B_grad_extract ) , num_cases_matrices_w_and_B ) - product( w_decay_B , B_extract ) ) ) , p ) ;
B_update_extract = B_update.page ( hh ) ;
B.insert( B_extract + B_update_extract , p ) ;
} // end of hh loop
// Octave code ---> w = w + w_update ;
// bi = bi + bi_update ;
// bj = bj + bj_update ;
w += w_update ;
bi += bi_update ;
bj += bj_update ;
// %%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
} // end of batch loop
} // end of main epoch loop
// return the values w , bj , bi , A , B
retval_list(4) = B ;
retval_list(3) = A ;
retval_list(2) = bi ;
retval_list(1) = bj ;
retval_list(0) = w ;
return retval_list ;
} // end of function
The code is heavily commented throughout. The .oct code for the binary_crbm.m function will follow in due course.
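As a quick, rough way to sanity check the speed up over the pure Octave version, something along the following lines could be timed at the Octave prompt. This is only a sketch: the exact argument list of gaussian_crbm.m is as given in my previous post, and the trailing gsd argument of 1 in the call below is an assumption made purely for illustration. Note also that because each version draws its own random initial weights the returned weights will not be identical, so it is the training time, not the outputs, that is being compared.
% rough timing sketch; the gaussian_crbm.m call signature is assumed for illustration
tic ;
[ w_m , bj_m , bi_m , A_m , B_m ] = gaussian_crbm( batchdata , minibatch , nt , 100 , 20 , 1 ) ;
m_file_time = toc ;
tic ;
[ w , bj , bi , A , B ] = cc_gaussian_crbm_mersenne( batchdata , minibatch , nt , 100 , 20 ) ;
oct_file_time = toc ;
printf( ".m file time %f secs, .oct file time %f secs, speed up x%.1f\n" , m_file_time , oct_file_time , m_file_time / oct_file_time ) ;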
Sunday, 25 May 2014
Updated Cauchy-Schwarz Matching Algorithm
Following on from my previous post, below is a code box showing a slightly improved Cauchy-Schwarz matching algorithm: improved in the sense that, when test runs of the previous post's version are compared with this version, this implementation has a slightly better effect size over random.
function [ top_matches ] = rolling_cauchy_schwarz_matching_algo_2( open_ch, high_ch, low_ch, close_ch, period )
% pre-allocate vectors in memory
cauchy_schwarz_values = zeros( size(close_ch,1) , 1 ) ;
top_matches = zeros( size(close_ch,1), 100 ) ;
% select price bar to train nn on
for jj = size(close_ch,1)-250 : size(close_ch,1)
lookback = period( jj ) ;
sample_to_match = [ close_ch( jj-lookback : jj )' high_ch( jj-9 : jj )' low_ch( jj-9 : jj )' ( close_ch( jj-4 : jj ) - open_ch( jj-4 : jj ) )' ] ;
norm_sample_to_match = norm( sample_to_match ) ;
% for this jj train_bar, calculate cauchy_schwarz matching values in the historical record up to index jj-2
for ii = 50 : jj - 2
cauchy_schwarz_values(ii) = abs( sample_to_match * [ close_ch( ii-lookback : ii ) ; high_ch( ii-9 : ii ) ; low_ch( ii-9 : ii ) ; ( close_ch( ii-4 : ii ) - open_ch( ii-4 : ii ) ) ] ) / ( norm_sample_to_match * norm( [ close_ch( ii-lookback : ii ) ; high_ch( ii-9 : ii ) ; low_ch( ii-9 : ii ) ; ( close_ch( ii-4 : ii ) - open_ch( ii-4 : ii ) ) ] ) ) ;
end % end of ii loop
% get the top 100 matches for this price bar
[ ~ , sort_index ] = sort( cauchy_schwarz_values ) ;
top_matches( jj, : ) = sort_index( end-99 : end )' ;
end % end of jj loop
end % end of function
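For clarity, the value computed in the inner loop above is just the ratio abs( a * b ) / ( norm( a ) * norm( b ) ), which by the Cauchy-Schwarz inequality is bounded above by 1, with equality only when one window is a scalar multiple of the other; in other words it is the absolute cosine similarity between a candidate historical window and the current window. A toy numerical illustration:
% toy illustration of the matching score: values close to 1 mean the two
% windows are nearly proportional, i.e. have nearly the same shape
a = [ 1 2 3 4 5 ] ; % current window as a row vector
b = [ 2 4 6 8 10 ]' ; % a perfectly proportional historical window
c = [ 5 1 4 2 3 ]' ; % a poor match
score_b = abs( a * b ) / ( norm( a ) * norm( b ) ) % equals 1
score_c = abs( a * c ) / ( norm( a ) * norm( c ) ) % approximately 0.76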
The inputs are channel normalised prices, with the length of the channel being adaptive to the dominant cycle period. This function is called as part of a rolling neural net training regime to select the top n (n = 100 in this case) matches in the historical record as training data. The actual NN training code is a close adaptation of the code in my neural net walkforward training post, but with a couple of important caveats which are discussed below.
Firstly, when training a feedforward neural network it is normal for a certain number of samples to be held out of the training set for use as a cross validation set. The point of this is to ensure that the trained NN will generalise well to as yet unseen data. In the case of my rolling training regime this does not apply. The NN that is being trained for the "current bar" will be used once to classify the "current bar" and then thrown away. The "next bar" will have a completely new NN trained specifically for it, which in its turn will be discarded, and so on along the whole price history. There is no need to ensure generalisation of any specifically trained NN. This being the case, all the training set examples are used in training, and early stopping is implemented by a crude heuristic of classification accuracy on the training set: training stops when the classification error rate on the whole training set is <= 5%. Further experience with this may lead me to make some adjustments, but for now this is what I am going with.
A second reason for adopting this approach stems from my reading of this book, wherein it is stated that "traditional" machine learning error metrics can be misleading on financial time series. It cites a (theoretical?) example of a profitable trading system that has been trained/optimised for maximum profit but has a counter-intuitive, negative R-squared. The explanation for this lies in the heavy tails of price distributions. It is in these tails that the extreme returns reside and where the big profits/losses are to be made. However, by using a more traditional error metric such as least squares, an ML algorithm might concentrate on the central area of a price distribution in order to reduce the error metric on the majority of price instances and thereby ignore the tails, producing a nice, low error but a useless system. The converse can also be true: a good system's least squares metric can be rubbish while its relevant performance metric (max profit, min drawdown, risk adjusted return etc.) is great.
It is for these reasons that I have adopted my current approach.
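To make the early stopping heuristic concrete, below is a minimal Octave sketch of the kind of training loop described above. The initialise_nn, train_one_epoch and nn_classify functions are hypothetical placeholders for the actual walkforward NN training code; only the stopping rule itself is the point here.
% minimal sketch of the early stopping heuristic; initialise_nn, train_one_epoch
% and nn_classify are hypothetical placeholder functions
max_epochs = 1000 ;
target_error_rate = 0.05 ; % stop once training set error rate is <= 5%
nn_weights = initialise_nn( training_features , training_targets ) ;
for epoch = 1 : max_epochs
  nn_weights = train_one_epoch( nn_weights , training_features , training_targets ) ;
  predicted = nn_classify( nn_weights , training_features ) ;
  error_rate = mean( predicted != training_targets ) ;
  if ( error_rate <= target_error_rate ) % crude heuristic: good enough, stop
    break ;
  end
end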