Before this code is run there is a little data pre-processing to create the data batches and their indexes, and afterwards the generated weights are used as the initial starting weights for the CRBM training proper.
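For completeness, the sort of pre-processing I have in mind looks roughly like the following. This is only a minimal sketch, not the actual pre-processing code: raw_data, the csv source, the normalisation and the batchsize value are illustrative assumptions, and I have assumed that minibatch is a cell array of row indexes. Only batchdata, minibatch and n1 are actually referenced by the code further below.

% sketch of creating normalised training data and a cell array of minibatch indexes
% ( raw_data , batchsize etc. are placeholder names, not the real pre-processing code )
raw_data = csvread( 'features.csv' ) ;                            % hypothetical source of raw features
batchdata = ( raw_data - mean( raw_data ) ) ./ std( raw_data ) ;  % zero mean, unit variance for the gaussian rbm
n1 = 3 ;                                                          % "order" of the gaussian layer of the crbm ( assumed value )
batchsize = 100 ;                                                 % assumed minibatch size
shuffled = randperm( size( batchdata , 1 ) ) ;                    % random ordering of the training rows
num_batches = floor( size( batchdata , 1 ) / batchsize ) ;
minibatch = cell( num_batches , 1 ) ;
for jj = 1 : num_batches
minibatch{ jj } = shuffled( ( jj - 1 ) * batchsize + 1 : jj * batchsize ) ; % row indexes for this batch
end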
% PRE-TRAINING FOR THE VISIBLE TO HIDDEN AND THE VISIBLE TO VISIBLE WEIGHTS %%%%

% First create a training set and target set for the pre-training
dAuto_Encode_targets = batchdata ; dAuto_Encode_training_data = [] ;

% loop to create the dAuto_Encode_training_data ( n1 == "order" of the gaussian layer of crbm )
for ii = 1 : n1
dAuto_Encode_training_data = [ dAuto_Encode_training_data shift( batchdata , ii ) ] ;
end

% now delete the first n1 rows due to circular shift induced mismatch of data and targets
dAuto_Encode_targets( 1 : n1 , : ) = [] ;
dAuto_Encode_training_data( 1 : n1 , : ) = [] ;

% DO RBM PRE-TRAINING FOR THE BOTTOM UP DIRECTED WEIGHTS W %%%%%%%%%%%%%%%%%%%%%
% use rbm trained initial weights instead of using random initialisation for weights
% Doing this because we are not using regularisation in the autoencoder pre-training
epochs = 5000 ;
hidden_layer_size = 2 * size( dAuto_Encode_targets , 2 ) ;
w_weights = gaussian_rbm( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size ) ;
A_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , size( dAuto_Encode_targets , 2 ) ) ;
B_weights = gaussian_rbm( dAuto_Encode_training_data , minibatch , epochs , hidden_layer_size ) ;

% create weight update matrices
A_weights_update = zeros( size( A_weights ) ) ;
B_weights_update = zeros( size( B_weights ) ) ;
w_weights_update = zeros( size( w_weights ) ) ;

% for adagrad
historical_A = zeros( size( A_weights ) ) ;
historical_B = zeros( size( B_weights ) ) ;
historical_w = zeros( size( w_weights ) ) ;

% set some training parameters
n = size( dAuto_Encode_training_data , 1 ) ; % number of training examples in dAuto_Encode_training_data
input_layer_size = size( dAuto_Encode_training_data , 2 ) ;
fudge_factor = 1e-6 ; % for numerical stability for adagrad
learning_rate = 0.1 ; % will be changed to 0.01 after 50 iters through epoch loop
mom = 0 ; % will be changed to 0.9 after 50 iters through epoch loop
noise = 0.5 ;
epochs = 1000 ;
cost = zeros( epochs , 1 ) ;
lowest_cost = inf ;

% Stochastic Gradient Descent training over dAuto_Encode_training_data
for iter = 1 : epochs

    % change momentum and learning_rate after 50 iters
    if iter == 50
    mom = 0.9 ;
    learning_rate = 0.01 ;
    end

    index = randperm( n ) ; % randomise the order of training examples

    for training_example = 1 : n

        % Select data for this training batch
        tmp_X = dAuto_Encode_training_data( index( training_example ) , : ) ;
        tmp_T = dAuto_Encode_targets( index( training_example ) , : ) ;

        % Randomly black out some of the input training data
        tmp_X( rand( size( tmp_X ) ) < noise ) = 0 ;

        % feedforward tmp_X through B_weights and get sigmoid e.g. ret = 1.0 ./ ( 1.0 + exp(-input) )
        tmp_X_through_sigmoid = 1.0 ./ ( 1.0 .+ exp( - B_weights * tmp_X' ) ) ;

        % Randomly black out some of tmp_X_through_sigmoid for dropout training
        tmp_X_through_sigmoid( rand( size( tmp_X_through_sigmoid ) ) < noise ) = 0 ;

        % feedforward tmp_X through A_weights and add to tmp_X_through_sigmoid * w_weights for linear output layer
        final_output_layer = ( tmp_X * A_weights' ) .+ ( tmp_X_through_sigmoid' * w_weights ) ;

        % now do backpropagation
        % this is the derivative of weights for the linear final_output_layer
        delta_out = ( tmp_T - final_output_layer ) ;

        % NOTE! gradient of sigmoid function g = sigmoid(z) .* ( 1.0 .- sigmoid(z) )
        sig_grad = tmp_X_through_sigmoid .* ( 1 .- tmp_X_through_sigmoid ) ;

        % backpropagation only through the w_weights that are connected to tmp_X_through_sigmoid
        delta_hidden = ( w_weights * delta_out' ) .* sig_grad ;

        % apply deltas from backpropagation with adagrad for the weight updates
        historical_A = historical_A .+ ( delta_out' * tmp_X ).^2 ;
        A_weights_update = mom .* A_weights_update .+ ( learning_rate .* ( delta_out' * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_A ) ) ;

        historical_w = historical_w .+ ( tmp_X_through_sigmoid * delta_out ).^2 ;
        w_weights_update = mom .* w_weights_update .+ ( learning_rate .* ( tmp_X_through_sigmoid * delta_out ) ) ./ ( fudge_factor .+ sqrt( historical_w ) ) ;

        historical_B = historical_B .+ ( delta_hidden * tmp_X ).^2 ;
        B_weights_update = mom .* B_weights_update .+ ( learning_rate .* ( delta_hidden * tmp_X ) ) ./ ( fudge_factor .+ sqrt( historical_B ) ) ;

        % update the weight matrices with weight_updates
        A_weights = A_weights + A_weights_update ;
        B_weights = B_weights + B_weights_update ;
        w_weights = w_weights + w_weights_update ;

    end % end of training_example loop

    % feedforward with this epoch's updated weights
    epoch_trained_tmp_X_through_sigmoid = ( 1.0 ./ ( 1.0 .+ exp( -( ( B_weights ./ 2 ) * dAuto_Encode_training_data' ) ) ) ) ;
    epoch_trained_output = ( dAuto_Encode_training_data * ( A_weights ./ 2 )' ) .+ ( epoch_trained_tmp_X_through_sigmoid' * ( w_weights ./ 2 ) ) ;

    % get sum squared error cost
    cost( iter , 1 ) = sum( sum( ( dAuto_Encode_targets .- epoch_trained_output ) .^ 2 ) ) ;

    % record best so far
    if cost( iter , 1 ) <= lowest_cost
    lowest_cost = cost( iter , 1 ) ;
    best_A = A_weights ./ 2 ;
    best_B = B_weights ./ 2 ;
    best_w = w_weights ./ 2 ;
    end

end % end of backpropagation loop

lowest_cost % print final cost to terminal

graphics_toolkit( "qt" ) ;
figure(1) ; plot( cost , 'r' , 'linewidth' , 3 ) ; % and plot the cost curve

% plot weights
graphics_toolkit( "gnuplot" ) ;
figure(2) ; surf( best_A ) ; title( 'Best A Weights' ) ;
figure(3) ; surf( best_B ) ; title( 'Best B Weights' ) ;
figure(4) ; surf( best_w ) ; title( 'Best w Weights' ) ;

% END OF CRBM WEIGHT PRE-TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
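The gaussian_rbm function called above is not shown in this post. Purely as an illustration of the idea, and not the actual function I am using, a Gaussian-visible RBM trained with one step of contrastive divergence (CD-1), with no bias units and saved in its own .m file, might look roughly like this. It returns a weight matrix of size ( number of hidden units x number of visible units ), which is what the calls above expect, and it assumes the data has already been normalised to zero mean and unit variance.

function W = gaussian_rbm_sketch( data , minibatch , epochs , num_hidden )
% rough CD-1 sketch of a gaussian-visible rbm, NOT the gaussian_rbm used above
% assumes minibatch is a cell array of row indexes into data
num_visible = size( data , 2 ) ;
W = 0.01 .* randn( num_hidden , num_visible ) ; % small random initial weights
rbm_learning_rate = 0.001 ;
for epoch = 1 : epochs
 for batch = 1 : numel( minibatch )
 v0 = data( minibatch{ batch } , : ) ;             % visible data for this minibatch
 h0_probs = 1.0 ./ ( 1.0 + exp( - v0 * W' ) ) ;    % hidden unit probabilities
 h0_states = h0_probs > rand( size( h0_probs ) ) ; % sampled binary hidden states
 v1 = h0_states * W ;                              % gaussian visible reconstruction ( mean values )
 h1_probs = 1.0 ./ ( 1.0 + exp( - v1 * W' ) ) ;    % hidden probabilities from the reconstruction
 W = W + rbm_learning_rate .* ( h0_probs' * v0 - h1_probs' * v1 ) ./ size( v0 , 1 ) ; % CD-1 update
 end
end
end

With such a function the calls above would read, for example, w_weights = gaussian_rbm_sketch( dAuto_Encode_targets , minibatch , epochs , hidden_layer_size ) ;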
I will not discuss the code in much detail in this post as it will probably be heavily revised in the near future. However, a few things are worth pointing out:
- I have used Restricted Boltzmann Machine pre-training to provide the initial weights for the autoencoder training, rather than random initialisation
- I have applied dropout to the single hidden layer in the autoencoder
- early stopping, momentum and AdaGrad are employed (the AdaGrad update is sketched out after this list)
- bias units are not included
- the code plots some charts for visual inspection (see below), which will probably not be present in the final version of the code
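To make the third point concrete: inside the training loop the weight update is plain AdaGrad combined with momentum, with the squared gradients accumulated per element to scale the learning rate, while early stopping simply amounts to keeping the weights from the epoch with the lowest cost. Stripped down to a single generic weight matrix the update looks like this (W and grad here are generic stand-ins, not the variable names used in the code above):

W = zeros( 5 , 10 ) ;               % a generic weight matrix
W_update = zeros( size( W ) ) ;     % momentum buffer
historical_W = zeros( size( W ) ) ; % accumulated squared gradients for adagrad
fudge_factor = 1e-6 ;               % for numerical stability
learning_rate = 0.1 ;
mom = 0.9 ;

grad = randn( size( W ) ) ;         % stand-in for a backpropagated gradient

historical_W = historical_W + grad .^ 2 ;                                                           % accumulate squared gradients
W_update = mom .* W_update + ( learning_rate .* grad ) ./ ( fudge_factor + sqrt( historical_W ) ) ; % per element adaptive step plus momentum
W = W + W_update ;

The dropout in the second point is handled in the code by randomly zeroing inputs and hidden activations with probability noise = 0.5 during training, and halving the weights whenever the network is evaluated.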
The lower pane shows the evolution of the cost on the y axis against the training epoch on the x axis. The upper three panes show surface plots of the A, B and w weights respectively: A are the direct, auto-regressive weights from the training data to the target data without any intervening hidden layer, B are the weights from the input to the hidden sigmoid layer, and w are the weights from the hidden layer to the targets. It is interesting to see how the A weights remain relatively consistent across the different training sessions.
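Written out as a formula (this just restates what the evaluation step in the code computes, with each weight matrix halved to compensate for the 50% dropout), for an input row vector x holding the n1 lagged frames the pre-trained network's output is

\[
\hat{y} \;=\; x \left( \tfrac{A}{2} \right)^{\top} \;+\; \sigma\!\left( \tfrac{B}{2}\, x^{\top} \right)^{\top} \tfrac{w}{2}\,,
\qquad
\sigma(z) \;=\; \frac{1}{1 + e^{-z}}\,,
\]

where the first term is the direct auto-regressive contribution through A and the second is the non-linear contribution through B and w.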