Friday 29 November 2019

RVFL Network Results

Having let the tests outlined in my previous post run, I can say that the results are inconclusive: the optimised number of neurons in the hidden layer turns out to be one, the minimum possible while still having a hidden layer at all. This leads me to believe that the raw features, the ideal cyclic tau embedding, are sufficient in themselves for the purpose I have in mind.

To that end, I have indulged in a bit of feature engineering and written the following Octave function,
## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## <https://www.gnu.org/licenses/>.

## -*- texinfo -*- 
## @deftypefn {} {[@var{emb1}, @var{emb2}, @var{prob_matrix}] =} cyclic_embedding (@var{price}, @var{period})
##
## Inputs are a price vector and a period vector.
##
## The function normalises the price between -1 and +1 over an adaptive period lookback.
##
## The outputs are two matrices of 3 columns each - the first column is normalised price
## and the second and third are delay embeddings of Tau and 2 x Tau, Tau being equal to
## one quarter of the adaptive period vector, the theoretical ideal Tau for sinusoidal
## waveforms.
##
## The first output matrix, EMB1, normalises the Tau and 2 x Tau columns according to the
## most recent max high/min low; EMB2's Tau and 2 x Tau columns are normalised according to the
## max high/min low in force at the delay embedding time. The third output, PROB_MATRIX, holds
## Weibull-derived probabilities of price being at a cyclic high ( column 1 ), at a cyclic low
## ( column 2 ) or at neither ( column 3 ).
##
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-09-18

function [ emb1 , emb2 , prob_matrix ] = cyclic_embedding( price , period )
  
price_smooth = price ;
emb1 = repmat( price , 1 , 3 ) ;
emb2 = emb1 ;
coeffs = generalised_sgolay_filter_coeffs( 5 , 2 , 0 ) ; coeffs = coeffs' ;
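## ( presumably, after transposing, each row of coeffs holds the 5 point quadratic Savitzky-Golay
##   smoothing weights for one position in the 5 bar window: rows 1 and 2 smooth the start of the
##   series, row 3 is the centred smooth used in the main loop and rows 4 and 5 smooth the most
##   recent bars )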

## initialising loop
price_smooth( 1 : 3 ) = coeffs( 1 : 3 , : ) * price( 1 : 5 ) ;  
for ii = 4 : 48 
  price_smooth( ii ) = coeffs( 3 , : ) * price( ii - 2 : ii + 2 ) ;
endfor
price_smooth( 49 : 50 ) = coeffs( 4 : 5 , : ) * price( 46 : 50 ) ;
## end initialising loop 

coeffs( 1 : 2 , : ) = [] ;

for ii = 51 : size( price , 1 )

  price_smooth( ii - 2 : ii ) = coeffs * price( ii - 4 : ii ) ;
  max_r = max( price_smooth( ii - period( ii ) : ii ) ) ;
  min_r = min( price_smooth( ii - period( ii ) : ii ) ) ;
  
  ## period is exactly divisible by 4 ( and 2 ), e.g. 8 12 16 20 24 28 32 36 40 44 48 etc?
  if ( rem( period( ii ) , 4 ) == 0 ) 
    
  emb1( ii , 1 ) = 2 * ( ( price_smooth( ii ) - min_r ) / ( max_r - min_r ) - 0.5 ) ;
  emb1( ii , 2 ) = 2 * ( ( price_smooth( ii - round( period( ii ) / 4 ) ) - min_r ) / ( max_r - min_r ) - 0.5 ) ;
  emb1( ii , 3 ) = 2 * ( ( price_smooth( ii - round( period( ii ) / 2 ) ) - min_r ) / ( max_r - min_r ) - 0.5 ) ;
  emb2( ii , 1 ) = emb1( ii , 1 ) ;
  emb2( ii , 2 ) = emb2( ii - round( period( ii ) / 4 ) , 1 ) ;
  emb2( ii , 3 ) = emb2( ii - round( period( ii ) / 2 ) , 1 ) ;
  
  ## periods 10 14 18 22 26 30 34 38 42 46 50
  elseif ( rem( period( ii ) , 2 ) == 0 && rem( period( ii ) , 4 ) == 2 )

  emb1( ii , 1 ) = 2 * ( ( price_smooth( ii ) - min_r ) / ( max_r - min_r ) - 0.5 ) ;
  emb1( ii , 2 ) = 2 * ( ( ( 0.5*price_smooth( ii - round( period( ii ) / 4 ) ) + 0.5*price_smooth( ii - round( period( ii ) / 4 ) + 1 ) ) - min_r )...
                       / ( max_r - min_r ) - 0.5 ) ;
  emb1( ii , 3 ) = 2 * ( ( price_smooth( ii - round( period( ii ) / 2 ) ) - min_r ) / ( max_r - min_r ) - 0.5 ) ;
  emb2( ii , 1 ) = emb1( ii , 1 ) ;
  emb2( ii , 2 ) = 0.5*emb2( ii - round( period( ii ) / 4 ) , 1 ) + 0.5*emb2( ii - round( period( ii ) / 4 ) + 1 , 1 ) ;
  emb2( ii , 3 ) = emb2( ii - round( period( ii ) / 2 ) , 1 ) ;

  ## periods 9 13 17 21 25 29 33 37 41 45 49
  elseif ( rem( period( ii ) , 2 ) == 1 && rem( period( ii ) , 4 ) == 1 )

  emb1( ii , 1 ) = 2 * ( ( price_smooth( ii ) - min_r ) / ( max_r - min_r ) - 0.5 ) ;
  emb1( ii , 2 ) = 2 * ( ( ( 0.75*price_smooth( ii - round( period( ii ) / 4 ) ) + 0.25*price_smooth( ii - round( period( ii ) / 4 ) - 1 ) ) - min_r )...
                       / ( max_r - min_r ) - 0.5 ) ;
  emb1( ii , 3 ) = 2 * ( ( ( 0.5*price_smooth( ii - round( period( ii ) / 2 ) ) + 0.5*price_smooth( ii - round( period( ii ) / 2 ) + 1 ) ) - min_r )...
                       / ( max_r - min_r ) - 0.5 ) ;
  emb2( ii , 1 ) = emb1( ii , 1 ) ;
  emb2( ii , 2 ) = 0.75*emb2( ii - round( period( ii ) / 4 ) , 1 ) + 0.25*emb2( ii - round( period( ii ) / 4 ) - 1 , 1 ) ;
  emb2( ii , 3 ) = 0.5*emb2( ii - round( period( ii ) / 2 ) , 1 ) + 0.5*emb2( ii - round( period( ii ) / 2 ) + 1 , 1 ) ;
  
  ## periods 11 15 19 23 27 31 35 39 43 47
  elseif ( rem( period( ii ) , 2 ) == 1 && rem( period( ii ) , 4 ) == 3 )

  emb1( ii , 1 ) = 2 * ( ( price_smooth( ii ) - min_r ) / ( max_r - min_r ) - 0.5 ) ;
  emb1( ii , 2 ) = 2 * ( ( ( 0.75*price_smooth( ii - round( period( ii ) / 4 ) ) + 0.25*price_smooth( ii - round( period( ii ) / 4 ) + 1 ) ) - min_r )...
                       / ( max_r - min_r ) - 0.5 ) ;
  emb1( ii , 3 ) = 2 * ( ( ( 0.5*price_smooth( ii - round( period( ii ) / 2 ) ) + 0.5*price_smooth( ii - round( period( ii ) / 2 ) + 1 ) ) - min_r )...
                       / ( max_r - min_r ) - 0.5 ) ;
  emb2( ii , 1 ) = emb1( ii , 1 ) ;
  emb2( ii , 2 ) = 0.75*emb2( ii - round( period( ii ) / 4 ) , 1 ) + 0.25*emb2( ii - round( period( ii ) / 4 ) + 1 , 1 ) ;
  emb2( ii , 3 ) = 0.5*emb2( ii - round( period( ii ) / 2 ) , 1 ) + 0.5*emb2( ii - round( period( ii ) / 2 ) + 1 , 1 ) ;
  
  endif
  
endfor ## end of embedding features creation

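## the matrix product below collapses to emb2( : , 1 ) - emb2( : , 3 ), i.e. normalised price now
## minus normalised price half a period earlier, which for a sinusoidal cycle is greatest at a
## cyclic high and most negative at a cyclic low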
feature_peak = emb2 * [ 1 0 ; -1 1 ; 0 -1 ] * [ 1 ; 1 ] ;
feature_trough = zeros( size( feature_peak ) ) ;
ix = find( feature_peak < 0 ) ; 
feature_trough( ix ) = abs( feature_peak( ix ) ) ;
feature_peak( feature_peak <= 0 ) = 0 ;

## https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best
## Weibull distribution with shape = 219.68 and scale = 1.94 for turn +/- 1 bar
## Weibull distribution with shape = 85.88 and scale = 1.84 for turn +/- 2 bar
## This comes from bayesian testing of cutoff value of feature_peak/feature_trough for highs/lows +/- 1 bar.
## The function used to get these values is "bayes_train_cyclic_turn_prob_of_embedding.m" which calls
## "bayes_optim_of_cyclic_embedding_conv_function" in /home/dekalog/Documents/octave/turning_points
scale = 1.84 ; shape = 85.88 ;
prob_matrix = zeros( size( feature_peak , 1 ) , 3 ) ;
prob_matrix( : , 1 ) = wblcdf( feature_peak , scale , shape ) ;
prob_matrix( : , 2 ) = wblcdf( feature_trough , scale , shape ) ;
prob_matrix( : , 3 ) = 1 .- sum( prob_matrix( : , 1 : 2 ) , 2 ) ;

endfunction
which again is a work in progress.

This takes as input a price vector and a period vector and outputs features as per the plots in the above linked ideal cyclic tau post, plus a matrix of probabilities for price action being at a cyclic high/low. This matrix is the result of Monte Carlo Bayesian optimisation over ideal sine wave prices to get a cutoff value for the feature derived in the above function. The probability distribution used for this probability matrix is the Weibull distribution, which was chosen by following the routine outlined in this "how to determine which distribution fits my data best" forum post and using the R statistical software platform.
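By way of illustration, a minimal call on a synthetic sine wave "price" might look something like the sketch below, where the constant period of 20 and the variable names are purely illustrative and it is assumed that generalised_sgolay_filter_coeffs.m and the Octave statistics package ( for wblcdf ) are on the path,
pkg load statistics ; ## provides wblcdf
N = 500 ;
period = repmat( 20 , N , 1 ) ; ## a constant, known cycle period of 20 bars
price = sin( 2 * pi * ( 1 : N )' / 20 ) + 0.05 * randn( N , 1 ) ; ## noisy synthetic sine wave "price"
[ emb1 , emb2 , prob_matrix ] = cyclic_embedding( price , period ) ;
plot( prob_matrix( : , 1 : 2 ) ) ; ## probabilities of a cyclic high ( column 1 ) or low ( column 2 )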

The hard-coded values for the cutoff in the above code are from the results of optimisation on pure sine waves. As I write this post, there are optimisation routines running on sine waves with 20 dB of noise added.

More in due course.

Wednesday 13 November 2019

Random Vector Functional Link Networks

In my last post I briefly mentioned Random Vector Functional Link (RVFL) networks and said that this post would be about them, inspired by the fact that those preliminary results suggest a shallow rather than a deep network structure.

The idea of RVFL networks is over two decades old and has perhaps been overshadowed by the more recent Extreme Learning Machine, although as mentioned in my last post there is some controversy about plagiarism with regard to ELMs. An RVFL network is basically an ELM with additional direct connections from the input layer to the output layer. The connections from the input layer to the single hidden layer are randomly generated and then fixed, the hidden layer is concatenated with the original input layer to form a new layer, H, and the connections from H to the output layer are solved in one step, either using the Moore-Penrose inverse to get the linear least squares solution or alternatively using regularised least squares. The advantage of this closed-form approach is the fast training time compared with more general optimisation routines such as gradient descent.
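To make the one-step, closed-form solve concrete, the basic RVFL recipe can be sketched in a few lines of Octave on made-up data ( a toy illustration only, not the code from the paper ),
X = randn( 200 , 3 ) ; ## 200 samples of 3 made-up input features
y = double( sum( X , 2 ) > 0 ) ; ## a toy binary target
n_hidden = 10 ; lambda = 0.1 ; ## hidden layer size and ridge regularisation coefficient
W = rand( 3 , n_hidden ) * 2 - 1 ; ## random input to hidden weights, fixed once drawn
b = rand( 1 , n_hidden ) ; ## random hidden biases
H = exp( -abs( X * W + b ) ) ; ## radial basis activation of the hidden layer
H = [ H , X ] ; ## direct links - concatenate the hidden layer with the raw inputs
beta = ( eye( size( H , 2 ) ) / lambda + H' * H ) \ ( H' * y ) ; ## one step regularised least squares solve
y_hat = H * beta ; ## network output ( pinv( H ) * y would be the unregularised Moore-Penrose alternative )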

The above linked paper details a series of comparative tests run over various configurations of RVFL networks across a number of different data sets. Some of the main conclusions drawn are:
  1. the direct links from input to output enhance network performance
  2. whether to include output bias or not is a data dependent tunable factor
  3. the radial basis function for the hidden units always leads to better performance
  4. regularised least squares (ridge regression) performs better than the Moore-Penrose inverse
Based on the code provided by the study's authors I have written the following two objective functions for use with the BayesOpt Library.
## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## <https://www.gnu.org/licenses/>.

## -*- texinfo -*- 
## @deftypefn {} {@var{J} =} rvfl_training_of_cyclic_embedding_with_cv (@var{x})
##
## Function for Bayesian training of RVFL networks with fixed parameters of:
##
##    direct links,
##    radial basis function activation,
##    ridge regression for regularized least squares,
##
## and optimisable parameters of:
##
##    number of neurons in hidden layer,
##    lambda for the least squares regression,
##    scaling of hidden layer inputs,
##    with or without an output bias.
##
## The input X is a vector of 6 values to be optimised by the BayesOpt library
## function 'bayesoptcont.'
## The output J is the Brier Score for the test fold cross validated data.
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-11-04

function J = rvfl_training_of_cyclic_embedding_with_cv ( x )
global sample_features ; global sample_targets ;
epsilon = 1e-15 ; ## to ensure log() does not give out a nan

Nfea = size( sample_features , 2 ) ;

## check input x
if ( numel( x ) != 6 )
  error( 'The input vector x must be of length 6.' ) ;
endif

## get the parameters from input x
hidden_layer_size = floor( x( 1 ) ) ;    ## number of neurons in hidden layer
randomisation_type = floor( x( 2 ) ) ;   ## 1 == uniform, 2 == Gaussian
scale_mode = floor( x( 3 ) ) ;           ## 1 will scale the features for all neurons, 2 will scale the features for each hidden
##                                          neuron separately, 3 will scale the range of the randomization for uniform distribution
scale = x( 4 ) ;                         ## Linearly scale the random features before feeding into the nonlinear activation function. 
##                                          In this implementation, we consider the threshold which leads to 0.99 of the maximum/minimum 
##                                          value of the activation function as the saturating threshold.
##                                          scale = 0.9 means all the random features will be linearly scaled
##                                          into 0.9 * [ lower_saturating_threshold , upper_saturating_threshold ].
if_output_bias = floor( x( 5 ) + 0.5 ) ; ## Use output bias, or not? 1 == yes , 0 == no.
lambda = x( 6 ) ;                        ## the regularization coefficient lambda 

length_jj_loop = 25 ;
all_brier_values = zeros( length_jj_loop , 1 ) ;  

##rand( 'seed' , 0 ) ;
##randn( 'seed' , 0 ) ;
##U_sample_targets = unique( sample_targets ) ;
##nclass = numel( U_sample_targets ) ;

##sample_targets_temp = zeros( numel( sample_targets ) , nclass ) ; 
##
#### get the 0 - 1 one hot coding for the target,  
##for i = 1 : nclass
##  idx = sample_targets == U_sample_targets( i ) ;
##  sample_targets_temp( idx , i ) = 1 ;
##endfor

###### information for splitting into training and test sets ###############
ix_positive_targets = find( sample_targets == 1 ) ;
ix_negative_targets = ( 1 : numel( sample_targets ) )' ; 
ix_negative_targets( ix_positive_targets ) = [] ;
## split 20/80
split_no1 = round( 0.2 * numel( ix_positive_targets ) ) ;
split_no2 = round( 0.2 * numel( ix_negative_targets ) ) ;

######### get type of randomisation from input x #################
if ( randomisation_type == 1 ) ## uniform randomisation 
  
  if ( scale_mode == 3 ) ## range scaled for uniform randomisation
     Weight = scale * ( rand( Nfea , hidden_layer_size ) * 2 - 1 ) ; ## scaled uniform random input weights to hidden layer
     Bias = scale * rand( 1 , hidden_layer_size ) ;                  ## scaled random bias weights to hidden layer
  else
     Weight = rand( Nfea , hidden_layer_size ) * 2 - 1 ; ## unscaled random input weights to hidden layer
     Bias = rand( 1 , hidden_layer_size ) ;              ## unscaled random bias weights to hidden layer
  endif
    
elseif ( randomisation_type == 2 ) ## gaussian randomisation
  Weight = randn( Nfea , hidden_layer_size ) ; ## gaussian random input weights to hidden layer
  Bias = randn( 1 , hidden_layer_size ) ;      ## gaussian random bias weights to hidden layer
else
  error( 'only Gaussian and Uniform are supported' )
endif
############################################################################

## Activation Function    
Saturating_threshold = [ -2.1 , 2.1 ] ;
Saturating_threshold_activate = [ 0 , 1 ] ;

for jj = 1 : length_jj_loop

## shuffle
randperm1 = randperm( numel( ix_positive_targets) ) ;
randperm2 = randperm( numel( ix_negative_targets) ) ;
test_ix1 = ix_positive_targets( randperm1( 1 : split_no1 ) ) ;
test_ix2 = ix_negative_targets( randperm2( 1 : split_no2 ) ) ;
test_ix = [ test_ix1 ; test_ix2 ] ;
train_ix1 = ix_positive_targets( randperm1( split_no1 + 1 : end ) ) ;
train_ix2 = ix_negative_targets( randperm2( split_no2 + 1 : end ) ) ;
train_ix = [ train_ix1 ; train_ix2 ] ;

sample_targets_train = sample_targets( train_ix ) ;
sample_features_train = sample_features( train_ix , : ) ;

Nsample = size( sample_features_train , 1 ) ;

Bias_train = repmat( Bias , Nsample , 1 ) ;
H = sample_features_train * Weight + Bias_train ;

if ( scale_mode == 1 )
  ## scale the features for all neurons 
  [ H , k , b ] = Scale_feature( H , Saturating_threshold , scale ) ;
elseif ( scale_mode == 2 ) 
  ## else scale the features for each hidden neuron separately
  [ H , k , b ] = Scale_feature_separately( H , Saturating_threshold , scale ) ;
endif
  
## actual activation, the radial basis function
H = exp( -abs( H ) ) ;

if ( if_output_bias == 1 )
  ## we will use an output bias
  H = [ H , ones( Nsample , 1 ) ] ; 
endif

## the direct link scaling options, concatenate hidden layer and sample_features_train 
if ( scale_mode == 1 )
  ## scale the features for all neurons
  sample_features_train = sample_features_train .* k + b ;
  H = [ H , sample_features_train ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  [ sample_features_train , ktr , btr ] = Scale_feature_separately( sample_features_train , Saturating_threshold_activate , scale ) ;
  H = [ H , sample_features_train ] ;
else
  H = [ H , sample_features_train ] ;
endif

H( isnan( H ) ) = 0 ; ## avoids any 'blowups' due to nans in H

## do the regularized least squares for concatenated hidden layer output
## and the original, possibly scaled, input sample_features
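## ( the two branches below are the primal and dual forms of the same ridge solution, switched
##   on problem size so that the smaller of the two possible matrices is the one inverted )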
if ( hidden_layer_size < Nsample )
 beta = ( eye( size( H , 2 ) ) / lambda + H' * H ) \ H' * sample_targets_train ;
else
 beta = H' * ( ( eye( size( H , 1 ) ) / lambda + H * H' ) \ sample_targets_train ) ; 
endif

############# now the test on test data ####################################
Bias_test = repmat( Bias , numel( sample_targets( test_ix ) ) , 1 ) ;
H_test = sample_features( test_ix , : ) * Weight + Bias_test ;

if ( scale_mode == 1 )
  ## scale the features for all neurons
  H_test = H_test .* k + b ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  nSamtest = size( H_test , 1 ) ; 
  kt = repmat( k , nSamtest , 1 ) ;
  bt = repmat( b , nSamtest , 1 ) ;
  H_test = H_test .* kt + bt ;
endif

## actual activation, the radial basis function
H_test = exp( -abs( H_test ) ) ;

if ( if_output_bias == 1 )
  ## we will use an output bias
  H_test = [ H_test , ones( numel( sample_targets( test_ix ) ) , 1 ) ] ; 
endif

## the direct link scaling options, concatenate hidden layer and sample_features_train 
if ( scale_mode == 1 )
  ## scale the features for all neurons 
  testX_temp = sample_features( test_ix , : ) .* k + b ;
  H_test = [ H_test , testX_temp ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  nSamtest = size( H_test , 1 ) ; 
  kt = repmat( ktr , nSamtest , 1 ) ;
  bt = repmat( btr , nSamtest , 1 ) ;
  testX_temp = sample_features( test_ix , : ) .* kt + bt ;   
  H_test = [ H_test , testX_temp ] ; 
else
  H_test = [ H_test , sample_features( test_ix , : ) ] ; 
endif

H_test( isnan( H_test ) ) = 0 ; ## avoids any 'blowups' due to nans in H_test

## get the test predicted target output
test_targets = H_test * beta ;

##Y_temp = zeros( Nsample , 1 ) ;
##% decode the target output
##for i = 1 : Nsample
##    [ maxvalue , idx ] = max( sample_targets_temp( i , : ) ) ;
##    Y_temp( i ) = U_sample_targets( idx ) ;
##endfor

############################################################################

## the final logistic output
final_output = 1.0 ./ ( 1.0 .+ exp( -test_targets ) ) ;

## get the Brier_score
## https://en.wikipedia.org/wiki/Brier_score
all_brier_values( jj ) = mean( ( final_output .- sample_targets( test_ix ) ) .^ 2 ) ;

rand( 'state' ) ; randn( 'state' ) ; ## reset rng

endfor ## end of jj loop

J = mean( all_brier_values ) ;

endfunction

## Various measures of goodness
## https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models
## https://www.fharrell.com/post/classification/
## https://stats.stackexchange.com/questions/433628/what-is-a-reliable-measure-of-accuracy-for-logistic-regression
## https://www.jstatsoft.org/article/view/v090i12
## https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
## https://stats.stackexchange.com/questions/319666/aic-with-test-data-is-it-possible
## https://www.learningmachines101.com/lm101-076-how-to-choose-the-best-model-using-aic-or-gaic/
## https://stackoverflow.com/questions/48185090/how-to-get-the-log-likelihood-for-a-logistic-regression-model-in-sklearn
## https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
## https://stats.stackexchange.com/questions/163221/whats-the-measure-to-assess-the-binary-classification-accuracy-for-imbalanced-d
## https://stats.stackexchange.com/questions/168929/logistic-regression-is-predicting-all-1-and-no-0
## https://stats.stackexchange.com/questions/435307/multiple-linear-regression-lse-when-one-of-parameter-is-known

These functions stand on the shoulders of the above and hard-code direct links, radial basis function activation and ridge regression, with the number of neurons in the hidden layer, lambda for the ridge regression, the different scaling options and the inclusion or not of an output bias as the optimisable parameters. The minimisation objective of this first function is the Brier score.
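For anyone wanting to call this objective outside of the BayesOpt wrapper, a minimal sketch with made-up data and an arbitrary candidate x would be something like,
global sample_features sample_targets ;
sample_features = randn( 1000 , 9 ) ; ## made-up feature matrix
sample_targets = double( rand( 1000 , 1 ) > 0.8 ) ; ## made-up, imbalanced 0/1 targets
## x = [ hidden neurons , randomisation type , scale mode , scale , output bias , lambda ]
x = [ 10 , 1 , 3 , 0.9 , 1 , 0.1 ] ;
J = rvfl_training_of_cyclic_embedding_with_cv( x ) ## mean Brier score over the 25 random train/test shuffles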

This second function is slightly different in that the Akaike Information Criterion (AIC) is the minimisation objective, and there is the option to use the Netlab generalised linear model function to solve for the hidden to output weights (comment out the relevant code as necessary).
## Copyright (C) 2019 dekalog
## 
## This program is free software: you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## 
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## 
## You should have received a copy of the GNU General Public License
## along with this program.  If not, see
## <https://www.gnu.org/licenses/>.

## -*- texinfo -*- 
## @deftypefn {} {@var{J} =} rvfl_training_of_cyclic_embedding (@var{x})
##
## Function for Bayesian training of RVFL networks with fixed parameters of:
##
##    direct links,
##    radial basis function activation,
##    ridge regression for regularized least squares,
##
## and optimisable parameters of:
##
##    number of neurons in hidden layer,
##    lambda for the least squares regression,
##    scaling of hidden layer inputs,
##    with or without an output bias.
##
## The input X is a vector of 4 values to be optimised by the BayesOpt library
## function 'bayesoptcont.'
## The output J is the AIC value for the tested model.
## @seealso{}
## @end deftypefn

## Author: dekalog 
## Created: 2019-11-04

function J = rvfl_training_of_cyclic_embedding ( x )
global sample_features ; global sample_targets ;
epsilon = 1e-15 ; ## to ensure log() does not give out a nan

## check input x
if ( numel( x ) != 4 )
  error( 'The input vector x must be of length 4.' ) ;
endif

## get the parameters from input x
hidden_layer_size = floor( x( 1 ) ) ;    ## number of neurons in hidden layer
randomisation_type = floor( x( 2 ) ) ;   ## 1 == uniform, 2 == Gaussian
scale_mode = floor( x( 3 ) ) ;           ## 1 will scale the features for all neurons, 2 will scale the features for each hidden
##                                          neuron separately, 3 will scale the range of the randomization for uniform distribution
scale = x( 4 ) ;                         ## Linearly scale the random features before feeding into the nonlinear activation function. 
##                                          In this implementation, we consider the threshold which leads to 0.99 of the maximum/minimum 
##                                          value of the activation function as the saturating threshold.
##                                          scale = 0.9 means all the random features will be linearly scaled
##                                          into 0.9 * [ lower_saturating_threshold , upper_saturating_threshold ].
##if_output_bias = floor( x( 5 ) + 0.5 ) ; ## Use output bias, or not? 1 == yes , 0 == no.
##lambda = x( 6 ) ;                        ## the regularization coefficient lambda 

##length_jj_loop = 25 ;
##all_aic_values = zeros( length_jj_loop , 1 ) ;  

rand( 'seed' , 0 ) ;
randn( 'seed' , 0 ) ;
##U_sample_targets = unique( sample_targets ) ;
##nclass = numel( U_sample_targets ) ;

##sample_targets_temp = zeros( numel( sample_targets ) , nclass ) ; 
##
#### get the 0 - 1 one hot coding for the target,  
##for i = 1 : nclass
##  idx = sample_targets == U_sample_targets( i ) ;
##  sample_targets_temp( idx , i ) = 1 ;
##endfor

sample_targets_temp = sample_targets ;

[ Nsample , Nfea ] = size( sample_features ) ;

######### get type of randomisation from input x #################
if ( randomisation_type == 1 ) ## uniform randomisation 
  
  if ( scale_mode == 3 ) ## range scaled for uniform randomisation
     Weight = scale * ( rand( Nfea , hidden_layer_size ) * 2 - 1 ) ; ## scaled uniform random input weights to hidden layer
     Bias = scale * rand( 1 , hidden_layer_size ) ;                  ## scaled random bias weights to hidden layer
  else
     Weight = rand( Nfea , hidden_layer_size ) * 2 - 1 ; ## unscaled random input weights to hidden layer
     Bias = rand( 1 , hidden_layer_size ) ;              ## unscaled random bias weights to hidden layer
  endif
    
elseif ( randomisation_type == 2 ) ## gaussian randomisation
  Weight = randn( Nfea , hidden_layer_size ) ; ## gaussian random input weights to hidden layer
  Bias = randn( 1 , hidden_layer_size ) ;      ## gaussian random bias weights to hidden layer
else
  error( 'only Gaussian and Uniform are supported' )
endif
############################################################################

Bias_train = repmat( Bias , Nsample , 1 ) ;
H = sample_features * Weight + Bias_train ;

k_parameters = numel( Weight ) + numel( Bias ) ; ## count the random input weights and hidden biases once each

## Activation Function    
Saturating_threshold = [ -2.1 , 2.1 ] ;
Saturating_threshold_activate = [ 0 , 1 ] ;

if ( scale_mode == 1 )
  ## scale the features for all neurons 
  [ H , k , b ] = Scale_feature( H , Saturating_threshold , scale ) ;
elseif ( scale_mode == 2 ) 
  ## else scale the features for each hidden neuron separately
  [ H , k , b ] = Scale_feature_separately( H , Saturating_threshold , scale ) ;
endif
  
## actual activation, the radial basis function
H = exp( -abs( H ) ) ;

## glm training always applies a bias, so comment out if training with netlab glm
##if ( if_output_bias == 1 )
##  ## we will use an output bias
##  H = [ H , ones( Nsample , 1 ) ] ; 
##endif

## the direct link scaling options, concatenate hidden layer and sample_features 
if ( scale_mode == 1 )
  ## scale the features for all neurons
  sample_features_temp = sample_features .* k + b ;
  H = [ H , sample_features_temp ] ;
elseif ( scale_mode == 2 )
  ## else scale the features for each hidden neuron separately
  [ sample_features_temp , ktr , btr ] = Scale_feature_separately( sample_features , Saturating_threshold_activate , scale ) ;
  H = [ H , sample_features_temp ] ;
else
  H = [ H , sample_features ] ;
endif

H( isnan( H ) ) = 0 ; ## avoids any 'blowups' due to nans in H

############ THE ORIGINAL REGULARISED LEAST SQUARES CODE ###################
## do the regularized least squares for concatenated hidden layer output            
## and the original, possibly scaled, input sample_features                         
##if ( hidden_layer_size < Nsample )                                                  
## beta = ( eye( size( H , 2 ) ) / lambda + H' * H ) \ H' * sample_targets_temp ;     
##else                                                                                
## beta = H' * ( ( eye( size( H , 1 ) ) / lambda + H * H' ) \ sample_targets_temp ) ; 
##endif                                                                               
############################################################################

##k_parameters = k_parameters + numel( beta ) ;

## get the model predicted target output
##sample_targets_temp = H * beta ;

## the final logistic output
##final_output = 1.0 ./ ( 1.0 .+ exp( -sample_targets_temp ) ) ;

############ REPLACED BY GLM TRAINING USING NETLAB #########################
net = glm( size( H , 2 ) , 1 , 'logistic' ) ; ## Create a generalized linear model structure.
options = foptions ; ## Set default parameters for optimisation routines, for compatibility with MATLAB's foptions()
options( 1 ) = -1 ;   ## change default value
## OPTIONS(1) is set to 1 to display error values during training. If
## OPTIONS(1) is set to 0, then only warning messages are displayed.  If
## OPTIONS(1) is -1, then nothing is displayed.
options( 14 ) = 5 ; ## change default value
## OPTIONS(14) is the maximum number of iterations for the IRLS
## algorithm;  default 100.
net = glmtrain( net , options , H , sample_targets ) ;

k_parameters = k_parameters + net.nwts ;

## get output of trained glm model
final_output = glmfwd( net , H ) ;
############################################################################

##Y_temp = zeros( Nsample , 1 ) ;
##% decode the target output
##for i = 1 : Nsample
##    [ maxvalue , idx ] = max( sample_targets_temp( i , : ) ) ;
##    Y_temp( i ) = U_sample_targets( idx ) ;
##endfor

############################################################################

##  https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/
##
##  likelihood = yhat * y + (1 - yhat) * (1 - y)
##
##  We can update the likelihood function using the log to transform it into a log-likelihood function:
##
##      log-likelihood = log(yhat) * y + log(1 - yhat) * (1 - y)

##  Finally, we can sum the likelihood function across all examples in the dataset to maximize the likelihood:
##
##      maximize sum i to n log(yhat_i) * y_i + log(1 - yhat_i) * (1 - y_i)

log_likelihood = sum( log( final_output .+ epsilon ) .* sample_targets + log( 1 .- final_output .+ epsilon ) .* ( 1 .- sample_targets ) ) ;

## get Akaike Information criteria
J = 2 * k_parameters - 2 * log_likelihood ;

## get the Brier_score
## https://en.wikipedia.org/wiki/Brier_score
##J = mean( ( final_output .- sample_targets_temp ) .^ 2 ) ;

rand( 'state' ) ; randn( 'state' ) ; ## reset rng

endfunction

## Various measures of goodness
## https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models
## https://www.fharrell.com/post/classification/
## https://stats.stackexchange.com/questions/433628/what-is-a-reliable-measure-of-accuracy-for-logistic-regression
## https://www.jstatsoft.org/article/view/v090i12
## https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
## https://stats.stackexchange.com/questions/319666/aic-with-test-data-is-it-possible
## https://www.learningmachines101.com/lm101-076-how-to-choose-the-best-model-using-aic-or-gaic/
## https://stackoverflow.com/questions/48185090/how-to-get-the-log-likelihood-for-a-logistic-regression-model-in-sklearn
## https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
## https://stats.stackexchange.com/questions/163221/whats-the-measure-to-assess-the-binary-classification-accuracy-for-imbalanced-d
## https://stats.stackexchange.com/questions/168929/logistic-regression-is-predicting-all-1-and-no-0
## https://stats.stackexchange.com/questions/435307/multiple-linear-regression-lse-when-one-of-parameter-is-known
Both of these functions are working code; they are heavily commented but perhaps not very polished.
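For completeness, the second, AIC based objective is called in the same way but with a 4 element input vector and with the Netlab functions glm, glmtrain, glmfwd and foptions on the path; with the same made-up globals as in the sketch further up the page,
x = [ 10 , 1 , 3 , 0.9 ] ; ## hidden neurons , randomisation type , scale mode , scale
J_aic = rvfl_training_of_cyclic_embedding( x ) ## the AIC value, 2 * k_parameters - 2 * log_likelihood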

As I write this post I have various tests running in the background and will report on the results in due course.

Friday 1 November 2019

Preliminary Results from Weight Agnostic Training

Following on from my last post, below is a selection of the typical output from the BayesOpt Library minimisation:
    3    3    2    2    2    8   99   22   30    1
    3    3    2    3    2   39    9   25   25    1
    2    2    3    2    2   60   43   83   54    3
    2    1    2    2    2    2    0   90   96   43
    3    2    3    2    2    2    2   43   33    1
    2    3    2    3    2    2    0   62   98   21
    2    2    2    2    2   18   43   49    2    2
    2    3    2    4    1    2    0   23    0    0
    2    2    1    2    3    2    0   24   63   65
    3    2    2    2    3    5   92   49    1    0
    2    3    2    1    1    7   84   22   17    1
    3    2    4    1    1   46    1    0   99    7
    2    2    3    2    2    2    0   74   82   50
    3    3    2    2    2   45   14   81   23    2
    2    3    3    2    2    2    0   99   79    4
    2    2    2    2    2    2    0    0   68    0
    3    3    3    2    2   67   17   37   84    1
    3    2    3    2    2   24   39   56   55    1
    3    3    4    3    2    2   30   62   67    1
    2    2    2    2    2    2    1    0    4    0
    2    2    2    2    2    2    9    8   45    1
    2    3    3    2    2   48    1   18   28    1
    2    3    3    2    2    2    0   34   42   18
    2    2    2    3    2    2    0   70   81   10
    2    2    3    3    2    2    0   85   23   11
where each row is a separate run's result, the first five columns show the type of activation function per layer and the last five columns show the number of neurons per layer ( see the function code in my last post for details ).

Some quick takeaways from this are:
  1. the sigmoid activation function ( bounded 0 to 1 ) is not favoured, with either the Tanh or "Lecun" sigmoid ( see section 4.4 of this paper ) being preferred
  2. 40% of the network architectures are single hidden layer with just 2 neurons in the hidden layer 
  3. 8% have only two hidden layers with the second hidden layer having just one neuron
I would say these preliminary results suggest that a deep architecture is not necessary for the features/targets being tested, and that the standard sigmoid/logistic function should obviously be avoided.

As is my wont whilst waiting for lengthy computer tests to complete, I have also been browsing online, motivated by the early results above, and discovered Random Vector Functional Link Networks, which seem to be a precursor to Extreme Learning Machines. However, there appears to be some controversy about whether or not Extreme Learning Machines are simply plagiarism of earlier ideas such as RVFL networks.

Readers may remember that I have used ELMs before with the BayesOpt Library ( see my post here ), and now that the above results point towards the use of shallow networks I intend to replicate the above but using RVFL networks. This will be the subject of my next post.