## Friday, 1 November 2019

### Preliminary Results from Weight Agnostic Training

Following on from my last post, below is a selection of typical output from the Bayesopt Library minimisation:
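
| Act. L1 | Act. L2 | Act. L3 | Act. L4 | Act. L5 | Neurons L1 | Neurons L2 | Neurons L3 | Neurons L4 | Neurons L5 |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | 2 | 2 | 2 | 8 | 99 | 22 | 30 | 1 |
| 3 | 3 | 2 | 3 | 2 | 39 | 9 | 25 | 25 | 1 |
| 2 | 2 | 3 | 2 | 2 | 60 | 43 | 83 | 54 | 3 |
| 2 | 1 | 2 | 2 | 2 | 2 | 0 | 90 | 96 | 43 |
| 3 | 2 | 3 | 2 | 2 | 2 | 2 | 43 | 33 | 1 |
| 2 | 3 | 2 | 3 | 2 | 2 | 0 | 62 | 98 | 21 |
| 2 | 2 | 2 | 2 | 2 | 18 | 43 | 49 | 2 | 2 |
| 2 | 3 | 2 | 4 | 1 | 2 | 0 | 23 | 0 | 0 |
| 2 | 2 | 1 | 2 | 3 | 2 | 0 | 24 | 63 | 65 |
| 3 | 2 | 2 | 2 | 3 | 5 | 92 | 49 | 1 | 0 |
| 2 | 3 | 2 | 1 | 1 | 7 | 84 | 22 | 17 | 1 |
| 3 | 2 | 4 | 1 | 1 | 46 | 1 | 0 | 99 | 7 |
| 2 | 2 | 3 | 2 | 2 | 2 | 0 | 74 | 82 | 50 |
| 3 | 3 | 2 | 2 | 2 | 45 | 14 | 81 | 23 | 2 |
| 2 | 3 | 3 | 2 | 2 | 2 | 0 | 99 | 79 | 4 |
| 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 68 | 0 |
| 3 | 3 | 3 | 2 | 2 | 67 | 17 | 37 | 84 | 1 |
| 3 | 2 | 3 | 2 | 2 | 24 | 39 | 56 | 55 | 1 |
| 3 | 3 | 4 | 3 | 2 | 2 | 30 | 62 | 67 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 4 | 0 |
| 2 | 2 | 2 | 2 | 2 | 2 | 9 | 8 | 45 | 1 |
| 2 | 3 | 3 | 2 | 2 | 48 | 1 | 18 | 28 | 1 |
| 2 | 3 | 3 | 2 | 2 | 2 | 0 | 34 | 42 | 18 |
| 2 | 2 | 2 | 3 | 2 | 2 | 0 | 70 | 81 | 10 |
| 2 | 2 | 3 | 3 | 2 | 2 | 0 | 85 | 23 | 11 |
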
where each row is the result of a separate run, the first five columns give the type of activation function for each layer and the last five columns give the number of neurons in each layer (see the function code in my last post for details).

Some quick takeaways from this are:
1. the sigmoid activation function (bounded 0 to 1) is not favoured, with either the Tanh or "LeCun" sigmoid (see section 4.4 of this paper) being preferred
2. 40% of the network architectures have a single hidden layer with just 2 neurons in that layer
3. 8% have only two hidden layers, with the second hidden layer having just one neuron

I would say these preliminary results suggest that a deep architecture is not necessary for the features/targets being tested and that the standard sigmoid/logistic function should be avoided.
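
For anyone wanting to check the 40% and 8% figures, below is a minimal Python sketch that recomputes them from the neuron columns of the output above. It rests on one assumption that is mine and not stated in this post: that a zero in a neuron column ends the network, so any values in later columns are ignored. The names `NEURONS`, `active_layers` and `summarise` are made up for this illustration only.

```python
# Neurons-per-layer columns (the last five columns) of the 25 runs listed above.
NEURONS = [
    [8, 99, 22, 30, 1], [39, 9, 25, 25, 1], [60, 43, 83, 54, 3],
    [2, 0, 90, 96, 43], [2, 2, 43, 33, 1], [2, 0, 62, 98, 21],
    [18, 43, 49, 2, 2], [2, 0, 23, 0, 0], [2, 0, 24, 63, 65],
    [5, 92, 49, 1, 0], [7, 84, 22, 17, 1], [46, 1, 0, 99, 7],
    [2, 0, 74, 82, 50], [45, 14, 81, 23, 2], [2, 0, 99, 79, 4],
    [2, 0, 0, 68, 0], [67, 17, 37, 84, 1], [24, 39, 56, 55, 1],
    [2, 30, 62, 67, 1], [2, 1, 0, 4, 0], [2, 9, 8, 45, 1],
    [48, 1, 18, 28, 1], [2, 0, 34, 42, 18], [2, 0, 70, 81, 10],
    [2, 0, 85, 23, 11],
]


def active_layers(row):
    """Return the leading non-zero layer sizes.

    ASSUMPTION: the first zero terminates the network, so later
    columns in that row are treated as unused.
    """
    layers = []
    for n in row:
        if n == 0:
            break
        layers.append(n)
    return layers


def summarise(rows):
    """Percentages of (a) single hidden layer nets with 2 neurons and
    (b) two hidden layer nets whose second layer has 1 neuron."""
    archs = [active_layers(r) for r in rows]
    single_two = sum(1 for a in archs if a == [2])
    two_layer_one = sum(1 for a in archs if len(a) == 2 and a[1] == 1)
    total = len(rows)
    return 100.0 * single_two / total, 100.0 * two_layer_one / total


if __name__ == "__main__":
    pct_single, pct_two = summarise(NEURONS)
    print(f"single hidden layer, 2 neurons: {pct_single:.0f}%")
    print(f"two hidden layers, 1 neuron in second: {pct_two:.0f}%")
```

Under that zero-terminates assumption the counts are 10 of 25 and 2 of 25 runs, which matches the percentages quoted in the list.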

As is my wont whilst waiting for lengthy computer tests to complete, I have also been browsing online, motivated by the early results above, and discovered Random Vector Functional Link (RVFL) networks, which seem to be a precursor to Extreme Learning Machines (ELMs). However, there appears to be some controversy about whether or not ELMs are just plagiarism of earlier ideas, such as RVFL networks.
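
For readers unfamiliar with RVFL networks, here is a rough Python/NumPy sketch of the idea as it is usually described: a single hidden layer whose weights are drawn at random and never trained, direct links from the inputs to the output, and output weights found by regularised least squares. Everything in it is an illustrative assumption on my part (the function names `rvfl_fit` and `rvfl_predict`, the tanh activation, the uniform weight range and the ridge value); it is not the code I will be using in my tests.

```python
import numpy as np


def rvfl_fit(X, y, n_hidden=10, ridge=1e-3, seed=0):
    """Fit a minimal RVFL-style network (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # Random hidden weights and biases: fixed after drawing, never trained.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W + b)
    # Direct links: the raw inputs sit alongside the hidden features,
    # plus a column of ones for the output bias.
    D = np.hstack([X, H, np.ones((X.shape[0], 1))])
    # Ridge-regularised least squares for the output weights.
    beta = np.linalg.solve(D.T @ D + ridge * np.eye(D.shape[1]), D.T @ y)
    return W, b, beta


def rvfl_predict(X, W, b, beta):
    """Predict with the weights returned by rvfl_fit."""
    H = np.tanh(X @ W + b)
    D = np.hstack([X, H, np.ones((X.shape[0], 1))])
    return D @ beta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([1.0, -2.0]) + 0.5  # toy linear target
    W, b, beta = rvfl_fit(X, y, n_hidden=5)
    mse = np.mean((rvfl_predict(X, W, b, beta) - y) ** 2)
    print(f"training MSE: {mse:.2e}")
```

Dropping the `X` block from `D` removes the direct links and leaves what is essentially the basic ELM formulation, which is one reason the two are so often compared.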

Readers may remember that I have used ELMs before with the Bayesopt library (see my post here) and, now that the above results point towards the use of shallow networks, I intend to replicate the above using RVFL networks. This will be the subject of my next post.