Dekalog Blog: Fast Fourier Transform

Showing posts with label Fast Fourier Transform. Show all posts

Tuesday, 23 September 2014

High Resolution Tools for Spectral Analysis - Update

Following on from my initial enthusiasm for the code on the High Resolution Tools for Spectral Analysis page, I have say that I have been unable to get the code performing as I would like it for my intended application to price time series.

My original intent was to use the zero crossing period estimation function, the subject of my last few posts, to get a rough idea of the dominant cycle period and then use the most recent data in a rolling window of this length as input to the high resolution code. This approach, however, ran into problems.

Firstly, windows of just the dominant cycle length (approximately 10 to 30 data points only) would lead to all sorts of errors being thrown from the toolkit functions as well as core Octave functions, such as divide by zero warnings and cryptic error messages that even now I don't understand. My best guess here is that the amount of data available in such short windows is simply insufficient for the algorithm to work, in much the same way as the Fast Fourier transform may fail to work if given too little data that is not a power of 2 in length. It might be possible to substantially rewrite the relevant functions, but my understanding of the algorithm and the inner workings of Octave means this is well beyond my pay grade.

My second approach was to simply extend the amount of data available by using the Octave repmat function on the windowed data so that all the above errors disappeared. This had very hit and miss results - sometimes they were very accurate, other times just fair to middling, and on occasion just way off target. I suspect here that the problem is the introduction of signal artifacts via the use of the repmat function, which results in Aliasing of the underlying signal.

As a result I shall not continue with investigating this toolbox for now. Although I only investigated the use of the me.m function (there are other functions available) I feel that my time at the moment can be used more productively.

Monday, 9 September 2013

Update on FFT Smoother

Having tested the FFT smoother on real price data and having been disappointed, I have decided not to pursue this any further for the moment. I think the problem is that for such short input periods ( 10 to 50 bars ) the FFT does not have enough frequency resolution and that by eliminating most of the frequencies from this short period I'm throwing out the baby with the bath water.

Wednesday, 4 September 2013

FFT Smoother Function Test

Following on from my last post I now want to show the FFT smoother function results on my usual idealised market data. The Octave script code for this test is given in the code box below.

clear all

% create the "price" vector
period = input( 'Enter period between 10 & 50 for underlying sine: ' ) ;
underlying_sine = sinewave( 500 , period ) ; % create underlying sinewave function with amplitude 2
trend_mult = input( 'Enter value for trend (a number between -4 & 4 incl.) to 4 decimal places: ' ) ;
trend = ( trend_mult / period ) .* (0:1:499) ;
noise_mult = input( 'Enter value for noise multiplier (a number between 0 & 1 incl.) to 2 decimal places: ' ) ; 
noise = noise_mult * randn(500,1)' ;
smooth_price = underlying_sine .+ trend ;
price = underlying_sine .+ trend .+ noise ;
super_smooth = super_smoother( price ) ;

smooth = price ;
mean_value = price ;
mean_smooth = price ;

for ii = period+1 : length( price )

mean_value(ii) = mean( price( ii-period : ii ) ) ;
fft_input = price( ii-(period-1) : ii ) .- mean_value(ii) ;
N = length( fft_input ) ;

data_in_freq_domain = fft( fft_input ) ;

if mod( N , 2 ) == 0 % N even
% The DC component is unique and should not be altered
% adjust value of element N / 2 to compensate for it being the sum of frequencies
data_in_freq_domain( N / 2 ) = abs( data_in_freq_domain( N / 2 ) ) / 2 ;

[ max1 , ix1 ] = max( abs( data_in_freq_domain( 2 : N / 2 ) ) ) ; % get the index of the first max value
ix1 = ix1 + 1 ;

[ max2 , ix2 ] = max( abs( data_in_freq_domain( N / 2 + 1 : end ) ) ) ; % get index of the second max value
ix2 = N / 2 + ix2 ;

non_zero_values_ix = [ 1 ix1 ix2 ] ; % indices for DC component, +ve freq max & -ve freq max

else % N odd
% The DC component is unique and should not be altered

[ max1 , ix1 ] = max( abs( data_in_freq_domain( 2 : ( N + 1 ) / 2 ) ) ) ; % get the index of the first max value
ix1 = ix1 + 1 ;

[ max2 , ix2 ] = max( abs( data_in_freq_domain( ( N + 1 ) / 2 + 1 : end ) ) ) ; % get index of the second max value
ix2 = ( N + 1 ) / 2 + ix2 ;

non_zero_values_ix = [ 1 ix1 ix2 ] ; % indices for DC component, +ve freq max & -ve freq max

end

% extract the constant term and frequencies of interest
non_zero_values = data_in_freq_domain( non_zero_values_ix ) ;

% now set all values to zero
data_in_freq_domain( 1 : end ) = 0.0 ;

% replace the non_zero_values
data_in_freq_domain( non_zero_values_ix ) = non_zero_values ;

inv = ifft( data_in_freq_domain ) ; % reverse the FFT

recovered_sig = real( inv ) ; % recover the real component of ifft for plotting

smooth( ii ) = recovered_sig( end ) + mean_value( ii ) ;

mean_smooth( ii ) = mean( smooth( ii-period : ii ) ) ;

end

smooth = smooth .+ ( mean_value .- mean_smooth ) ;

plot( smooth_price , 'r' ,price , 'c' , smooth , 'm' , super_smooth , 'g' )
legend( 'smooth price' , 'price' , 'fft smooth' , 'super smooth' )

This script prompts for terminal input for the values for an underlying sine wave period, a trend multiplier and a noise multiplier component to construct the "price" time series. Here is a plot of output with the settings

period = 20
trend multiplier = 1.3333
noise multiplier = 0.0

The cyan line is the "price," the magenta is the FFT smoother and the green is the "Super Smoother" filter shown for comparative purposes. As can be seen, after the "burn in" period for the FFT smoother calculations to settle down, the smoother is a perfect recovery of the price time series. However, this is without noise. With a noise multiplier value of 0.25 a typical plot is

where the underlying smooth "true" signal can now be seen in red. I think the performance of the magenta FFT smoother in recovering this "true" signal speaks for itself. My next post will show its performance on real price data.

Monday, 2 September 2013

Filtering in the Frequency Domain with Octave's FFT Function

Despite my previous post in which I talked about the "Super Smoother" filter, it was bugging me that I couldn't get my attempts at FFT filtering to work. However, thanks in large part to my recent discovery of two online tutorials here and here I think I've finally cracked it. Taking these two tutorials as a template I have written an interactive Octave script, which is shown in the code box below.

clear all

fprintf( '\nThis is an interactive Octave tutorial to explain the basic principles\n' )
fprintf( 'of the Fast Fourier transform and how it can be used to smooth a time series\n' )
fprintf( 'in the frequency domain.\n\n' )
fprintf( 'A Fourier transform approximates any continuous function as the sum of periodic\n' )
fprintf( 'functions (sines and cosines). The FFT does the same thing for discrete signals,\n' )
fprintf( 'a series of data points rather than a continuously defined function.\n' ) 
fprintf( 'The FFT lets you identify periodic components in your discrete signal.\n' )
fprintf( 'You might need to identify a periodic signal buried under random noise,\n' )
fprintf( 'or analyze a signal with several different periodic underlying sources.\n' )
fprintf( 'Octave includes a built-in implementation of FFT to help you do this.\n\n' )

fprintf( 'First, take this 20 period sine wave. Since we know the period, the\n' )
fprintf( 'frequency = 1 / period = 1 / 20 = 0.05 Hz.\n\n' )

% set the period
period = 20 ; 
underlying_sine = 2.0 .* sinewave( period , period ) ; % create underlying sinewave function with amplitude 2

plot( underlying_sine , 'b' )
title( 'A Sine Wave of period 20 with an amplitude of 2' ) 
fprintf( 'After viewing chart, press enter to continue.\n\n' )
pause ;

fprintf( 'Now, let us put this sine wave through the Octave fft function thus:\n\n' )
fprintf( ' data_in_freq_domain = fft( data ) ;  and plot this.\n\n' )

N = length( underlying_sine ) ;

% If length of fft vector is even, then the magnitude of the fft will be symmetric, such that the 
% first ( 1 + length / 2 ) points are unique, and the rest are symmetrically redundant. The DC component 
% of output is fft( 1 ) and fft( 1 + length / 2 ) is the Nyquist frequency component of vector. 
% If length is odd, however, the Nyquist frequency component is not evaluated, and the number of unique 
% points is ( length + 1 ) / 2 . This can be generalized for both cases to ceil( ( length + 1 ) / 2 ) 

data_in_freq_domain = fft( underlying_sine ) ; % take the fft of our underlying_sine
stem( abs( data_in_freq_domain ) ) ; % use abs command to get the magnitude.
xlabel( 'Sample Number, i.e. position in the plotted Octave fft output vector' )
ylabel( 'Amplitude' )
title( 'Using the basic Octave fft command' )
grid
axis( [ 0 , period + 1 , 0 , max(abs( data_in_freq_domain )) + 0.1 ] )

fprintf( 'It can be seen that the x-axis gives us no information about the frequencies\n' )
fprintf( 'and the spectrum is not centered around zero. Changing the x-axis to frequency\n' )
fprintf( 'and plotting a centered frequency plot (see code in file) gives this plot.\n\n' )

fprintf( 'Press enter to view centered frequency plot.\n\n' )
pause ;

% Fs is the sampling rate

% this part of the code generates the frequency axis
if mod( N , 2 ) == 0
    k = -N/2 : N/2-1 ; % N even
else
    k = -(N-1)/2 : (N-1)/2 ; % N odd
end

T = N / 1 ; %  where the frequency sampling rate Fs in T = N / Fs ; Fs is 1.

frequencyRange = k / T ; % the frequency axis

data_in_freq_domain = fft( underlying_sine ) / N ; % apply the FFT & normalize the data
centered_data_in_freq_domain = fftshift( data_in_freq_domain ) ; % shifts the fft data so that it is centered

% remember to take the abs of data_in_freq_domain to get the magnitude!
stem( frequencyRange , abs( centered_data_in_freq_domain ) ) ;
xlabel( 'Freq (Hz)' )
ylabel( 'Amplitude' )
title( 'Using the centered FFT function with Amplitude normalisation' )
grid
axis( [ min(frequencyRange) - 0.1 , max(frequencyRange) + 0.1 , 0 , max(abs( centered_data_in_freq_domain )) + 0.1 ] )

fprintf( 'In this centered frequency chart it can be seen that the frequency\n' )
fprintf( 'with the greatest amplitude is 0.05 Hz, i.e. period = 1 / 0.05 = 20\n' )
fprintf( 'which we know to be the correct period of the underlying sine wave.\n' )
fprintf( 'The information within the centered frequency spectrum is entirely symmetric\n' )
fprintf( 'around zero and, in general, the positive side of the spectrum is used.\n\n' )
fprintf( 'Now, let us look at two sine waves together.\n' )

fprintf( 'Press enter to continue.\n\n' )
pause ;

% underlying_sine = 2.0 .* sinewave( period , period ) ; % create underlying sinewave function with amplitude 2
underlying_sine_2 = 0.75 .* sinewave( period , 7 ) ; % create underlying sinewave function with amplitude 0.75
composite_sine = underlying_sine .+ underlying_sine_2 ;

plot( underlying_sine , 'r' , underlying_sine_2 , 'g' , composite_sine , 'b' )
legend( 'Underlying sine period 20' , 'Underlying sine period 7' , 'Combined waveform' )
title( 'Two Sine Waves of periods 20 and 7, with Amplitudes of 2 and 0.75 respectively, combined' ) 
fprintf( 'After viewing chart, press enter to continue.\n\n' )
pause ;

data_in_freq_domain = fft( composite_sine ) ; % apply the FFT
centered_data_in_freq_domain = fftshift( data_in_freq_domain / N ) ; % normalises & shifts the fft data so that it is centered

% remember to take the abs of data_in_freq_domain to get the magnitude!
stem( frequencyRange , abs( centered_data_in_freq_domain ) ) ;
xlabel( 'Freq (Hz)' )
ylabel( 'Amplitude' )
title( 'Using the centered FFT function with Amplitude normalisation on 2 Sine Waves' )
grid
axis( [ min(frequencyRange) - 0.1 , max(frequencyRange) + 0.1 , 0 , max(abs( centered_data_in_freq_domain )) + 0.1 ] )

fprintf( 'x-axis is frequencyRange.\n' )
frequencyRange
fprintf( '\nComparing the centered frequency chart with the frequencyRange vector now shown it\n' )
fprintf( 'can be seen that, looking at the positive side only, the maximum amplitude\n' )
fprintf( 'frequency is for period = 1 / 0.05 = a 20 period sine wave, and the next greatest\n' )
fprintf( 'amplitude frequency is period = 1 / 0.15 = a 6.6667 period sine wave, which rounds up\n' )
fprintf( 'to a period of 7, both of which are known to be the "correct" periods in the\n' )
fprintf( 'combined waveform. However, we can also see some artifacts in this frequency plot\n' )
fprintf( 'due to "spectral leakage." Now, let us do some filtering and see the plot.\n\n' )
fprintf( 'Press enter to continue.\n\n' )
pause ;

% The actual frequency filtering will be performed on the basic FFT output, remembering that: 

% if the number of time domain samples, N, is even:

% element 1 = constant or DC amplitude
% elements 2 to N/2 = amplitudes of positive frequency components in increasing order
% elements N to N/2 = amplitudes of negative frequency components in decreasing order

% Note that element N/2 is the algebraic sum of the highest positive and highest
% negative frequencies and, for real time domain signals, the imaginary
% components cancel and N/2 is always real. The first sinusoidal component
% (fundamental) has it's positive component in element 2 and its negative
% component in element N.

% If the number of time domain samples, N, is odd:

% element 1 = constant or DC amplitude
% elements 2 to (N+1)/2 = amplitudes of positive frequency components in increasing order
% elements N to ((N+1)/2 + 1) = amplitudes of negative frequency components in decreasing order

% Note that for an odd number of samples there is no "middle" element and all
% the frequency domain amplitudes will, in general, be complex. 

if mod( N , 2 ) == 0 % N even
% The DC component is unique and should not be altered
% adjust value of element N / 2 to compensate for it being the sum of frequencies
data_in_freq_domain( N / 2 ) = abs( data_in_freq_domain( N / 2 ) ) / 2 ;

[ max1 , ix1 ] = max( abs( data_in_freq_domain( 2 : N / 2 ) ) ) ; % get the index of the first max value
ix1 = ix1 + 1 ;

[ max2 , ix2 ] = max( abs( data_in_freq_domain( N / 2 + 1 : end ) ) ) ; % get index of the second max value
ix2 = N / 2 + ix2 ;

non_zero_values_ix = [ 1 ix1 ix2 ] ; % indices for DC component, +ve freq max & -ve freq max

else % N odd
% The DC component is unique and should not be altered

[ max1 , ix1 ] = max( abs( data_in_freq_domain( 2 : ( N + 1 ) / 2 ) ) ) ; % get the index of the first max value
ix1 = ix1 + 1 ;

[ max2 , ix2 ] = max( abs( data_in_freq_domain( ( N + 1 ) / 2 + 1 : end ) ) ) ; % get index of the second max value
ix2 = ( N + 1 ) / 2 + ix2 ;

non_zero_values_ix = [ 1 ix1 ix2 ] ; % indices for DC component, +ve freq max & -ve freq max

end

% extract the constant term and frequencies of interest
non_zero_values = data_in_freq_domain( non_zero_values_ix ) ;

% now set all values to zero
data_in_freq_domain( 1 : end ) = 0.0 ;

% replace the non_zero_values
data_in_freq_domain( non_zero_values_ix ) = non_zero_values ;

composite_sine_inv = ifft( data_in_freq_domain ) ; % reverse the FFT

recovered_underlying_sine = real( composite_sine_inv ) ; % recover the real component of ifft for plotting

plot( underlying_sine , 'r' , composite_sine , 'g' , recovered_underlying_sine , 'c' )
legend( 'Underlying sine period 20' , 'Composite Sine' , 'Recovered underlying sine' )
title( 'Recovery of Underlying Dominant Sine Wave from Composite Data via FFT filtering' )

fprintf( 'Here we can see that the period 7 sine wave has been removed\n' )
fprintf( 'The trick to smoothing in the frequency domain is to eliminate\n' )
fprintf( '"unwanted frequencies" by setting them to zero in the frequency\n' )
fprintf( 'vector. For the purposes of illustration here the 7 period sine wave\n' )
fprintf( 'is considered noise. After doing this, we reverse the FFT and plot\n' )
fprintf( 'the results. The code that does all this can be seen in the code\n' )
fprintf( 'file. Now let us repeat this with noise.\n\n' )
fprintf( 'Press enter to continue.\n\n' )
pause ;

noise = randn( period , 1 )' * 0.25 ; % create random noise
noisy_composite = composite_sine .+ noise ;

plot( underlying_sine , 'r' , underlying_sine_2 , 'm' , noise, 'k' , noisy_composite , 'g' )
legend( 'Underlying sine period 20' , 'Underlying sine period 7' , 'random noise' , 'Combined waveform' )
title( 'Two Sine Waves of periods 20 and 7, with Amplitudes of 2 and 0.75 respectively, combined with random noise' ) 
fprintf( 'After viewing chart, press enter to continue.\n\n' )
pause ;

fprintf( 'Running the previous code again on this combined waveform with noise gives\n' )
fprintf( 'this centered plot.\n\n' )

data_in_freq_domain = fft( noisy_composite ) ; % apply the FFT
centered_data_in_freq_domain = fftshift( data_in_freq_domain / N ) ; % normalises & shifts the fft data so that it is centered

% remember to take the abs of data_in_freq_domain to get the magnitude!
stem( frequencyRange , abs( centered_data_in_freq_domain ) ) ;
xlabel( 'Freq (Hz)' )
ylabel( 'Amplitude' )
title( 'Using the centered FFT function with Amplitude normalisation on 2 Sine Waves plus random noise' )
grid
axis( [ min(frequencyRange) - 0.1 , max(frequencyRange) + 0.1 , 0 , max(abs( centered_data_in_freq_domain )) + 0.1 ] )

fprintf( 'Here we can see that the addition of the noise has introduced\n' )
fprintf( 'some unwanted frequencies to the frequency plot. The trick to\n' )
fprintf( 'smoothing in the frequency domain is to eliminate these "unwanted"\n' )
fprintf( 'frequencies by setting them to zero in the frequency vector that is\n' )
fprintf( 'shown in this plot. For the purposes of illustration here,\n' )
fprintf( 'all frequencies other than that for the 20 period sine wave are\n' )
fprintf( 'considered noise. After doing this, we reverse the FFT and plot\n' )
fprintf( 'the results. The code that does all this can be seen in the code\n' )
fprintf( 'file.\n\n' )
fprintf( 'Press enter to view the plotted result.\n\n' )
pause ;

% The actual frequency filtering will be performed on the basic FFT output, remembering that: 

% if the number of time domain samples, N, is even:

% element 1 = constant or DC amplitude
% elements 2 to N/2 = amplitudes of positive frequency components in increasing order
% elements N to N/2 = amplitudes of negative frequency components in decreasing order

% Note that element N/2 is the algebraic sum of the highest positive and highest
% negative frequencies and, for real time domain signals, the imaginary
% components cancel and N/2 is always real. The first sinusoidal component
% (fundamental) has it's positive component in element 2 and its negative
% component in element N.

% If the number of time domain samples, N, is odd:

% element 1 = constant or DC amplitude
% elements 2 to (N+1)/2 = amplitudes of positive frequency components in increasing order
% elements N to ((N+1)/2 + 1) = amplitudes of negative frequency components in decreasing order

% Note that for an odd number of samples there is no "middle" element and all
% the frequency domain amplitudes will, in general, be complex. 

if mod( N , 2 ) == 0 % N even
% The DC component is unique and should not be altered
% adjust value of element N / 2 to compensate for it being the sum of frequencies
data_in_freq_domain( N / 2 ) = abs( data_in_freq_domain( N / 2 ) ) / 2 ;

[ max1 , ix1 ] = max( abs( data_in_freq_domain( 2 : N / 2 ) ) ) ; % get the index of the first max value
ix1 = ix1 + 1 ;

[ max2 , ix2 ] = max( abs( data_in_freq_domain( N / 2 + 1 : end ) ) ) ; % get index of the second max value
ix2 = N / 2 + ix2 ;

non_zero_values_ix = [ 1 ix1 ix2 ] ; % indices for DC component, +ve freq max & -ve freq max

else % N odd
% The DC component is unique and should not be altered

[ max1 , ix1 ] = max( abs( data_in_freq_domain( 2 : ( N + 1 ) / 2 ) ) ) ; % get the index of the first max value
ix1 = ix1 + 1 ;

[ max2 , ix2 ] = max( abs( data_in_freq_domain( ( N + 1 ) / 2 + 1 : end ) ) ) ; % get index of the second max value
ix2 = ( N + 1 ) / 2 + ix2 ;

non_zero_values_ix = [ 1 ix1 ix2 ] ; % indices for DC component, +ve freq max & -ve freq max

end

% extract the constant term and frequencies of interest
non_zero_values = data_in_freq_domain( non_zero_values_ix ) ;

% now set all values to zero
data_in_freq_domain( 1 : end ) = 0.0 ;

% replace the non_zero_values
data_in_freq_domain( non_zero_values_ix ) = non_zero_values ;

noisy_composite_inv = ifft( data_in_freq_domain ) ; % reverse the FFT

recovered_underlying_sine = real( noisy_composite_inv ) ; % recover the real component of ifft for plotting

plot( underlying_sine , 'r' , noisy_composite , 'g' , recovered_underlying_sine , 'c' )
legend( 'Underlying sine period 20' , 'Noisy Composite' , 'Recovered underlying sine' )
title( 'Recovery of Underlying Dominant Sine Wave from Noisy Composite Data via FFT filtering' )

fprintf( 'As can be seen, the underlying sine wave of period 20 is almost completely\n' )
fprintf( 'recovered using frequency domain filtering. The fact that it is not perfectly\n' )
fprintf( 'recovered is due to artifacts (spectral leakage) intoduced into the frequency spectrum\n' )
fprintf( 'because the individual noisy composite signal components do not begin and end\n' )
fprintf( 'with zero values, or there are not integral numbers of these components present\n' )
fprintf( 'in the sampled, noisy composite signal.\n\n' )
fprintf( 'So there we have it - a basic introduction to using the FFT for signal smoothing\n' )
fprintf( 'in the frequency domain.\n\n' )

This script is liberally commented and has copious output to the terminal via fprintf statements and produces a series of plots of time series and their frequency chart plots. I would encourage readers to run this script for themselves and, if so inclined, give feedback.

The last chart output of the script will look something like this.

The red sinusoid represents an underlying periodic signal of period 20 and the green time series is a composite signal of the red line plus a 7 period sinusoid of less amplitude and random noise, these not being shown in the plot. The blue line is the resultant output of FFT filtering in the frequency domain of the green input. I must say I'm pretty pleased with this result and will be doing more work in this area shortly. Stay tuned!

Monday, 15 July 2013

My NN Input Tweak

As I said in my previous post, I wanted to come up with a "principled" method of transforming market data into inputs more suitable for my classification NN, and this post was going to be about how I have investigated various bandpass filters, Fast Fourier transform filters set to eliminate cycles less than the dominant cycle in the data, FIR filters, and a form of the Hilbert-Huang transform. However, while looking to link to another site I happened upon a so-called "Super smoother" filter, but more on this later.

Firstly, I would like to explain how I added realistic "noise" to my normal, idealised market data. Some time ago I looked at, and then abandoned, the idea of using an AFIRMA smoothing algorithm, an example of which is shown below:

The cyan line is the actual data to be smoothed and the red line is an 11 bar, Blackman window function AFIRMA smooth of this data. If one imagines this red smooth to be one realisation of my idealised market data, i.e. the real underlying signal, then the blue data is the signal plus noise. I have taken the measure of the amount of noise present to be the difference between the red and blue lines normalised by the standard deviation of Bollinger bands applied to the red AFIRMA smooth. I did this over all the historical data I have and combined it all into one noise distribution file. When creating my idealised data it is then a simple matter to apply a Bollinger band to it and add un-normalised random samples from this file as noise to create a realistically noisy, idealised time series for testing purposes.

Now for the "super smoother" filter. Details are freely available in the white paper available from here and my Octave C++ .oct file implementation is given in the code box below.

#include octave/oct.h
#include octave/dcolvector.h
#include math.h
#define PI 3.14159265

DEFUN_DLD ( super_smoother, args, nargout,
  "-*- texinfo -*-\n\
     @deftypefn {Function File} {} super_smoother (@var{n})\n\
     This function smooths the input column vector by using Elher's\n\
     super smoother algorithm set for a 10 bar critical period.\n\
     @end deftypefn" )
{
  octave_value_list retval_list ;
  int nargin = args.length () ;

// check the input arguments
    if ( nargin != 1 )
    {
        error ("Invalid argument. Argument is a column vector of price") ;
        return retval_list ;
    }

    if (args(0).length () < 7 )
    {
        error ("Invalid argument. Argument is a column vector of price") ;
        return retval_list ;
    }

    if (error_state)
    {
        error ("Invalid argument. Argument is a column vector of price") ;
        return retval_list ;
    }
// end of input checking

  ColumnVector input = args(0).column_vector_value () ;
  ColumnVector output = args(0).column_vector_value () ;
  
  // Declare variables
  double a1 = exp( -1.414 * PI / 10.0 ) ;
  double b1 = 2.0 * a1 * cos( (1.414*180.0/10.0) * PI / 180.0 ) ;
  double c2 = b1 ;
  double c3 = -a1 * a1 ;
  double c1 = 1.0 - c2 - c3 ;
 
  // main loop
  for ( octave_idx_type ii (2) ; ii < args(0).length () ; ii++ )
      {
      output(ii) = c1 * ( input(ii) + input(ii-1) ) / 2.0 + c2 * output(ii-1) + c3 * output(ii-2) ; 
      } // end of for loop
     
  retval_list(0) = output ;

  return retval_list ;
  
} // end of function

I like this filter because it smooths away all cycles below the 10 bar critical period, and long time readers of this blog may recall from this post that there are no dominant cycle periods in my EOD data that are this short. The filter produces very nice, smooth curves which act as a good model for the idealised market, smooth curves my NN system was trained on. A typical result on synthetic data is shown below

where the red line is a typical example of "true underlying" NN training data, the cyan is the red plus realistic noise as described above, and the yellow is the final "super smoother" filter of the cyan "price". There is a bit of lag in this filter but I'm not unduly concerned with this as it is intended as input to the classification NN, not the market timing NN. Below is a short film of the performance of this filter on the last two years or so of the ES mini S&P contract. The cyan line is the raw input and the red line is the super smoother output.

Non-embedded view here.

Saturday, 30 June 2012

Machine Learning Course Completed

I'm pleased to say that I have now completed Andrew Ng's machine learning course, which is offered through Coursera. This post is not intended to be a review of the course, which in my opinion is extremely good and very useful, but more of a reflection of my thoughts and what I think will be useful for me personally.

Firstly, I was pleasantly surprised that the software/programming language of instruction was Octave, which regular readers of this blog will know is my main software of choice. Apart from learning the concepts of ML, I also picked up some handy tips for Octave programming, and more importantly for me I now have a set of working Octave ML functions that I can use immediately in my system development.

In my previous post I mentioned that my first attempt at using ML will be to use a Neural Net to classify market types. As background to this, readers might be interested in a pdf file of the video lectures, available from here, which was put together and posted on the course discussion forum by another student - I think this is very good and all credit to said student, José Soares Augusto.

Due to the honour code ( or honor code for American readers ) of the course I will be unable to post the code that I wrote for the programming assignments. However, I do feel that I can post the code shown in the code box below, as the copyright notice allows it. A few slight changes I made are noted in the copyright notice. This is a minimisation function that was used in the training of the Neural Net assignment and was provided in the assignment download.

function [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
% Minimize a continuous differentialble multivariate function. Starting point
% is given by "X" (D by 1), and the function named in the string "f", must
% return a function value and a vector of partial derivatives. The Polack-
% Ribiere flavour of conjugate gradients is used to compute search directions,
% and a line search using quadratic and cubic polynomial approximations and the
% Wolfe-Powell stopping criteria is used together with the slope ratio method
% for guessing initial step sizes. Additionally a bunch of checks are made to
% make sure that exploration is taking place and that extrapolation will not
% be unboundedly large. The "length" gives the length of the run: if it is
% positive, it gives the maximum number of line searches, if negative its
% absolute gives the maximum allowed number of function evaluations. You can
% (optionally) give "length" a second component, which will indicate the
% reduction in function value to be expected in the first line-search (defaults
% to 1.0). The function returns when either its length is up, or if no further
% progress can be made (ie, we are at a minimum, or so close that due to
% numerical problems, we cannot get any closer). If the function terminates
% within a few iterations, it could be an indication that the function value
% and derivatives are not consistent (ie, there may be a bug in the
% implementation of your "f" function). The function returns the found
% solution "X", a vector of function values "fX" indicating the progress made
% and "i" the number of iterations (line searches or function evaluations,
% depending on the sign of "length") used.
%
% Usage: [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
%
% See also: checkgrad 
%
% Copyright (C) 2001 and 2002 by Carl Edward Rasmussen. Date 2002-02-13
%
% (C) Copyright 1999, 2000 & 2001, Carl Edward Rasmussen
% 
% Permission is granted for anyone to copy, use, or modify these
% programs and accompanying documents for purposes of research or
% education, provided this copyright notice is retained, and note is
% made of any changes that have been made.
% 
% These programs and documents are distributed without any warranty,
% express or implied.  As the programs were written for research
% purposes only, they have not been tested to the degree that would be
% advisable in any important application.  All use of these programs is
% entirely at the user's own risk.
%
% [ml-class] Changes Made:
% 1) Function name and argument specifications
% 2) Output display
%
% Dekalog Changes Made:
% Some lines have been altered, changing | to || and & to &&.
% This is to avoid "possible Matlab-style short-circuit operator" warnings
% being given when code is run under Octave. The lines where these changes
% have been made are indicated by comments at the end of each respective line. 

% Read options
if exist('options', 'var') && ~isempty(options) && isfield(options, 'MaxIter')
    length = options.MaxIter;
else
    length = 100;
end


RHO = 0.01;                            % a bunch of constants for line searches
SIG = 0.5;       % RHO and SIG are the constants in the Wolfe-Powell conditions
INT = 0.1;    % don't reevaluate within 0.1 of the limit of the current bracket
EXT = 3.0;                    % extrapolate maximum 3 times the current bracket
MAX = 20;                         % max 20 function evaluations per line search
RATIO = 100;                                      % maximum allowed slope ratio

argstr = ['feval(f, X'];                      % compose string used to call function
for i = 1:(nargin - 3)
  argstr = [argstr, ',P', int2str(i)];
end
argstr = [argstr, ')'];

if max(size(length)) == 2, red=length(2); length=length(1); else red=1; end
S=['Iteration '];

i = 0;                                            % zero the run length counter
ls_failed = 0;                             % no previous line search has failed
fX = [];
[f1 df1] = eval(argstr);                      % get function value and gradient
i = i + (length<0);                                            % count epochs?!
s = -df1;                                        % search direction is steepest
d1 = -s'*s;                                                 % this is the slope
z1 = red/(1-d1);                                  % initial step is red/(|s|+1)

while i < abs(length)                                      % while not finished
  i = i + (length>0);                                      % count iterations?!

  X0 = X; f0 = f1; df0 = df1;                   % make a copy of current values
  X = X + z1*s;                                             % begin line search
  [f2 df2] = eval(argstr);
  i = i + (length<0);                                          % count epochs?!
  d2 = df2'*s;
  f3 = f1; d3 = d1; z3 = -z1;             % initialize point 3 equal to point 1
  if length>0, M = MAX; else M = min(MAX, -length-i); end
  success = 0; limit = -1;                     % initialize quanteties
  while 1
    while ((f2 > f1+z1*RHO*d1) || (d2 > -SIG*d1)) && (M > 0) % | and & changed to || and && to avoid "possible Matlab-style short-circuit operator" warning 
      limit = z1;                                         % tighten the bracket
      if f2 > f1
        z2 = z3 - (0.5*d3*z3*z3)/(d3*z3+f2-f3);                 % quadratic fit
      else
        A = 6*(f2-f3)/z3+3*(d2+d3);                                 % cubic fit
        B = 3*(f3-f2)-z3*(d3+2*d2);
        z2 = (sqrt(B*B-A*d2*z3*z3)-B)/A;       % numerical error possible - ok!
      end
      if isnan(z2) || isinf(z2)     % | changed to || to avoid "possible Matlab-style short-circuit operator" warning 
        z2 = z3/2;                  % if we had a numerical problem then bisect
      end
      z2 = max(min(z2, INT*z3),(1-INT)*z3);  % don't accept too close to limits
      z1 = z1 + z2;                                           % update the step
      X = X + z2*s;
      [f2 df2] = eval(argstr);
      M = M - 1; i = i + (length<0);                           % count epochs?!
      d2 = df2'*s;
      z3 = z3-z2;                    % z3 is now relative to the location of z2
    end
    if f2 > f1+z1*RHO*d1 || d2 > -SIG*d1                    % | changed to || to avoid "possible Matlab-style short-circuit operator" warning
      break;                                                % this is a failure
    elseif d2 > SIG*d1
      success = 1; break;                                             % success
    elseif M == 0
      break;                                                          % failure
    end
    A = 6*(f2-f3)/z3+3*(d2+d3);                      % make cubic extrapolation
    B = 3*(f3-f2)-z3*(d3+2*d2);
    z2 = -d2*z3*z3/(B+sqrt(B*B-A*d2*z3*z3));        % num. error possible - ok!
    if ~isreal(z2) || isnan(z2) || isinf(z2) || z2 < 0   % num prob or wrong sign? % | changed to || to avoid "possible Matlab-style short-circuit operator" warning
      if limit < -0.5                               % if we have no upper limit
        z2 = z1 * (EXT-1);                 % the extrapolate the maximum amount
      else
        z2 = (limit-z1)/2;                                   % otherwise bisect
      end
    elseif (limit > -0.5) && (z2+z1 > limit)       % extraplation beyond max?   % & changed to && to avoid "possible Matlab-style short-circuit operator" warning
      z2 = (limit-z1)/2;                                               % bisect
    elseif (limit < -0.5) && (z2+z1 > z1*EXT)      % extrapolation beyond limit % & changed to && to avoid "possible Matlab-style short-circuit operator" warning
      z2 = z1*(EXT-1.0);                           % set to extrapolation limit
    elseif z2 < -z3*INT
      z2 = -z3*INT;
    elseif (limit > -0.5) && (z2 < (limit-z1)*(1.0-INT))   % too close to limit? % & changed to && to avoid "possible Matlab-style short-circuit operator" warning
      z2 = (limit-z1)*(1.0-INT);
    end
    f3 = f2; d3 = d2; z3 = -z2;                  % set point 3 equal to point 2
    z1 = z1 + z2; X = X + z2*s;                      % update current estimates
    [f2 df2] = eval(argstr);
    M = M - 1; i = i + (length<0);                             % count epochs?!
    d2 = df2'*s;
  end                                                      % end of line search

  if success                                         % if line search succeeded
    f1 = f2; fX = [fX' f1]';
    fprintf('%s %4i | Cost: %4.6e\r', S, i, f1);
    s = (df2'*df2-df1'*df2)/(df1'*df1)*s - df2;      % Polack-Ribiere direction
    tmp = df1; df1 = df2; df2 = tmp;                         % swap derivatives
    d2 = df1'*s;
    if d2 > 0                                      % new slope must be negative
      s = -df1;                              % otherwise use steepest direction
      d2 = -s'*s;    
    end
    z1 = z1 * min(RATIO, d1/(d2-realmin));          % slope ratio but max RATIO
    d1 = d2;
    ls_failed = 0;                              % this line search did not fail
  else
    X = X0; f1 = f0; df1 = df0;  % restore point from before failed line search
    if ls_failed || i > abs(length)         % line search failed twice in a row % | changed to || to avoid "possible Matlab-style short-circuit operator" warning
      break;                             % or we ran out of time, so we give up
    end
    tmp = df1; df1 = df2; df2 = tmp;                         % swap derivatives
    s = -df1;                                                    % try steepest
    d1 = -s'*s;
    z1 = 1/(1-d1);                     
    ls_failed = 1;                                    % this line search failed
  end
  if exist('OCTAVE_VERSION')
    fflush(stdout);
  end
end
fprintf('\n');

Finally, the last set of videos talked about "Artificial Data Synthesis," otherwise known as creating your own data for training purposes. This is basically what I had planned to do anyway ( see previous post ), but it is nice to learn that it is standard, accepted practice in the ML world. The first such way of creating data, in the context of Photo OCR, is shown below

where various font libraries are used against random backgrounds. I think this very much mirrors my planned approach of training on repeated sets of my "ideal time series" construct. However, another approach which could be used is "data distortion," shown in this next image

which is an approach that my creating synthetic data using FFT might be useful for, or alternatively a correlation and cointegration approach as shown in R code in this Quantitative Finance thread.

All in all, I'm quite excited by the possibilities of my new found knowledge, and I fully expect that in time, after development and testing, any Neural Net I develop will in fact replace my current Naive Bayesian classifier.

Tuesday, 3 January 2012

Creating Synthetic Data Using the Fast Fourier Transform

This question was recently asked on the Quantitative Finance forum and one answerer's link to here piqued my interest. My comments to said answer were
"Interesting link! I theorise that one way it could be done is to apply a Fast Fourier Transform (e.g. FFT function in Octave/MATLAB) to the original series, then divide the resulting vector into segments and within each separate segment randomly permute the values, and finally reconstruct the time series by applying the Inverse Fast Fourier Transform (IFFT function). By randomly permuting the values you will not alter the amplitudes of the composite sine waves but will slightly alter their periodicity - this will naturally tend to restrict the range of the new series to that of the original."
and
"This is such an intriguing approach that I may "knock up" some Octave code on my blog and post a link to it from here."
This blog post is my attempt to "knock up" the Octave code.

The "guts" of the code are a .oct function that randomly permutes the FFT output vector that is given to it as its only input argument and the C++ code for this .oct function is given below.
Note: due to formatting issues the headers are not correctly displayed; the octave headers should be enclosed between a "<" and ">" after the #include.

#include octave/oct.h
#include octave/dColVector.h
#include octave/CNDArray.h
#include "MersenneTwister.h"
     
DEFUN_DLD (permute_vector, args, , "Input is a vector that is to be permuted once")
{
octave_value_list retval;

int nargin = args.length () ;
int vec_length = args(0).length () ;

// check the input arguments
    if ( nargin > 1 )
    {
        error ("Invalid arguments. Input is a vector that is to be permuted once") ;
        return retval ;
    }
    
    if (vec_length < 2 )
    {
        error ("Invalid arguments. Input is a vector that is to be permuted once") ;
        return retval ;
    }
    
    if (error_state)
    {
        error ("Invalid arguments. Input is a vector that is to be permuted once") ;
        return retval ;
    }
// end of input checking

 ComplexNDArray input_vector = args(0).complex_array_value () ;
 ComplexNDArray output_vector = args(0).complex_array_value () ;
 int k1;
 int k2;
 
 MTRand mtrand1; // Declare the Mersenne Twister Class - will seed from system time

       k1 = vec_length - 1; // initialise prior to shuffling the vector

       while (k1 > 0) // While at least 2 left to shuffle
             {          
             k2 = mtrand1.randInt( k1 ); // Pick an int from 0 through k1 

               if (k2 > k1) // check if random vector index no. k2 is > than max vector index - should never happen 
                  {
                  k2 = k1 - 1; // But this is cheap insurance against disaster if it does happen
                  }

             output_vector(k1) = input_vector(k2) ; // allocate random pick k2 from input_vector to the k1 vector index of output_vector
             input_vector(k2) = input_vector(k1) ; // replace random pick k2 content of input_vector with content k1 of input_vector
             k1 = k1 - 1; // count down 
             } // Shuffling is complete when this while loop exits

 retval(0) = output_vector ; 
 
return retval; // Return the output to Octave
}

The Octave script that calls the above function is

clear all

contract = input( "Enter contract symbol e.g. sp: ","s") ;
data = load("-ascii",contract) ;
n = rows(data)
index_begin = input( "Enter index_begin: ") ;
index_end = input( "Enter index_end, value not greater than n: ") ;

close = data(index_begin:index_end,7) ;

% detrend the close vector prior to applying the fft
slope = ( close(end) - close(1) ) / ( length(close) - 1 ) ;
v1 = (0:1:length(close)-1)' ;
detrended_close = close .- ( v1 .* slope ) ;
close_index_begin = close(1)
detrended_close_index_begin = detrended_close(1)
detrended_close_index_end = detrended_close(end)

% create zero padded vector for fft
L2 = 2^nextpow2( length(close) ) ; half_L2 = L2/2 ;
y2 = zeros( 1,L2 ) ; y2( 1:length(close) ) = detrended_close ;

% apply the fft
transform = fft( y2 ) ;

% permute the first half of the transform vector in "chunks" of 10
max_ii_value = floor( half_L2 / 10 ) ;
for ii = 1:max_ii_value
transform( (ii*10):(ii*10)+9 ) = permute_vector( transform( (ii*10):(ii*10)+9 ) ) ;
endfor

% take the inverse fft
ifft_vec = real( ifft( transform ) ) ;

% retrend the ifft_permuted_vec
retrended_ifft_vec = ifft_vec( 1:length(close) )' .+ ( v1 .* slope ) ;

% statistics
correl = corrcoef (close, retrended_ifft_vec)
spear = spearman (close, retrended_ifft_vec)
tau = kendall (close, retrended_ifft_vec)

plot( close,'b',retrended_ifft_vec,'r' ) ; legend( 'close','permute' ) ;

This script could undoubtedly be made more efficient by putting the for loop on lines 28 to 30 within the .oct function itself, but this is just a proof of concept. The script allows one to extract any contiguous section of data by use of the index_begin and index_end inputs or, of course, to select the whole data range. Some simple statistics are calculated using in-built Octave functions and can be seen in the terminal windows in the screen shots. Finally the script plots the original input data along with the FFT permuted output data. Typical screen shots are shown below, with the original data in blue and the FFT permuted data in red.

All S&P data

The last few years of the above

The same last few years indexed using the index_begin and index_end method to apply the function just to these years and not the whole data set

The close_index_begin, detrended_close_index_begin and detrended_close_index_end values also seen in the terminal windows are a simple sanity check to see that the original data has been detrended correctly prior to performing the FFT.

Pages