Friday 17 September 2021

Matrix Profile and Weakly Labelled Data - Update 1

This is the first post in a short series detailing my recent work following on from my previous post. This post will be about some problems I have had and how I partially solved them.

The main problem was simply the speed at which the code (available from the companion website) seems to run. The first stage Matrix Profile code runs in a few seconds, the second, individual evaluation stage in no more than a few minutes, but the third stage, greedy search, which uses Golden Section Search over the pattern candidates, can take many, many hours. My approach to this was simply to optimise the code to the best of my ability. My optimisations, all in the compute_f_meas.m function, are shown in the following code boxes. This while loop

i = 1;
while true
    
 if i >= length(anno_st)
    break;
 endif
       
first_part = anno_st(1:i);
second_part = anno_st(i+1:end);
bad_st = abs(second_part - anno_st(i)) < sub_len;
second_part = second_part(~bad_st);
anno_st = [first_part; second_part;];
i = i + 1;
      
endwhile
is replaced by this .oct compiled version of the same while loop
#include 
#include 

DEFUN_DLD ( stds_f_meas_while_loop_replace, args, nargout,
"-*- texinfo -*-\n\
@deftypefn {Function File} {} stds_f_meas_while_loop_replace (@var{input_vector,sublen})\n\
This function takes an input vector and a scalar sublen\n\
length. The function sets to zero those elements in the\n\
input vector that are closer to the preceeding value than\n\
sublen. This function replaces a time consuming .m while loop\n\
in the stds compute_f_meas.m function.\n\
@end deftypefn" )

{
octave_value_list retval_list ;
int nargin = args.length () ;

// check the input arguments
if ( nargin != 2 ) // there must be a vector and a scalar sublen
   {
   error ("Invalid arguments. Inputs are a column vector and a scalar value sublen.") ;
   return retval_list ;
   }

if ( args(0).length () < 2 )
   {
   error ("Invalid 1st argument length. Input is a column vector of length > 1.") ;
   return retval_list ;
   }
   
if ( args(1).length () > 1 )
   {
   error ("Invalid 2nd argument length. Input is a scalar value for sublen.") ;
   return retval_list ;
   }
// end of input checking  
  
ColumnVector input = args(0).column_vector_value () ;
double sublen = args(1).double_value () ;
double last_iter ;

// initialise last_iter value
last_iter = input( 0 ) ;
     
for ( octave_idx_type ii ( 1 ) ; ii < args(0).length () ; ii++ )
    {
    
      if ( input( ii ) - last_iter >= sublen )
      {
        last_iter = input( ii ) ;
      }
      else
      {
        input( ii ) = 0.0 ;
      }
      
    } // end for loop
   
retval_list( 0 ) = input ;

return retval_list ;

} // end of function
and called thus
anno_st = stds_f_meas_while_loop_replace( anno_st , sub_len ) ;
anno_st( anno_st == 0 ) = [] ;
This for loop
is_tp = false(length(anno_st), 1);
for i = 1:length(anno_st)
    if anno_ed(i) > length(label)
        anno_ed(i) = length(label);
    end
    if sum(label(anno_st(i):anno_ed(i))) > 0.8*sub_len
        is_tp(i) = true;
    end
end
tp_pre = sum(is_tp);
is replaced by use of cellslices.m and cellfun.m thus
label_length = length( label ) ;
anno_ed( anno_ed > label_length ) = label_length ;
cell_slices = cellslices( label , anno_st , anno_ed ) ;
cell_sums = cellfun( @sum , cell_slices ) ;
tp_pre = sum( cell_sums > 0.8 * sub_len ) ;
and a further for loop
is_tp = false(length(pos_st), 1);
for i = 1:length(pos_st)
    if sum(anno(pos_st(i):pos_ed(i))) > 0.8*sub_len
        is_tp(i) = true;
    end
end
tp_rec = sum(is_tp);
is replaced by
cell_slices = cellslices( anno , pos_st , pos_ed ) ;
cell_sums = cellfun( @sum , cell_slices ) ;
tp_rec = sum( cell_sums > 0.8 * sub_len ) ;

Although the above measurably improves running times, overall the code of the third stage is still sluggish. I have found that the best way to deal with this, on the advice of the original paper's author, is to limit the number of patterns to search for, the "pat_max" variable, to the minimum possible to achieve a satisfactory result. What I mean by this is that if  pat_max = 5 and the result returned also has 5 identified patterns, incrementally increase pat_max until such time that the number of identified patterns is less than pat_max. This does, by necessity, mean running the whole routine a few times, but it is still quicker this way than drastically over estimating pat_max, i.e. choosing a value of say 50 to finally identify maybe only 5/6 patterns.

More in due course.

No comments: