3.4 Identifying Hidden Periodicities: Approach and Examples

The following flowchart, which also appears in the preliminary pages of this dissertation, suggests a strategy for analyzing a data set. This section contains examples illustrating the application of the procedure presented here.

[Figure: flowchart for identifying hidden periodicities, top part and bottom part]

Missing values in a data set;  NaN , Not A Number

Data sets sometimes have missing values. Such entries can be represented via the MATLAB matrix entry  NaN  (‘Not a Number’). However, any arithmetic calculation using a  NaN  yields  NaN  as the final result. Therefore,  NaN  entries must be removed or replaced by interpolated values before any processing can be done on a data set.
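
The way a single missing entry contaminates a calculation can be seen in a short sketch. The data values below are hypothetical and chosen only for illustration; the functions used ( sum ,  mean ,  isnan ) are standard in both MATLAB and Octave.

```matlab
% Hypothetical data values; the third reading is missing.
y = [3 5 NaN 4];

total = sum(y);                    % NaN: the missing entry contaminates the sum
avg   = mean(y);                   % NaN, for the same reason

missing     = isnan(y);            % logical mask marking the missing entry
total_known = sum(y(~missing));    % 12, the sum of the recorded values only

% NaN == NaN is false, so missing entries must be located with isnan, not ==.
found_by_equality = any(y == NaN); % false
```

Because equality tests against  NaN  always fail,  isnan  is the usual way to locate the entries that must be removed or replaced.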

The following MATLAB function can be used to remove ordered pairs of the form  (t,NaN)  from a data set.

MATLAB FUNCTION delnan

The MATLAB function  delnan  (‘delete not a number’), with source code given below, removes the  NaN  values from a data set  D . To use the function, type:

Dr = delnan(D);

INPUT and OUTPUT

The input is a matrix  D ; the first column of  D  contains the time values of data points, and the second column contains the corresponding data values. Thus, each row of  D  is a data point.

The output is a matrix  Dr  that is identical to  D , except that rows of the form  (t,NaN)  have been removed.

Source code for  delnan
[Listing: source code for the function  delnan ]
Octave requires  endfunction  as the last line.
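
Since the original listing is not reproduced here, the following is a minimal sketch of a function with the behavior described above. It is a reconstruction for illustration only, not the author's code; it assumes that  D  has exactly two columns and that missing values occur only in the second column.

```matlab
function Dr = delnan(D)
% DELNAN  Remove rows of the form (t, NaN) from a two-column data matrix.
%   Reconstruction for illustration; not the original listing.
keep = ~isnan(D(:,2));   % true for rows whose data value is present
Dr = D(keep,:);          % keep those rows, preserving their order
end
```

For instance,  delnan([1 10; 2 NaN; 3 12])  returns the matrix  [1 10; 3 12] .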

The use of  delnan  is illustrated in the next example.

EXAMPLE 1

The data set graphed below gives the total daily attendance in three undergraduate classes taught by the author during the Fall 1993 semester at Idaho State University. Two sections of Math 111 (Algebra) and one section of Math 120 (Calculus for the Life and Social Sciences) were taught.

Each class met four days per week: Monday, Tuesday, Wednesday, and Friday. The data values recorded as  NaN  represent days when classes did not meet.

This data set is stored in a MATLAB matrix  D . The symbols M, T, W, F (the days of the week that classes met) are shown below only for the reader's information, and are not included in the matrix  D .

[Figure: the data set for total daily attendance in three undergraduate classes, and a graph of total daily attendance]

Remove  NaN  entries from the data set;  Dr ,  tr ,  yr

First, the MATLAB function  delnan  is used to remove the  NaN  entries from  D . The resulting matrix is named  Dr  (‘r’ for ‘removed’). The first column of  Dr  is named  tr , and the second column is named  yr .

Observe that row $\,9\,$ of  D , which contained a  NaN  entry, has been removed.

[Listing: deleting the  NaN  entries]

Fit  Dr  with a parabola

The ‘downward trend’ in the data is quite striking, so linear least-squares approximation techniques are used first to fit the matrix  Dr = [tr yr]  with a parabola. The parabola component obtained is named  y1 :

[Listing: fitting the data with a parabola]

Octave requires:
f1 = ones(size(tr));
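
A linear least-squares fit with a parabola can be sketched along the following lines. The data are a hypothetical stand-in, not the attendance values; the names  f2 ,  f3 ,  X , and  b  are introduced here for illustration and are assumptions, as is the use of the backslash operator to solve the least-squares problem.

```matlab
% Hypothetical stand-in for the attendance data (not the values from Example 1).
tr = (1:10)';
yr = 50 - 0.8*tr - 0.05*tr.^2;

% Basis functions for a parabola: constant, linear, quadratic.
f1 = ones(size(tr));
f2 = tr;
f3 = tr.^2;

X  = [f1 f2 f3];    % design matrix, one column per basis function
b  = X \ yr;        % least-squares coefficients [b1; b2; b3]
y1 = X * b;         % fitted parabola evaluated at the times tr
```

Because the stand-in data lie exactly on a parabola, the computed coefficients agree with  50 ,  -0.8 , and  -0.05  up to rounding error; with real data,  y1  is the parabola closest to  yr  in the least-squares sense.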

[Figure: the downward trend in the data set]

Subtract parabola; analyze the remainder  yrem

The parabola component  y1  is subtracted from  yr ; the difference is named  yrem . The list  yrem  is tested for random behavior using the Turning Point Test:

[Listing: applying the Turning Point Test]

[Figure: the data with the parabola component removed]

The information given in  rand  shows that if  yrem  were generated by random means alone, then one would expect to see about $\,38.7\,$ turning points. There are actually $\,43\,$ turning points in  yrem . The probability that $\,43\,$ or more turning points would occur, should the data be truly random, is about $\,53.2\%\,.$ A decision is made to search for sinusoidal components.
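
The counts used by a turning point test can be sketched as follows. This is not the author's routine: the sequence is hypothetical, the mean $\,2(n-2)/3\,$ and variance $\,(16n-29)/90\,$ are the standard values for a random sequence of length $\,n\,,$ and the upper-tail probability uses a normal approximation. The probability convention used to produce the figures quoted above is not shown here, so this sketch may not reproduce them.

```matlab
% Hypothetical sequence, used only to exercise the calculation.
y = [1 3 2 4 3 5 4 6 5 7]';
n = length(y);

d = diff(y);                            % successive differences
T = sum(d(1:end-1) .* d(2:end) < 0);    % sign changes = turning points

mu    = 2*(n - 2)/3;                    % expected count under randomness
sigma = sqrt((16*n - 29)/90);           % standard deviation of the count

z       = (T - mu)/sigma;               % standardized count
p_upper = 0.5*erfc(z/sqrt(2));          % normal approx. to P(count >= T)
```

For a list of $\,60\,$ values the formula for the mean gives $\,2(58)/3 \approx 38.7\,,$ which would agree with the expected count quoted above if  yrem  has $\,60\,$ entries; the length of  yrem  is not stated in this section.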

Use spline interpolation

There are no obvious sinusoidal components in  yrem . The periodogram is obtained as an initial data analysis tool. Since a uniform time list is required to find the periodogram, spline interpolation is used to interpolate the data set  [tr yrem] . Then, the periodogram of the interpolated data set is computed. Both the interpolated data set and its periodogram are graphed below.
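
The two steps just described, interpolation onto a uniform time list followed by a periodogram, can be sketched as follows. The data are hypothetical (a period-$8$ sinusoid with some time values omitted), and the normalization of the squared coefficients is an assumption; the author's periodogram routine may scale them differently.

```matlab
% Hypothetical nonuniform samples of a period-8 sinusoid (times 4, 9, 14, ... omitted).
t_all = (0:63)';
tr    = t_all(mod(t_all,5) ~= 4);
yrem  = sin(2*pi*tr/8);

% Spline-interpolate onto a uniform grid with unit spacing.
tu = (tr(1):tr(end))';
yu = interp1(tr, yrem, tu, 'spline');

% Periodogram: squared magnitudes of the discrete Fourier coefficients.
N = length(tu);
Y = fft(yu - mean(yu));
k = (1:floor(N/2))';              % frequency indices, excluding the mean
sqrcoef = abs(Y(k+1)).^2 / N;     % one common normalization (an assumption)
per = N ./ k;                     % period, in time steps, for each index

[~, imax] = max(sqrcoef);
peak_period = per(imax);          % period at which the periodogram peaks
```

The periodogram can only report periods of the form $\,N/k\,,$ so it serves as a coarse guide to where sinusoidal components may lie rather than as a precise estimate.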

[Listing: spline interpolation and computation of the periodogram]

[Figure: the spline-interpolated data set]

[Figure: periodogram for the spline-interpolated data set]

Using a genetic algorithm

A genetic algorithm is used to search for sinusoidal components in the range of periods from $\,2\,$ to $\,63\,.$

A good fit is found using sinusoids with periods $\,14.5\,,$ $\,5.5\,,$ and $\,32.5\,.$ Selected commands used in the fitting process are shown below. Also, the graph of the best fit is given.
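
The genetic algorithm itself is not reproduced here. The sketch below shows only one way a candidate set of periods might be scored, which is an assumption about the fitness measure: for fixed periods, the sine and cosine amplitudes enter the model linearly, so they can be found by linear least squares, and the residual sum of squares then measures the quality of the candidate. The data and the names  design  and  sse  are hypothetical.

```matlab
% Hypothetical data: two sinusoids with periods 14.5 and 5.5 on a unit grid.
t = (1:60)';
y = 3*sin(2*pi*t/14.5) + 2*cos(2*pi*t/5.5);

% Design matrix for a row vector p of candidate periods:
% one sine column and one cosine column per period.
design = @(p) [sin(2*pi*t*(1./p)) cos(2*pi*t*(1./p))];

% Residual sum of squares after the best linear fit with those periods.
sse = @(p) norm(y - design(p)*(design(p)\y))^2;

err_true  = sse([14.5 5.5]);    % candidate containing the true periods
err_wrong = sse([20 9]);        % candidate with unrelated periods
```

A search procedure would favor candidates with a small residual; here  err_true  is zero up to rounding error, while  err_wrong  is not.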

[Listing: a genetic algorithm to search for sinusoidal components; plotting a best fit from the genetic algorithm]

[Figure: best fit to  yrem  from the genetic algorithm]

Using gradient methods to improve the fit

Gradient methods are used to improve the fit. After two iterations, a ‘best’ fit is found, with approximate periods $\,5.7\,,$ $\,33.0\,,$ and $\,13.6\,.$ Selected commands used are shown below. The graph of the best fit is given, as well as the graphs of each of the three sinusoidal components.

These three components have a reasonable interpretation: period $\,5.7\,$ is slightly longer than a weekly component; period $\,13.6\,$ represents the time between exams in the course (this component peaks at the exam dates); and period $\,33.0\,$ is approximately half the semester (this component peaks at mid-semester).
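
The gradient routine used by the author is not listed here. As a stand-in, the following sketch refines a single period with a Gauss-Newton iteration, one common gradient-based method. The data are hypothetical, with true period $\,13.6\,$ and starting value $\,14.5\,;$ the amplitudes are refit by linear least squares at every pass, which is a choice made for this sketch.

```matlab
% Hypothetical data: one sinusoid with true period 13.6.
t = (1:60)';
y = 4*sin(2*pi*t/13.6) + 1.5*cos(2*pi*t/13.6);

p = 14.5;                          % starting period, e.g. from a coarse search
for iter = 1:20
    w  = 2*pi/p;
    S  = sin(w*t);
    C  = cos(w*t);
    ab = [S C] \ y;                % best amplitudes for the current period
    r  = y - [S C]*ab;             % residual of the current fit
    dp = (ab(1)*C - ab(2)*S) .* (-w*t/p);   % derivative of the model w.r.t. p
    J  = [S C dp];                 % Jacobian with respect to (a, b, p)
    step = J \ r;                  % Gauss-Newton step
    p  = p + step(3);              % update the period; a and b are refit next pass
end
```

An iteration of this kind can be expected to converge only when the starting period is already close to the true one, which is why a coarse search such as the periodogram or a genetic algorithm is used first.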

[Listing: improving the fit with gradient methods]

[Figures: best fit from gradient methods; period 5.7 component; period 13.6 component; period 33 component]

The sum of the parabola component and the three sinusoidal components is shown with the original data:

[Figure: parabola plus sinusoidal components, with the original data]

EXAMPLE 2

The data set graphed below gives the daily balance in a checking account from $\,1/14/92\,$ to $\,12/31/93\,.$ Both a ‘point’ graph and a ‘line’ graph of the data are shown.

There are several trends in the data, due to different types of employment for one contributor to the account:

These three periods are roughly delineated in the first graph below, by the dashed vertical lines.

The data set is stored in a MATLAB matrix  dlybal . The first column of  dlybal  is named  t , and the second column is named  y .

[Figure: daily balance in a checking account for almost two years]

[Figures: daily balance, point graph; daily balance, line graph]

Least-squares fit

A least-squares polynomial fit is used to account for the different employment types. After some experimentation, the analyst decides that a fifth-order polynomial best accounts for these trends. The polynomial fit is named  poly , and the difference between  y  and  poly  is named  diff1 .

[Listing: getting a least-squares polynomial fit]

Octave requires:
f1 = ones(size(t));
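
A fifth-order least-squares polynomial fit can be sketched as follows. The data are a hypothetical stand-in, not the account balances. Centering and scaling the time values before forming the powers is a choice made for this sketch, to keep the design matrix well conditioned when  t  runs into the hundreds; it is not taken from the original listing.

```matlab
% Hypothetical stand-in for the daily balance data (not the values from Example 2).
t = (1:718)';
y = 1000 + 2*t - 0.004*t.^2;

% Center and scale the time values so that powers up to 5 stay well conditioned.
s = (t - mean(t)) / std(t);

f1 = ones(size(t));
X  = [f1 s s.^2 s.^3 s.^4 s.^5];   % design matrix for a fifth-order polynomial
c  = X \ y;                        % least-squares coefficients (in the scaled variable)
poly  = X * c;                     % polynomial fit; note this name shadows the built-in poly
diff1 = y - poly;                  % remainder to be searched for periodicities
```

The coefficients in  c  refer to the scaled variable  s , not to  t ; the fitted values  poly  and the remainder  diff1  are unaffected by the change of variable.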

[Figures: the polynomial fit,  poly ; the difference between the daily balance and the polynomial fit,  diff1 ]

Periodogram of diff1

As a preliminary analysis tool, the periodogram of  diff1  is found. Inspection of the matrix  [per sqrcoef]  shows that the three highest peaks occur at periods $\,14,$ $\,120,$ and $\,180.$

[Figure: periodogram of  diff1 ]

Genetic algorithm

As another preliminary analysis tool, a genetic algorithm is applied to  diff1 . Due to memory constraints on the analyst’s computer, only a very limited application is made. However, a component close to $\,180\,$ does emerge in one of the ‘best-fit’ strings. Also, inspection of each population shows that period $\,365\,$ appears in many ‘high-fit’ strings.

Gradient methods

Based on these initial analyses, gradient methods are used to fit  diff1  with sinusoids of periods $\,14,$ $\,120,$ $\,180,$ and $\,365\,.$

After two iterations of the MATLAB function  nonlin , a best fit to  diff1  is obtained using periods of approximately $\,14,$ $\,116,$ $\,166,$ and $\,286.$ These are roughly $2$-week, $\,\frac 13$-year, $\,\frac 12$-year, and $\,\frac 34$-year components.
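
The rough correspondence between the fitted periods and fractions of a year can be checked with a short calculation, assuming that the time values are measured in days and taking a year as $\,365\,$ days.

```matlab
periods  = [14 116 166 286];      % fitted periods from the text, assumed to be in days
in_weeks = periods(1) / 7;        % 2 weeks
in_years = periods(2:4) / 365;    % about 0.32, 0.45 and 0.78 of a year
```

These values are close to, though not exactly, $\,\frac 13\,,$ $\,\frac 12\,,$ and $\,\frac 34\,$ of a year, which is the sense in which the components are described as ‘roughly’ of those lengths.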

[Listing: application of  nonlin ]

The total approximation

By adding  poly  to the best fit obtained from  diff1 , the total approximation to the data is found. This approximation is graphed below using an ‘x’, superimposed over the actual data set (as a line graph). The graph is broken into $\,4\,$ smaller pieces for easier readability.

[Figures: total approximation, graphs #1 through #4]