Search For A Known Pattern In Time Series Data Using MASS #214

Jacks349 · 2020-07-12T09:46:44Z

I was doing some research on pattern recognition with Python and I found this library. First of all, I think it's very interesting, so good job! I'm just getting started with this, so I apologize if this question is naive.

Second, I wanted to ask if Stumpy is the right choice for what I want to do:
Basically, I'm trying to detect patterns in OHLC trading data. It works like this: I have a specific pattern, with a shape I defined myself, and an external dataset. I want to check if and where the dataset contains a shape similar to the pattern I specified.

Here is how I did it:
The pattern I specified is a series of local minima and maxima, normalized as percentage changes, that has a certain shape when charted: Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172]

Then I have the full dataset, which is OHLC data reduced to local minima and maxima and normalized in the same way: Data = [2.1502119927316805, -2.282834272161288, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 1.3121546961326038]

So basically, I need to check where Data takes a shape similar to Pattern. That said, my questions are: is this a use case for Stumpy? And if it is, which function am I going to use? In the examples, the dataset is compared against itself, while in this case I'm comparing two different datasets.

I tried the following:
mp = stumpy.stump(Data, len(Pattern), Pattern, ignore_trivial=False)

But got the following output:

A large number of values are smaller than 1e-05.
For a self-join, try setting `ignore_trivial = True`.
[[0.0 8 -1 -1]]

seanlaw (Maintainer) · 2020-07-12T18:40:58Z

@Jacks349 Thank you for your question and for your kind words. Firstly, since you are working with market data, I feel obliged to explicitly state that I am not a financial consultant. What I write here only pertains to the general usage of STUMPY for analyzing time series data and should never be taken or interpreted as financial advice. Past performance is not indicative of future results and you should consult with a registered investment advisor.

> is this a use case for Stumpy? And if it is, which function am i going to use?

In short, this is the ideal use case for STUMPY. In the most general case, where you have a time series and do not know if a pattern/motif even exists, you should use stumpy.stump to compute the matrix profile. The minima of the matrix profile identify the locations (indices) of potential patterns/motifs to explore further. Note that this "blind" search is not guaranteed to uncover a pattern/motif. In fact, this is the beauty of matrix profiles, as they make little to no assumptions about your time series (i.e., they don't assume that a pattern already exists). This blind, "pairwise sliding window" search is also significantly more computationally expensive than a search for a pattern that you already have in hand.

In this latter case, when you already have a pattern (a subsequence with a particular shape) in hand, then you'll want to use a different function called mass (that isn't documented in our public API):

from stumpy.core import mass
import numpy as np

Pattern = np.array([7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172])

Data = np.array([2.1502119927316805, -2.282834272161288, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 1.3121546961326038])

distance_profile = mass(Pattern, Data)

Here, the distance_profile is a vector of length len(Data) - len(Pattern) + 1 and contains the z-normalized distances between the Pattern and each subsequence (of length len(Pattern)) within Data. So, the smallest values (or local minima) within your distance_profile are your best "matches".
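For reference, the z-normalized distance described above can be written out as follows. This is a sketch of the usual definition rather than something taken from the STUMPY source: the symbols Q (the Pattern, of length m), T (the Data, of length n), and T_i (the window Data[i : i + m]) are notation introduced here, and the use of the population standard deviation for sigma is an assumption.

```latex
\hat{q}_j = \frac{q_j - \mu_Q}{\sigma_Q}, \qquad
\hat{t}^{(i)}_j = \frac{t_{i+j} - \mu_{T_i}}{\sigma_{T_i}}, \qquad
d_i = \sqrt{\sum_{j=0}^{m-1} \left( \hat{q}_j - \hat{t}^{(i)}_j \right)^2},
\qquad i = 0, \dots, n - m
```

Each d_i is one entry of distance_profile, which accounts for its n - m + 1 entries.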

If you haven't already, I strongly recommend reading this tutorial, which explains the difference between a "matrix profile" and a "distance profile".

Let me know if this makes sense.

seanlaw (Maintainer) · 2020-07-12T18:43:28Z

Also, note that we have a PR (work-in-progress) that should make this type of search easier.

Jacks349 (Author) · 2020-07-12T20:41:07Z

@seanlaw
Thank you very much for your answer. It was incredibly helpful and helped me understand more of how this cool library works!
Of course (sorry for putting you in an uncomfortable spot here), I'm not asking for financial advice in any way; the purpose of this question is strictly technical and only concerns how Stumpy works.

I'm sorry for sounding naive again, but I'm just getting started with these matters and I'm having some trouble understanding the output of the distance profile:

Printing distance_profile gives me the following:

[2.92781934e+00 4.55434915e+00 4.21544139e+00 3.29336127e+00
 4.72614564e+00 2.94202855e+00 3.33790488e+00 4.62672866e+00
 6.82856991e-08 4.51937582e+00 3.47144433e+00 4.17966567e+00
 3.26871969e+00 4.72146046e+00 2.53070957e+00 4.46398626e+00
 3.64503919e+00 2.64282983e+00 4.81577841e+00 2.69799924e+00
 4.64286098e+00 2.67446216e+00 4.52739326e+00 2.54663088e+00
 3.77556508e+00]

I understand the concept of z-normalized distances, and I also understand that the lowest values are what I'm looking for; the trouble I'm having is understanding how to go back to the "target" dataset. I mean, once I have computed the distance profile, how do I tell from it which parts of my target dataset (in this case Data) the lowest values correspond to?
Thanks a lot again.

seanlaw (Maintainer) · 2020-07-12T21:37:12Z

> once i computed the Distance Profile, how do i understand, from it, to what parts of my target dataset (that in this case is Data) do the lowest values correspond to?

So, conceptually, the mass function is doing the following (but in a highly optimized way):

window_size = len(Pattern)
distance_profile = np.full(len(Data) - window_size + 1, np.inf)

# compute_z_normalized_distance is a conceptual placeholder, not a STUMPY function
for i in range(len(distance_profile)):
    distance_profile[i] = compute_z_normalized_distance(Pattern, Data[i : i + window_size])

So, the ith element in your distance_profile corresponds to the z-normalized Euclidean distance between your Pattern and the ith subsequence in your Data (i.e., Data[i : i + window_size]). For example, distance_profile[3] has a value of 3.29336127, and this is computed as the z-normalized distance between your Pattern and the i=3 subsequence of length window_size from your Data:

distance_profile[3] = compute_z_normalized_distance(Pattern, Data[3 : 3 + 7])
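To make the conceptual loop above runnable, here is a minimal sketch that fills in compute_z_normalized_distance using only the Python standard library. The helper names z_normalize and naive_distance_profile are hypothetical, and the use of the population standard deviation is an assumption. It is a slow reference version for checking your understanding, not how mass is implemented.

```python
import math
from statistics import fmean, pstdev


def z_normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a constant subsequence cannot be z-normalized")
    return [(v - mu) / sigma for v in values]


def compute_z_normalized_distance(a, b):
    """Euclidean distance between the z-normalized copies of a and b."""
    return math.dist(z_normalize(a), z_normalize(b))


def naive_distance_profile(pattern, data):
    """Distance from pattern to every window of data, one window at a time."""
    window_size = len(pattern)
    return [
        compute_z_normalized_distance(pattern, data[i : i + window_size])
        for i in range(len(data) - window_size + 1)
    ]


Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746,
           4.7078371642204315, -2.9357312880190425, 2.098092643051778,
           -0.5337603416066172]

# The thread's Data, written as prefix + Pattern + suffix (Pattern starts at index 8).
Data = (
    [2.1502119927316805, -2.282834272161288, -3.00364077669902,
     2.533625273694082, -2.2574740695546116, 3.027465667915112,
     6.4222962738564, -2.647309991460278]
    + Pattern
    + [4.212503353903944, -2.600411946446969, 8.511763150938416,
       -3.775883069427527, 1.8227848101265856, 3.6300348085529524,
       -1.4635316698656395, 5.527148770392016, -1.476695892939546,
       12.248243559718961, -4.443980805341117, 1.9213973799126631,
       -9.061696658097686, 5.347467608951697, -2.8622540250447197,
       1.3121546961326038]
)

distance_profile = naive_distance_profile(Pattern, Data)
print(len(distance_profile))          # one entry per window of Data
print(round(distance_profile[3], 4))  # window Data[3:10]
print(distance_profile[8])            # window Data[8:15], identical to Pattern
```

The entry at index 8 comes out as zero because that window is a copy of Pattern, and the entry at index 3 lands at roughly 3.293, in line with the value quoted above.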

The smallest value in your distance_profile is 6.82856991e-08, at index i=8, and an index in the distance profile is also the starting index of the corresponding subsequence in your Data. So, if I counted my indices correctly, the subsequence Data[8:15]:

[7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172]

is the "closest" (by z-normalized Euclidean distance) to your Pattern (with a distance of 6.82856991e-08, which is essentially zero or an identical match).
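The lookup from a distance profile index back to the Data can also be done in code rather than by eye, which avoids misreading values printed in scientific notation. The sketch below reuses the distance_profile values printed earlier in this thread; the names best_start, best_stop, and ranked are made up for the example.

```python
Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746,
           4.7078371642204315, -2.9357312880190425, 2.098092643051778,
           -0.5337603416066172]

# The thread's Data, written as prefix + Pattern + suffix.
Data = (
    [2.1502119927316805, -2.282834272161288, -3.00364077669902,
     2.533625273694082, -2.2574740695546116, 3.027465667915112,
     6.4222962738564, -2.647309991460278]
    + Pattern
    + [4.212503353903944, -2.600411946446969, 8.511763150938416,
       -3.775883069427527, 1.8227848101265856, 3.6300348085529524,
       -1.4635316698656395, 5.527148770392016, -1.476695892939546,
       12.248243559718961, -4.443980805341117, 1.9213973799126631,
       -9.061696658097686, 5.347467608951697, -2.8622540250447197,
       1.3121546961326038]
)

# Values copied from the printed distance profile in this thread.
distance_profile = [
    2.92781934e+00, 4.55434915e+00, 4.21544139e+00, 3.29336127e+00,
    4.72614564e+00, 2.94202855e+00, 3.33790488e+00, 4.62672866e+00,
    6.82856991e-08, 4.51937582e+00, 3.47144433e+00, 4.17966567e+00,
    3.26871969e+00, 4.72146046e+00, 2.53070957e+00, 4.46398626e+00,
    3.64503919e+00, 2.64282983e+00, 4.81577841e+00, 2.69799924e+00,
    4.64286098e+00, 2.67446216e+00, 4.52739326e+00, 2.54663088e+00,
    3.77556508e+00,
]

window_size = len(Pattern)

# Index of the smallest distance = start of the best-matching window in Data.
best_start = min(range(len(distance_profile)), key=distance_profile.__getitem__)
best_stop = best_start + window_size  # exclusive, as in a Python slice
best_match = Data[best_start:best_stop]

# Start indices of the three closest windows, best first.
ranked = sorted(range(len(distance_profile)), key=distance_profile.__getitem__)[:3]

print(f"best match: Data[{best_start}:{best_stop}]")
print(f"three closest start indices: {ranked}")
```

With a NumPy array, np.argmin(distance_profile) and np.argsort(distance_profile) give the same indices. The ranking also shows that the value of about 2.53 at index 14 is only the second-closest window, well behind the near-zero distance at index 8.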

Does that make sense? If not, feel free to ask for more clarification where necessary. I'd be happy to elaborate more.

Jacks349 (Author) · 2020-07-12T22:50:12Z

Of course this makes sense! This is exactly what I was looking for. Even though I'm extremely new to all of this, I'm amazed at how much you can do with this library. I'll make sure to ask any other noob questions I have! A little feedback about the docs: the visualizations in the article you linked are very helpful; they make everything easier to understand.

seanlaw (Maintainer) · 2020-07-13T00:57:38Z

@Jacks349 Thank you for the feedback. The tutorials take quite a bit of time and effort to make, but we try very hard to communicate the concepts as simply as possible. We are always open to being better. I'd love to learn more about how you are planning to leverage STUMPY. Perhaps you'd like to connect on LinkedIn. Happy data mining!

Jacks349 (Author) · 2020-07-13T08:40:41Z

> I'd love to learn more about how you are planning to leverage STUMPY

Basically, my idea was to have a series of patterns that I specify myself and a target dataset. I'm trying to write a script that checks the "target" dataset for each of those patterns, finds where the "shape" is most similar, and gives an output like "detected x pattern from y1 to y2".
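A script of that kind could be sketched roughly as below, under several assumptions: the distance is the z-normalized Euclidean distance (computed here with a slow standard-library loop instead of mass), the function find_matches, the pattern name "pattern_x", and the max_distance cutoff of 1.0 are all hypothetical choices for illustration, and y2 is reported as an inclusive end index.

```python
import math
from statistics import fmean, pstdev


def z_normalize(values):
    """Zero mean, unit population standard deviation (assumed convention)."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a constant subsequence cannot be z-normalized")
    return [(v - mu) / sigma for v in values]


def distance_profile(pattern, data):
    """Z-normalized Euclidean distance from pattern to each window of data."""
    m = len(pattern)
    q = z_normalize(pattern)
    return [math.dist(q, z_normalize(data[i : i + m]))
            for i in range(len(data) - m + 1)]


def find_matches(patterns, data, max_distance):
    """Return (name, start, inclusive_end, distance) for windows within max_distance."""
    hits = []
    for name, pattern in patterns.items():
        m = len(pattern)
        for start, dist in enumerate(distance_profile(pattern, data)):
            if dist <= max_distance:
                hits.append((name, start, start + m - 1, dist))
    return hits


Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746,
           4.7078371642204315, -2.9357312880190425, 2.098092643051778,
           -0.5337603416066172]

# The thread's Data, written as prefix + Pattern + suffix.
Data = (
    [2.1502119927316805, -2.282834272161288, -3.00364077669902,
     2.533625273694082, -2.2574740695546116, 3.027465667915112,
     6.4222962738564, -2.647309991460278]
    + Pattern
    + [4.212503353903944, -2.600411946446969, 8.511763150938416,
       -3.775883069427527, 1.8227848101265856, 3.6300348085529524,
       -1.4635316698656395, 5.527148770392016, -1.476695892939546,
       12.248243559718961, -4.443980805341117, 1.9213973799126631,
       -9.061696658097686, 5.347467608951697, -2.8622540250447197,
       1.3121546961326038]
)

# "pattern_x" and the 1.0 cutoff are illustrative, not recommended values.
hits = find_matches({"pattern_x": Pattern}, Data, max_distance=1.0)
for name, y1, y2, dist in hits:
    print(f"detected {name} from {y1} to {y2} (distance {dist:.3g})")
```

On real data, neighbouring start positions can all fall under the cutoff for the same occurrence, so a fuller version would probably keep only the local minimum within each run of hits. A sensible cutoff also depends on the pattern length, since the largest possible z-normalized distance grows with it.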

So what I did is:

1. Get OHLC data.
2. Compute local minima and maxima for this OHLC data (my first idea was to use raw price data, but that would have been too much data; using local minima and maxima filters out a lot of noise and makes patterns easier to see) using:

from scipy.signal import argrelextrema
import numpy as np

df['min'] = df.iloc[argrelextrema(df.Open.values, np.less_equal, order=n)[0]]['Open']
df['max'] = df.iloc[argrelextrema(df.Open.values, np.greater_equal, order=n)[0]]['Open']

3. Chart the data and extract the pattern I want (the pattern will be made of local minima and maxima).
4. Since the local minima and maxima sit at a certain price X in one dataset but at another price Y in other datasets, it is better to normalize them by converting each local minimum or maximum to the variation from the previous point, which gives a one-dimensional array.
5. The missing link! This is the point where I am supposed to detect the patterns that I extracted in other datasets. This is the most difficult point, and the fact that I am extremely new to signal processing, statistics, pattern recognition and machine learning (because I thought of doing it with ML somehow) didn't help at all. So I was doing some research and found out about Stumpy on Reddit, and even though I didn't grasp a lot of the concepts, I understood it was something I would definitely need.
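The preprocessing in the steps above (find turning points, then convert them to percentage changes from the previous point) can be sketched with the standard library alone. This is a simplified stand-in, not a reimplementation of argrelextrema: turning_points uses strict comparisons and skips the first and last `order` points, both function names are hypothetical, and the prices are made-up numbers.

```python
def turning_points(prices, order=1):
    """Indices of strict local minima or maxima within `order` neighbours each side."""
    indices = []
    for i in range(order, len(prices) - order):
        neighbours = prices[i - order : i] + prices[i + 1 : i + order + 1]
        is_max = all(prices[i] > v for v in neighbours)
        is_min = all(prices[i] < v for v in neighbours)
        if is_max or is_min:
            indices.append(i)
    return indices


def percent_changes(points):
    """Percentage change of each point relative to the point before it."""
    return [(b - a) / a * 100.0 for a, b in zip(points, points[1:])]


# Made-up opening prices, for illustration only.
prices = [100.0, 101.0, 103.0, 102.0, 100.0, 104.0, 108.0, 107.0]

idx = turning_points(prices, order=1)
extrema = [prices[i] for i in idx]
changes = percent_changes(extrema)

print(idx)      # positions of the turning points in prices
print(extrema)  # the prices at those positions
print(changes)  # one value fewer than extrema
```

The resulting list of changes is the kind of one-dimensional array that Pattern and Data hold in this thread, so it can be passed to the distance profile search directly.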

So for now I'll use the function you mentioned, and in the meantime I'll keep going through the docs to see if there are (or will be) other functions that can help me with this.

Cheers!

EDIT: I'm sorry if I'm being a bit slow on this, but I have one more doubt:
So the output of the distance profile is:

[2.92781934e+00 4.55434915e+00 4.21544139e+00 3.29336127e+00
 4.72614564e+00 2.94202855e+00 3.33790488e+00 4.62672866e+00
 6.82856991e-08 4.51937582e+00 3.47144433e+00 4.17966567e+00
 3.26871969e+00 4.72146046e+00 2.53070957e+00 4.46398626e+00
 3.64503919e+00 2.64282983e+00 4.81577841e+00 2.69799924e+00
 4.64286098e+00 2.67446216e+00 4.52739326e+00 2.54663088e+00
 3.77556508e+00]

And the smallest value is 2.53, which is at index 14. So that means, if I understand correctly, that the most similar subsequence is from index 14 to index 21 of my Data array. But since the pattern is:
Pattern = np.array([7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172])

And the target is
Data = np.array([2.1502119927316805, -2.282834272161288, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 1.3121546961326038])

Shouldn't the most similar subsequence be at index 8, since at that index there are exactly the same values as in Pattern? Thanks again!

seanlaw (Maintainer) · 2020-07-13T11:39:44Z

> Shouldn't the most similar subsequence be at index 8, since at that index there will be exactly the same values of Pattern? Thanks again!

Sorry, you are absolutely right! In my haste, I misread the values from your distance_profile output and missed 6.82856991e-08 (and only looked at the first few significant digits). I have updated my response from above to reflect this.

Jacks349 (Author) · 2020-07-13T12:37:31Z

Thank you a lot! Everything is clearer now!

seanlaw (Maintainer) · 2020-07-13T13:17:01Z

@Jacks349 I also highly recommend this resource from the original matrix profile authors, which outlines different things that you can do with the matrix profile. I think this will inspire other ideas for your exploration.

Jacks349 (Author) · 2020-07-13T13:54:26Z

@seanlaw
This resource is incredibly useful for what I'm doing; it's exactly what I'm looking for! Thank you very much. Looking forward to finding even more material on these matters.
