Cannot Reproduce (By Hand) the Matrix Profile Values Generated with the Stump Algorithm #105
Replies: 5 comments
-
@ssget2sumit Thank you for your question. Can you please provide some code to demonstrate what you are trying to do? STUMPY takes two equal-length chunks (aka "subsequences") from a longer time series and computes the z-normalized Euclidean distance between them. This is repeated via a sliding window process. A more thorough explanation can be found in this Matrix Profile Tutorial but note that, for simplicity, what is shown/calculated there is only the Euclidean distance, so you would want to apply a z-normalization yourself to each chunk. Also, note that a self-join (comparing a time series to itself) additionally involves adding an exclusion zone in order to avoid trivial self-matches.
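By hand, the comparison of two chunks might look something like this (a minimal sketch with made-up numbers, not STUMPY's internal code):

```python
import numpy as np

def z_norm(x):
    # normalize a chunk relative to its OWN mean and standard deviation
    return (x - np.mean(x)) / np.std(x)

# two equal-length chunks ("subsequences") taken from a longer time series
chunk_a = np.array([1.0, 3.0, 2.0, 4.0, 5.0])
chunk_b = np.array([10.0, 30.0, 20.0, 40.0, 50.0])

# z-normalize each chunk independently, then take the plain Euclidean distance
dist = np.linalg.norm(z_norm(chunk_a) - z_norm(chunk_b))
print(dist)  # ~0.0 -- the chunks have identical shape once z-normalized
```

Sliding that comparison across every starting position in the longer series, and keeping the smallest distance found for each position, is what produces the matrix profile.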
-
For completeness, here's a super slow and naive implementation taken from tests/test_stump.py, compared with the output from `stumpy.stump`.
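Something along these lines (a rough sketch rather than the verbatim test code; the default exclusion zone of `ceil(m / 4)` for the self-join is an assumption):

```python
import numpy as np
import stumpy

def z_norm(x):
    return (x - np.mean(x)) / np.std(x)

def naive_stump(T, m):
    # Brute-force self-join: compare every subsequence against every other one
    n = len(T) - m + 1
    excl_zone = int(np.ceil(m / 4))  # assumed default exclusion zone
    P = np.full(n, np.inf)
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= excl_zone:
                continue  # skip trivial self-matches
            d = np.linalg.norm(z_norm(T[i : i + m]) - z_norm(T[j : j + m]))
            P[i] = min(P[i], d)
    return P

T = np.random.rand(64)
m = 10
mp = stumpy.stump(T, m)

np.testing.assert_almost_equal(naive_stump(T, m), mp[:, 0].astype(float), decimal=6)
```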
You should be able to see that this assertion succeeds. Let me know if you have any further questions and I'd be happy to clarify/help where possible.
-
Thanks, Sean, for your quick support.
Wishing you a very Happy New Year 2020!
Case 1:
Suppose we have a long time series from 02-Jul-2018 to 08-Sep-2018.
It contains 69 observations.
I need to find a pattern for my input data, as shown below.
Input data contains 10 observations:
07-Jul-18 10852.9
08-Jul-18 10947.25
09-Jul-18 10948.3
10-Jul-18 11023.2
11-Jul-18 11018.9
12-Jul-18 10936.85
13-Jul-18 11008.05
14-Jul-18 10980.45
15-Jul-18 10957.1
16-Jul-18 11010.2
Business use case: I need to find the pattern of my input within the longer
data set [demo_data.csv].
First question: when we say z-normalization, which of the following do we mean?
1. We z-normalize the whole data set, which contains 69 observations (PFA demo_data.csv).
After z-normalization we take the normalized input data set (10 observations)
and compare it against the normalized longer data set in chunks of 10.
My concern is that if we z-normalize the whole data set, some of the data may
have much higher values, so the variance is also larger.
or
2. We z-normalize each chunk of 10 observations separately.
Example: normalization of the first 10 observations, i.e.
02-Jul-18 10657.3
03-Jul-18 10699.9
04-Jul-18 10769.9
05-Jul-18 10749.75
06-Jul-18 10772.65
07-Jul-18 10852.9
08-Jul-18 10947.25
09-Jul-18 10948.3
10-Jul-18 11023.2
Then normalization of the next 10 observations, i.e.
11-Jul-18 11018.9
12-Jul-18 10936.85
13-Jul-18 11008.05
14-Jul-18 10980.45
15-Jul-18 10957.1
16-Jul-18 11010.2
17-Jul-18 11084.75
18-Jul-18 11134.3
19-Jul-18 11132
20-Jul-18 11167.3
Then we calculate the Euclidean distance between the z-normalized chunks
obtained.
Second question:
Why are we using the Euclidean distance, since it is a one-to-one
mapping? Instead of that, could we use the Dynamic Time Warping (DTW) distance?
Please advise.
Regards,
Sumit
-
Regarding question 1, it is the latter. We z-normalize each 10-observation chunk separately (independently, relative to its own mean and standard deviation computed from those 10 observations) and then compute the Euclidean distance between the two z-normalized chunks. This is performed as a sliding window across the entire time series. Regarding question 2, STUMPY is typically used for finding patterns and anomalies within your time series (many local comparisons, producing many output values). The research has shown that the z-normalized Euclidean distance is sufficient for this job. Now, there is something called MPdist for comparing two time series globally and returning a single distance value, but it has not been implemented yet.
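In other words, using the numbers you posted above (a minimal sketch, not STUMPY's internal code):

```python
import numpy as np

def z_norm(x):
    # each chunk is normalized against ITS OWN mean and standard deviation,
    # not the mean/std of the whole 69-observation series
    return (x - np.mean(x)) / np.std(x)

# your 10-observation input pattern (07-Jul-18 through 16-Jul-18)
query = np.array([10852.9, 10947.25, 10948.3, 11023.2, 11018.9,
                  10936.85, 11008.05, 10980.45, 10957.1, 11010.2])

# one 10-observation chunk from the longer series (11-Jul-18 through 20-Jul-18)
chunk = np.array([11018.9, 10936.85, 11008.05, 10980.45, 10957.1,
                  11010.2, 11084.75, 11134.3, 11132.0, 11167.3])

# this is the distance that contributes to the matrix profile for this pair
print(np.linalg.norm(z_norm(query) - z_norm(chunk)))
```

Normalizing the whole 69-observation series first and then taking pairwise distances (your option 1) will generally give different values, which is likely why your hand calculation did not match the matrix profile.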
-
@ssget2sumit Closing this for now. Feel free to re-open if you have any further questions.
-
I am not able to match the matrix profile values by hand.
Apart from this, I need to know whether z-normalization needs to be done on the whole data set,
or whether it needs to be run on chunks of data based upon the window size (z-normalization on chunks of data).
I tried to match the matrix profile values when the whole data set is normalized and then the pairwise distance calculation is done.
I also tried normalizing the chunks of data and then doing the pairwise distance calculation.