Matrix profile (MPDist, motif discovery) applied to conversation patterns #149
Replies: 71 comments
-
@codyschank Thank you for posting this interesting application. I am by no means an expert on time series analysis, but I certainly don't mind offering any suggestions that come to mind (though be aware that they may be bad suggestions). To better understand, I have some questions for you:
IIRC, MPdist is a metric and not a distance measure, and I vaguely remember the paper pointing out something related to the triangle inequality (but I could be misremembering). If you look at our performance plot, you can get a ballpark estimate of the resources you'll need for your computation. Not to speak ill of other implementations, but our (likely biased) timing calculations compared with matrixprofile-ts suggest that you'll get around a 5x speedup on a pair of modest CPUs with STUMPY, and even more with GPUs (a potential 30x speedup). Ultimately, GPUs are your best bet, and STUMPY is fairly easy to install and run on them.
-
Off the top of my head, I wonder if you could consider concatenating all of the calls into one long call. You may also want to think about strategies for reducing the size of the data, say, recording a single (+1 or -1) step for every 5-10 words spoken by an agent/customer rather than for each individual word. That might get you into a regime where you have an order of magnitude less data to analyze (at least as a first pass), and it might help you quickly identify the top regions to start looking for motifs. Afterwards, you can go back to the full resolution and perform a higher-resolution matrix profile calculation that directly compares a prospective motif (plus or minus some additional data points) with the entire concatenated sequence. Additionally, I wonder if you could benefit from the annotation approach published by Eamonn's group. Unfortunately, this is not implemented in STUMPY, as I haven't had time to go through it thoroughly.
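A minimal sketch of the downsample-then-concatenate idea, assuming the raw per-word +1/-1 step sequence for each call is available (the function names are my own, not STUMPY's; STUMPY does treat np.nan as missing data, so NaN gaps keep a motif from spanning two different calls):

```python
import numpy as np

def downsample_steps(steps, block=5):
    """Collapse the raw per-word +1/-1 sequence into one net step per
    `block` words, shrinking the series by roughly a factor of `block`."""
    steps = np.asarray(steps, dtype=float)
    n_blocks = len(steps) // block
    return steps[:n_blocks * block].reshape(n_blocks, block).sum(axis=1)

def concatenate_calls(calls, gap=np.nan):
    """Join several calls into one long series, separated by NaN gaps so a
    motif cannot span two different calls (STUMPY treats NaN as missing)."""
    pieces = []
    for c in calls:
        pieces.append(np.asarray(c, dtype=float))
        pieces.append(np.array([gap]))
    return np.concatenate(pieces[:-1])  # drop the trailing gap
```

The coarse series could then be scanned for candidate motif regions before returning to full resolution.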
-
Is there any way to perform a sanity check to make sure that this is a viable approach? I know Leland McInnes (co-creator of HDBSCAN) fairly well and he is also familiar with STUMPY, so we may be able to ask him to chime in once we've done some due diligence. It may be possible that the calls that you've analyzed are just fundamentally different and that we are potentially searching for a signal that isn't there. I'm not trying to draw conclusions here but just trying to keep an open/skeptical mind. 😄
-
Btw, there is an existing issue for adding an example of MPdist in the Tutorials. Would you be interested in taking a stab at it? We try to base our Tutorials on the original papers (i.e., we try to reproduce a figure from the original paper). This would be greatly appreciated and I am here to assist!
-
Definitely, I can help with that. I noticed that existing issue and thought about offering to help. Looks like I should be able to get the data here: https://sites.google.com/site/mpdistinfo/
-
Let me know if you have any ideas for how to perform a sanity check. I'm also trying to approach this with skepticism and not just expecting that it will work on these data. I would expect there to be some patterns that repeat over and over, due to scripts that the agents follow and common problems that customers call in with. However, it does seem possible that the patterns would be different enough that the methods would not work to find clusters of similar calls. Your suggestion to plot calls with high and low MPDist seems like a good way to start investigating this.
-
Concatenating the time series together does seem like a good approach. My expectation is that there are at least 3 general motifs: the line going up (customer doing most of the speaking), the line going down (agent doing most of the speaking), and a back-and-forth. The ultimate goal for me, aside from recognizing these motifs, is to note where they occur in each call. The challenge, I think (similar to the clustering), is fitting the "model" to a set of original data and then applying the model as new data comes in, without retraining (I don't know if this analogy makes sense).
-
I'm a big proponent of looking at the individual raw time series that should "look" similar based on the MPdist value. Unfortunately, there is no substitute for this. Also, I'm wondering how much of a factor the (sliding) window size might play. Essentially, you'll have to figure out what an appropriate window size is for capturing a motif, and this will require some domain expertise. Perhaps the window size is, say, the median number of words needed (in total, from both speakers) for the customer to have spoken two or three times. Your guess is probably better than mine.
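That heuristic could be operationalized along these lines, assuming the transcripts are available as ordered (speaker, word_count) turns -- a hypothetical data layout, not anything from the thread:

```python
def window_from_turns(transcripts, customer_turns=2):
    """Median number of words (both speakers combined) it takes for the
    customer to complete `customer_turns` speaking turns, as a candidate
    sliding-window size.

    `transcripts` is a list of calls; each call is an ordered list of
    (speaker, word_count) turns, with speaker 'customer' or 'agent'.
    """
    spans = []
    for call in transcripts:
        turns, words = 0, 0
        for speaker, n in call:
            words += n
            if speaker == "customer":
                turns += 1
                if turns == customer_turns:
                    spans.append(words)
                    turns, words = 0, 0
    if not spans:
        raise ValueError("no complete customer spans found")
    spans.sort()
    mid = len(spans) // 2
    return spans[mid] if len(spans) % 2 else (spans[mid - 1] + spans[mid]) // 2
```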
-
Yes, this makes sense. I sort of alluded to this above regarding the sliding window size. If your window size is too small, then you won't capture "interruptions" or "changes" in speaker. If I understand correctly, "back-and-forth" would look like many small peaks and troughs. Is that correct? Not to dissuade you from using STUMPY, but have you already tried applying simple statistical approaches to this? You may be able to segment each of your time series into monotonically increasing/decreasing sections (i.e., where only one person is speaking) and then ignore changes in monotonicity that are too small or abrupt. Hmmm, I haven't thought this through thoroughly, but I wonder if a Bayesian approach may be useful. I vaguely recall this example from Cam Davidson-Pilon's book, and I'm wondering if you could build models this way. Again, this is completely unverified and may be an unnecessary rabbit hole :)
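The monotonic-segmentation idea might look something like this on the cumulative +1/-1 series (a sketch; the `min_len` threshold for ignoring brief interruptions is a made-up knob that would need tuning):

```python
def monotone_segments(ts, min_len=3):
    """Split a cumulative word-balance series into monotone runs (one
    speaker dominating), absorbing direction changes shorter than
    `min_len` steps into the preceding run as brief interruptions.
    Returns (start, end, direction) triples over step indices, end
    exclusive. Steps in a +1/-1 cumulative series are never flat, so
    each step is strictly up (+1) or down (-1)."""
    dirs = [1 if b > a else -1 for a, b in zip(ts, ts[1:])]
    if not dirs:
        return []
    runs, start = [], 0
    for i in range(1, len(dirs)):
        if dirs[i] != dirs[i - 1]:
            runs.append([start, i, dirs[i - 1]])
            start = i
    runs.append([start, len(dirs), dirs[-1]])
    merged = [runs[0]]
    for run in runs[1:]:
        if run[1] - run[0] < min_len:
            merged[-1][1] = run[1]  # too short: treat as an interruption
        else:
            merged.append(run)
    out = [merged[0]]
    for run in merged[1:]:  # re-join neighbors that now share a direction
        if run[2] == out[-1][2]:
            out[-1][1] = run[1]
        else:
            out.append(run)
    return [tuple(r) for r in out]
```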
-
def MPDist(tsa, tsb, m):
    mp_ab = stumpy.stomp(T_A=tsa, T_B=tsb, m=m, ignore_trivial=False)
    mp_ba = stumpy.stomp(T_B=tsb, T_A=tsa, m=m, ignore_trivial=False)
    mp_abba = np.concatenate((mp_ab[:, 0], mp_ba[:, 0]))
    n = tsa.shape[0] + tsb.shape[0]
    k = int(round(2 * n * 0.05, 0))
    MPdist = np.sort(mp_abba)[k]
    return MPdist

You are almost right about the measure vs. metric. A screenshot of the relevant passage from the paper is attached. Seems like this might make MPDist a bad distance measure to use for clustering?
-
I have been wondering about this, and I'll probably start investigating this alternative soon. Perhaps by calculating the slope of the time series within a moving window.
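A moving-window slope could be computed with an ordinary least-squares fit per window, e.g. (a sketch; the interpretation comment assumes the +1/-1 customer/agent convention used above):

```python
import numpy as np

def rolling_slope(ts, w):
    """Ordinary least-squares slope of each length-w window: positive
    while the customer is gaining the floor, negative for the agent."""
    ts = np.asarray(ts, dtype=float)
    x = np.arange(w) - (w - 1) / 2.0   # centered window index
    denom = (x ** 2).sum()
    return np.array([(x * (ts[i:i + w] - ts[i:i + w].mean())).sum() / denom
                     for i in range(len(ts) - w + 1)])
```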
-
120 seconds or 120 words? You mentioned using transcripts, but your plots above show time on the x-axis, which implies that your data takes into account pauses in talking? The code looks right to me at first glance (and is comparable to this, save for the sorting). What leads you to think that you may have misunderstood the paper?
Yes, this might be a problem, but your application of HDBSCAN seems reasonable (i.e., I would've done the same thing).
-
Regarding a sanity check, you may even take a time series,
And then see if any or all of these
-
120 seconds. The x-axis is time in seconds, and takes into account pauses.
The pseudo-code in the linked Google site for MPDist lists MatrixProfile(TA, TB) as the only initial step before sorting and grabbing the value at k. This is what I did originally, but I ended up doing both MatrixProfile(TA, TB) and MatrixProfile(TB, TA) and concatenating the results, then sorting and grabbing the value at k. I did this because the matrix profile is not symmetric, and it appeared that the second way was right based on what was written out in the paper. I just need to reread that section more carefully.
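For checking results on small inputs, the symmetrized construction described above (concatenate the AB-join and BA-join profiles, sort, take the value at k) can be written naively in NumPy without STUMPY. This is an O(n²) reference sketch under my own naming, slow but easy to verify against:

```python
import numpy as np

def _znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def _join_profile(ta, tb, m):
    """AB-join: for each length-m subsequence of ta, its z-normalized
    Euclidean distance to the nearest subsequence of tb."""
    subs_b = np.array([_znorm(tb[j:j + m]) for j in range(len(tb) - m + 1)])
    prof = np.empty(len(ta) - m + 1)
    for i in range(len(ta) - m + 1):
        q = _znorm(ta[i:i + m])
        prof[i] = np.sqrt(((subs_b - q) ** 2).sum(axis=1)).min()
    return prof

def mpdist(ta, tb, m, frac=0.05):
    """Concatenate the AB-join and BA-join profiles, sort, and take the
    value at k = frac * (len(ta) + len(tb)), clipped to the array."""
    ta, tb = np.asarray(ta, dtype=float), np.asarray(tb, dtype=float)
    p_abba = np.concatenate([_join_profile(ta, tb, m),
                             _join_profile(tb, ta, m)])
    k = min(int(frac * (len(ta) + len(tb))), len(p_abba) - 1)
    return np.sort(p_abba)[k]
```

A quick sanity check is that a series compared against itself should give an MPdist of (numerically) zero.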
-
Found a bug in the MPDist function (but still not 100% sure I have it right according to the pseudocode). Previously I was assigning T_A to tsa and T_B to tsb in both of the first two lines. Rerunning things to see what kind of clusters I get. The function is corrected below.

def MPDist(tsa, tsb, m):
    mp_ab = stumpy.stomp(T_A=tsa, T_B=tsb, m=m, ignore_trivial=False)
    mp_ba = stumpy.stomp(T_A=tsb, T_B=tsa, m=m, ignore_trivial=False)
    mp_abba = np.concatenate((mp_ab[:, 0], mp_ba[:, 0]))
    n = tsa.shape[0] + tsb.shape[0]
    k = int(round(2 * n * 0.05, 0))
    MPdist = np.sort(mp_abba)[k]
    return MPdist
-
I think the referenced issue addresses a completely different problem; at the least, the implementation will be very different. However, an unnormalized matrix profile might be interesting, though I can't say how difficult it would be to implement, because the distance calculation would need to change quite a bit.
-
@codyschank Your application very loosely reminds me of this article from a while ago. You may find inspiration in this story.
-
I just read through the paper for removing noise from the matrix profile, and while it seems like it might help, I do think an un-normalized distance would probably do a better job with my use case. I did a little more searching and this paper came up: https://arxiv.org/abs/1901.05708 With an implementation by @tylerwmarrs here:
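A naive self-join version of the non-normalized idea, in the spirit of AAMP (this is my own O(n²) sketch, not @tylerwmarrs's implementation): drop the z-normalization and use plain Euclidean distance between raw subsequences, with an exclusion zone so a subsequence never matches itself.

```python
import numpy as np

def aamp_profile(ts, m):
    """Self-join matrix profile under plain (non-z-normalized) Euclidean
    distance: for each subsequence, the distance to its nearest
    non-trivial neighbor. Naive O(n^2) version."""
    ts = np.asarray(ts, dtype=float)
    n = len(ts) - m + 1
    subs = np.array([ts[i:i + m] for i in range(n)])
    excl = max(1, m // 2)  # exclusion half-width around each index
    prof = np.empty(n)
    for i in range(n):
        d = np.sqrt(((subs - subs[i]) ** 2).sum(axis=1))
        d[max(0, i - excl):min(n, i + excl + 1)] = np.inf  # skip trivial matches
        prof[i] = d.min()
    return prof
```

With a repeated raw-valued pattern planted at two distant positions, the profile should drop to zero at both, even when the pattern would not stand out after z-normalization.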
-
I just now noticed this closed issue: #157. So maybe this would not work. Though it's not speed I'm worried about, but the results of the motif discovery based on the matrix profile.
-
I was just about to point out that issue and you beat me to it :)
-
Despite the findings of @tylerwmarrs related to the AAMP algorithm, I am still very much focused on calculating a matrix profile based on non-normalized Euclidean distance. I've been looking through the guts of the STUMPY code, and today I took my first stab at making changes. It seems to me like I would need to make the changes to https://ieeexplore.ieee.org/document/8392419 And specifically, this section: My thinking was that I could just substitute this formula for this one
The problem, I guess, is that I don't have sub-sequence A and B in
-
@codyschank Would you mind if we started this conversation in a separate Github issue? This way, it is focused specifically on non-normalized Euclidean distance and it won't distract from this issue. I understand they are related to the same underlying problem that you are trying to solve but I would like to limit the scope of conversations so that it is easier to cross-reference in the future. I really appreciate it!
-
Totally, I was wondering about that. This thread has grown rather long! I'll copy the last comment to a new issue.
-
While working through #207, I had the realization that what would work well for this data is not a matrix profile based on non-normalized Euclidean distance. Ideally, when the distance between two subsequences is calculated, they would both start at 0. Does it make sense why I would want to do this? Are you aware of any applications of the matrix profile that attempt this? I'm not even sure what to google in searching for an application like this.
-
@codyschank this might give you some inspiration.
-
@codyschank Sorry, I'm not following what you are saying here. Care to elaborate or provide a simple example?
-
@seanlaw This is one example of a motif set using the regular (z-normalized) matrix profile. Note I have adjusted these subsequences to start at 0 when plotting them, to make it easier to compare how well they match: I don't really like that the orange subsequence gets lumped with the red subsequence. The magnitude of the shapes is very different (approx. -40 vs. -3); it is the z-normalization of the matrix profile that puts them together. Here are the same subsequences without me plotting them all starting at zero. If I used the non-normalized Euclidean distance, orange and purple would have one of the larger distances, even though they are quite similar if you plot them starting from zero. The best result for me would be to have green, brown, and red in one motif, and orange, purple, and blue in another. The example is not perfect, but I hope it gets the point across. It is one that I could provide quickly.
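The "both start at 0" comparison can be expressed as subtracting each subsequence's first value before taking the Euclidean distance, so vertical offset is ignored but amplitude still matters. A small sketch contrasting it with z-normalized distance (the example arrays are invented, roughly mimicking the -40 vs. -3 dips described above):

```python
import numpy as np

def anchored_dist(a, b):
    """Euclidean distance after shifting both subsequences to start at 0:
    vertical offset is ignored, but amplitude still matters."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt((((a - a[0]) - (b - b[0])) ** 2).sum())

def zdist(a, b):
    """Ordinary z-normalized Euclidean distance, for comparison."""
    z = lambda x: (x - x.mean()) / x.std()
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(((z(a) - z(b)) ** 2).sum())

big = np.array([0.0, -20.0, -40.0, -20.0, 0.0])  # deep dip (like the red/orange -40 case)
small = big * 0.075                               # same shape, shallow dip (approx -3)

# z-normalization considers big and small identical; the anchored
# distance separates them, while ignoring a pure vertical offset.
print(zdist(big, small))                # ~0
print(anchored_dist(big, small))        # large
print(anchored_dist(big, big + 100.0))  # 0: offset alone is ignored
```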
-
@codyschank Ahhh, got it. Thank you. I will think about this.
-
@codyschank Do you mind if we close this for now? And feel free to start a new issue (referencing this one) if there are follow-ups? Thanks!
-
Sounds good to me. I'll close it.
-
I have a data set of voice transcriptions from tech support calls between customers and tech support agents. I've converted these transcripts into time series as follows: every time a word is said by the customer, the series goes up (+1); every time a word is said by the agent, it goes down (-1). I've attached a few examples of what these time series look like (each image is the time series described above for a single phone call between agent and customer).
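That conversion is just a cumulative sum over per-word +1/-1 steps; a minimal sketch, assuming the transcript is available as an ordered list of per-word speaker labels (a hypothetical input format):

```python
def call_to_series(word_speakers):
    """Cumulative word-balance series for one call: +1 for each
    customer word, -1 for each agent word, in spoken order."""
    series, balance = [], 0
    for speaker in word_speakers:
        balance += 1 if speaker == "customer" else -1
        series.append(balance)
    return series
```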
My initial goal was to cluster the calls based on the similarity of their time series, and I discovered the KShape algorithm in my search for an appropriate method. When I presented this idea to a colleague, they suggested that I look into the matrix profile, which appears to be a very promising technique for analyzing these data. That led me to discovering the MPDist algorithm.
I used the STUMPY package to implement a basic MPDist function, but it is slow (~10 hours) to run across even a small set of these time series (n=500), due to all the pairwise comparisons between time series that are needed, and the results did not appear to form interesting clusters (using HDBSCAN). One clear advantage of MPDist over KShape is that it can handle time series of different lengths, which is a requirement for my data.
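For reference, the pairwise step might look like the following, with any MPdist implementation plugged in as `dist_fn`; the resulting matrix can then be handed to HDBSCAN with metric='precomputed' (the helper name is my own):

```python
import numpy as np

def pairwise_mpdist(series_list, dist_fn):
    """Symmetric pairwise distance matrix in the form a clusterer such as
    HDBSCAN (metric='precomputed') expects. dist_fn(a, b) can be any
    distance between two series, e.g. an MPdist implementation; since
    MPdist handles unequal lengths, no padding is required."""
    n = len(series_list)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist_fn(series_list[i], series_list[j])
    return D
```

Only the upper triangle is computed (n(n-1)/2 calls), which is still the dominant cost when each call is itself an O(n²) matrix profile computation.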
I'm also interested in discovering motifs that are common across a set of these time series. I could see how to apply the motif discovery across a single time series, but applying the same tools to a set of time series was not clear to me.