Building and Training Automated Topic Labelling Models
In this video Alexandra shows how to automatically classify massive amounts of historical failure data.
VIDEO TRANSCRIPT
I see event labeling as a multi-step process. The first step is, as always, pre-processing. Everybody knows: low-quality data in, low-quality results out.
I live in Norway. We also have clients in Germany, and we recently worked with data from Azerbaijan, so it was in Azeri. You need everything in one common language.
The first thing is to identify the language. Then do a spell check, because typos are pretty normal and the translator doesn't react well to them. You translate everything to English and then start tokenizing, splitting it up into sentences and words. If you think about the maintenance record we had on the previous slide, one token could be flare valve, another one could be internal leakage.
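A minimal sketch of that pre-processing step in Python, assuming the langdetect package; no specific translation or spell-check tools are named in the talk, so those two steps are placeholders here:

```python
# Pre-processing sketch: language detection, placeholder translation and
# spell correction, then simple word tokenization.
import re
from langdetect import detect

def translate_to_english(text: str) -> str:
    # Placeholder: in practice, call a translation service here.
    return text

def spell_correct(text: str) -> str:
    # Placeholder: in practice, run a spell checker such as pyspellchecker.
    return text

def preprocess(record: str) -> list[str]:
    lang = detect(record)                      # e.g. "de", "az", "en"
    if lang != "en":
        record = translate_to_english(record)
    record = spell_correct(record)
    # Simple word tokenization; a real pipeline might use nltk or spaCy here.
    return re.findall(r"[a-z0-9]+", record.lower())

print(preprocess("Flare valve internal leakage"))
```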
Then start assigning parts of speech. Parts of speech are important if you're looking to label equipment, because equipment tends to be indicated by nouns, while failure modes are usually verbs or verbal nouns. You can use these parts of speech to give an indication of the probability of what type of label you should assign.
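One way to sketch that noun-versus-verb signal, assuming spaCy and its small English model (installed with `python -m spacy download en_core_web_sm`):

```python
# Use part-of-speech tags as a weak hint for label type:
# nouns suggest equipment, verbs suggest failure modes.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Flare valve is leaking internally")

for token in doc:
    if token.pos_ in ("NOUN", "PROPN"):
        print(token.text, "-> noun, likely equipment")
    elif token.pos_ == "VERB":
        print(token.text, "-> verb, likely a failure mode")
```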
Then you can use clustering. There's a wide variety of clustering algorithms, and you can use them to find groups of notifications or events that are semantically similar, the ones describing the same concepts. Then you can look at the output from each of those clusters, start to see the features, and maybe start to decide what the labels are. The thing is, if you're going to have a subject matter expert do this, they're not data scientists and it's really difficult to optimize the number of clusters. Maybe they choose to do 10 clusters, but then you see, okay, five of them are obvious, but the other five probably have sub-clusters. Then the user could do this in an iterative way: label the clusters they deem good with the labels they can clearly see from the features of the cluster, send the other ones back, and continuously whittle it down.
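A minimal clustering sketch with scikit-learn, using TF-IDF features plus K-means and printing the top terms per cluster so an expert can judge the labels; the example notifications are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

notifications = [
    "flare valve internal leakage",
    "gas leak detected at flange",
    "valve slow to close on demand",
    "defective lighting in pump room",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(notifications)

km = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = km.fit_predict(X)

# Show the highest-weighted terms in each cluster centre.
terms = vectorizer.get_feature_names_out()
for i, centre in enumerate(km.cluster_centers_):
    top = centre.argsort()[::-1][:3]
    print(f"cluster {i}:", [terms[j] for j in top])
```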
What could this process look like in practice? The ships, rigs, and power plants could batch upload all of their event data that doesn't have any labels. You could do the pre-processing step, tokenizing and translating, then make clusters around equipment or failure modes, depending on what the user wanted, and return those to the user. The user decides which clusters are bad and sends those back for re-clustering; the ones that seem good get labeled, and then the event data is updated. This maybe takes three or four iterations until they're satisfied with the results.
I want to show you how we've turned this into an application, so how it might look in practice. You could upload a CSV file and it would be processed. You can choose the number of clusters; I'm going to say six here. You can also choose different algorithms: maybe K-means, or LDA if you want to do topic mining, or even density-based clustering. I'm going to use K-means. I like K-means. It does all of this processing in the background and returns six word clouds. I do know that word clouds are really passé; I went to a text data visualization session yesterday, so I have a lot of other ideas about how we could change that.
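A rough sketch of the word-cloud view, assuming the wordcloud and matplotlib packages; the `clustered` dictionary stands in for the clustering output from the previous step:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Invented stand-in for "all text belonging to each cluster".
clustered = {
    0: "leakage leak valve flange oil leakage",
    1: "lighting defective lamp lighting defective",
}

for cluster_id, text in clustered.items():
    wc = WordCloud(background_color="white").generate(text)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {cluster_id}")
plt.show()
```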
You can see that each cluster has different features associated with it, and you see that leakage occurs in 98% of this one. Maybe the user wants to see the individual notifications that were within it. They can start to see, okay, this looks like a leakage cluster; this is probably the failure mode leakage.
So, I might call this one leakage. I'll save it. This one is all about defective lighting. That's not really interesting, but okay: defective lighting. This one has the words pressure, switch, transmitter, so it seems like it's all about switches and transmitters. I'm just going to write defective switches. Save. This one looks really generic. It has replace, error, REP (that's probably replace), defective, replace defective. It's too generic to make a concrete decision on what the failure mode is, so I'm going to call it a bad cluster. Same thing for this one: damaged, repair, corrosion, too broad. I don't want to label any data with that.
This one, okay, this is slow close time. A lot of valves can have this failure mode, where the valve just isn't closing quickly enough. I'm just going to say slow close. Save.
I can re-cluster, and what that will do is re-cluster these two clusters that I called bad. I want to make four clusters, so we're going to use K-means again, process those clusters, and get some new output. Now I have four clusters, and I start to see, okay, some of them look a little bit better. I can see the results from my last iteration, and you can continue this way until eventually you've labeled everything. I think we started with maybe 5,000 rows at first. We labeled 1,200, and now we're going to label the last 3,000. This is a way to keep humans in the loop: you can empower subject matter experts at your clients to label their own data without asking them to go through thousands of records line by line.
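One re-clustering pass could look roughly like this; the rows and the bad-cluster ids are invented for illustration, and the idea is simply to pull out the rows from rejected clusters and cluster them again:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

rows = [
    ("replace defective part", 4),      # (text, cluster id from the last pass)
    ("damaged repair corrosion", 5),
    ("repair damaged support", 5),
    ("error on replace", 4),
    ("leak at flare valve", 0),
]
bad_clusters = {4, 5}                   # clusters the expert rejected

to_recluster = [text for text, cid in rows if cid in bad_clusters]
X = TfidfVectorizer().fit_transform(to_recluster)
new_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(list(zip(to_recluster, new_labels)))
```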
After I've asked somebody to label all of this data, what next? How can we capitalize on this knowledge and the time they put into it, so that the next time they get new data, they don't have to label it? Well, once the user is pleased with the output from this labeling, you can use it to train a prediction algorithm. You can train the model, create the model.
The model that I like to use is called label spreading, available in scikit-learn. You can use this to predict labels on new data. What this model does, maybe in this case, is learn what four different types of labels look like: leakage, fail to start on demand, fail to close on demand, and structural deficiency.
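A minimal semi-supervised sketch with scikit-learn's LabelSpreading: expert-labeled rows carry their label index, unlabeled rows carry -1, and the texts and labels here are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

texts = [
    "oil leakage at valve flange",        # labeled: leakage
    "gas leak detected",                  # labeled: leakage
    "valve did not close on demand",      # labeled: fail to close on demand
    "small leak near pump",               # unlabeled
    "valve fails to close",               # unlabeled
]
y = [0, 0, 1, -1, -1]                     # -1 marks unlabeled rows

X = TfidfVectorizer().fit_transform(texts).toarray()
model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)

print(model.transduction_)                # labels inferred for every row
print(model.predict_proba(X))             # per-class probabilities
```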
When we see leakage, there's a very good chance it gets that label if it has the words leakage or leak in it. You see that they're assigned a probability of 99 or 96%. Then you have “fail to close on demand”. It's a pretty high probability that it's “fail to close on demand” if it says “didn't close”. You can use these probabilities to increase the reliability. You don't want to just ship the model and have it go rogue and start predicting things, and maybe predicting the wrong things, because you're starting to see new failure modes.
I've seen that incorrectly labeled data tends to have a lower distribution of probabilities, and that's probably just because it contains new words and new terms. You can make a cut on probability and incorporate this into your model deployment. Let's imagine we've already done this batch training, we've created a model, and now we're streaming in new data. The model is going to predict a label for the new data with an associated probability. If it's a high probability, like over 90% or 95% (you can decide the threshold), you're pretty confident that the label is correct, and you can just route it back to the database with the label.
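A sketch of that confidence cut, assuming the model and vectorizer from the training step above; the database and review-queue helpers are hypothetical placeholders:

```python
THRESHOLD = 0.90

def save_to_database(text, label):
    # Placeholder: write the auto-labeled record back to the event database.
    print("auto-labeled:", label, "|", text)

def queue_for_review(text, label, prob):
    # Placeholder: add the record to the next human review batch.
    print(f"needs review ({prob:.2f}):", label, "|", text)

def route(new_texts, vectorizer, model, label_names):
    X_new = vectorizer.transform(new_texts).toarray()
    probs = model.predict_proba(X_new)
    for text, p in zip(new_texts, probs):
        best = p.argmax()
        if p[best] >= THRESHOLD:
            save_to_database(text, label_names[best])
        else:
            queue_for_review(text, label_names[best], p[best])
```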
Then for anything else, anything with a low probability, below 90%, you can include the human in the loop and ask their opinion, and you can do this in a batch fashion. Maybe once a day or once a week, ask them to look at all of the notifications with a relatively low probability and either verify the label the model gave or change it. Once the user has done that, you can feed those results back in and retrain the model. Eventually, you hope, the model will be smart enough to do it on its own.