This is a project I did as a Data Science Fellow at Insight Data Science in the January 2017 session, in consultation with Fast Forward Labs.

Fast Forward Labs is in the business of building prototypes that take cutting-edge machine learning and AI research and evaluate its feasibility in the real world. They were interested in building a chatbot that incorporates some pretty cool recent methods for natural language processing.

And so I built a user-friendly and reasonably smart bot for Slack that helps users stay on topic.

Below you can find the links to the demo Slack Team where you can play with the bot (see Section 3 for details), a presentation that closely follows this blog, and the bot code on Github.

Demo Slack Team
Presentation
Code on Github


1. Why a smart policing bot?

2. Getting the data

3. Demo: Officer Slackbot in action

4. Bot brains

5. How smart is the bot?

6. Summary & what more can be done

7. About me

1. Why a smart policing bot?

Slack is a popular messaging app for teams that allows team members to discuss different topics, organized in channels. Now, when a new member joins a Slack team that’s been around for some time, they sometimes tend to post messages in the wrong channels. Nobody wants to be that nagging senior team member trying to direct the rookie to a more appropriate place.

Wouldn’t it be nice to have a smart bot that can learn topics of different channels, monitor the discussions and then warn users if they go off topic? Building such a bot is the aim of this project.

2. Getting the data

In order to build my bot and see how accurately it performs, I needed some data. Slack data is hard to come by, since it’s private. The next best thing is Reddit: its data is easily available, and its structure is similar to Slack’s, with discussions grouped into subreddits instead of channels.

For the purposes of demonstrating the model, I chose the following five topics (subreddits): Diving, Handball, Corgi, Data Science, and Machine Learning. I chose them intentionally so that some are closely related to each other while others are not (plus, they also tell you something about the things I like!).

The relevant data (submissions and comments) can then be downloaded using Reddit’s excellent API through the easy-to-use PRAW package for Python, and stored in a SQL database. See the code on my Github for details.
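In rough outline, the download-and-store step looks like the sketch below. This is a minimal illustration rather than the exact project code: the `fetch_comments` / `store_comments` helper names, the credential placeholders, and the table schema are my own, and running the PRAW part requires registering a Reddit API app for real credentials.

```python
import sqlite3

def store_comments(db_path, comments):
    """Store (subreddit, comment_body) pairs in a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments (subreddit TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO comments VALUES (?, ?)", comments)
    conn.commit()
    conn.close()

def fetch_comments(subreddit_name, limit=1000):
    """Download recent comments from a subreddit via PRAW.

    Requires Reddit API credentials (client_id, client_secret, user_agent).
    """
    import praw  # pip install praw
    reddit = praw.Reddit(client_id="YOUR_ID",
                         client_secret="YOUR_SECRET",
                         user_agent="officer-slackbot demo")
    return [(subreddit_name, c.body)
            for c in reddit.subreddit(subreddit_name).comments(limit=limit)]

# Example (requires credentials):
# store_comments("reddit.db", fetch_comments("datascience", limit=500))
```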

3. Demo: Officer Slackbot in action

To showcase my bot’s might, I made a demo Slack team – go ahead and try it out! I created a generic user; I either shared its username and password with you when I presented the project, or you can email me to ask for them.

In the demo Slack team I created 5 channels, corresponding to the 5 subreddits above, and populated those channels with the comments obtained from the corresponding subreddits. For simplicity, I focused only on comments, rather than the submissions, since they tend to be shorter, perhaps more faithfully mimicking the form of Slack messages.

To upload the Reddit data to my Slack team, I first registered 4 bot users on Slack (posing as famous characters from Seinfeld!), and used the excellent slackclient package, which allows one to communicate with Slack’s API from Python. For more details on how to build simple bots in Python, check out my code here on Github and/or have a look at a great tutorial from the Full Stack Python blog. The bot itself is hosted on AWS, constantly monitoring the discussions in the demo Slack team.

Below is a little illustration of the bot’s basic functionality, showing me entering a couple of messages in the Diving channel. As you can see, as long as the messages are vaguely related to diving, or are of generic content that could belong to any of the channels (e.g. ‘thank you’, etc.), the bot doesn’t bother me. But if I mention something more closely related to one of the other existing channels, the bot will let me know where those messages might be more appropriate.

4. Bot brains

Now that we’ve seen the bot in action, let’s see how it does all this.

The bot obviously needs to be able to compare two messages (more generally, documents): the user’s input and one of the messages already present in a channel. A standard way to compare two documents is to use the bag-of-words (BoW) approach, which assigns a sparse vector to each document, with elements related to the number of times a given word appears in the document (this includes approaches such as tf-idf as well). Once such document vectors are generated, the similarity of the two documents is measured by calculating the cosine between the corresponding vectors: higher cosine similarity indicates more similar documents.

However, problems arise when two documents share no common words, but convey similar meaning, such as in the example on the right. The BoW / tf-idf vectors of these two documents are perpendicular, yielding zero cosine similarity.
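This failure mode is easy to reproduce in a few lines of Python. The `bow_cosine` helper below is an illustrative sketch (raw word counts rather than tf-idf, no preprocessing), not the project’s actual pipeline:

```python
import math
from collections import Counter

def bow_cosine(doc1, doc2):
    """Cosine similarity between raw word-count (bag-of-words) vectors."""
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)          # shared words only contribute
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)

# Shared words -> nonzero similarity:
print(bow_cosine("scuba diving gear", "diving gear review"))        # ≈ 0.667
# Similar meaning, zero shared words -> similarity exactly 0:
print(bow_cosine("great dive spot", "awesome snorkeling location"))  # 0.0
```

Two messages about the same thing with disjoint vocabularies score exactly zero, even though a human reader would call them similar.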

4.1 Word Mover’s Distance

A way to fix this was proposed recently at the International Conference on Machine Learning in 2015 in the form of the Word Mover’s Distance (WMD) model. This model starts by representing each of the words in a document with vectors in a high dimensional (word embedding) vector space by using word2vec (w2v), a popular neural network model trained to reconstruct the semantic context of words. What that essentially means is that words that have similar meaning will have their w2v vectors close to each other, as illustrated below. I will use a pre-trained word2vec model from the Spacy package, which has been trained on the Common Crawl corpus.

Then, a natural way to estimate how dissimilar, or distant, the two documents are is to look at the distance between the corresponding word vectors and, roughly speaking, add those distances up. This metric is called the Word Mover’s Distance, because it is an instance of the well-known Earth Mover’s Distance (EMD) optimization problem, only formulated in the word embedding space.
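To make the idea concrete, here is a toy sketch of a *relaxed* variant of WMD, in which each word simply travels to its nearest counterpart in the other document; this lower-bounds the exact transportation solution of the EMD linear program. The 2-D vectors below are made up purely for illustration (real word2vec embeddings live in a roughly 300-dimensional space):

```python
import math

# Toy 2-D "embeddings"; values invented for illustration only.
VEC = {
    "dog":    (1.0, 0.1), "corgi": (0.9, 0.2), "puppy": (1.1, 0.0),
    "tensor": (0.0, 1.0), "model": (0.1, 0.9),
}

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(doc1, doc2):
    """Relaxed Word Mover's Distance: each word in doc1 moves all of its
    (uniform) weight to the single nearest word in doc2. This is a lower
    bound on the exact WMD, which solves a full transportation problem."""
    words1 = [w for w in doc1.split() if w in VEC]
    words2 = [w for w in doc2.split() if w in VEC]
    return sum(min(euclid(VEC[w], VEC[v]) for v in words2)
               for w in words1) / len(words1)

print(relaxed_wmd("corgi puppy", "dog"))     # small: semantically close words
print(relaxed_wmd("corgi puppy", "tensor"))  # large: unrelated words
```

Related words produce a small distance and unrelated ones a large distance, even when the documents share no words at all.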

What follows is an intuitive, but somewhat technical explanation of the EMD problem, important for understanding the modification of the WMD that I’ll have to implement. Click here for more details.

4.2 Modifying WMD

A practical obstacle in applying this method to our case is the fact that the EMD algorithm has a horrible time complexity: O(p^3 log(p)), where p is the number of unique words in the two documents. We would need to compare the user’s input to all of the previous messages in all the channels, calculate the average distance for each of the channels, and predict the channel with the smallest average distance as the one the user’s message should go to. If the user posted the message in the predicted channel, the bot doesn’t do anything; otherwise, it advises the user to consider posting to the predicted channel. For Slack teams that contain a lot of messages spread out over many channels, this approach cannot deliver a real-time response from the bot.

So I needed to modify the WMD method somewhat. Comparing the input message to all the messages in a given channel seems excessive: surely there are messages that are more “representative” of the channel content than the others, and it’s likely enough to compare the user input to those messages only. However, this would require expensive preprocessing, in which we essentially have to sort the channel messages using WMD as a key. But can we somehow construct a single message representative of an entire channel?

Intuitively, this could be achieved by looking at word distributions in a given channel, as shown on the right. Obviously, to a person, looking at the first, say, 10 or so words that occur most often in a channel would give a pretty good idea of what that channel is about. A single message representative of that channel should therefore contain only those 10 words. To use this message in EMD / WMD, we need to choose the weights (see previous subsection) of the vectors representing the words in it. Since the weights in a standard WMD are directly proportional to how many times a given word appears in a message, we can make the weights in our representative message proportional to the number of times a given word appears in the entire channel (and then normalize it).
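Concretely, a representative message of this sort can be built in a few lines. The `representative_message` helper name and the toy channel below are my own illustration, not the project’s exact code:

```python
from collections import Counter

def representative_message(channel_messages, n_words=10):
    """Build a single representative 'message' for a channel: the n_words
    most frequent words, with weights proportional to their counts in the
    whole channel, normalized to sum to 1."""
    counts = Counter(w for msg in channel_messages for w in msg.lower().split())
    top = counts.most_common(n_words)
    total = sum(c for _, c in top)
    return {word: c / total for word, c in top}

channel = ["corgi puppies are great", "my corgi loves walks",
           "corgi tax: photo attached"]
print(representative_message(channel, n_words=3))
```

The resulting word-to-weight mapping plays the role of the word frequencies in a single (weighted) document, and can be fed straight into the EMD / WMD computation.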

In this way we’ve constructed a single representative message for each channel, and we only need to calculate the WMD distances between the input message and each of the representative messages, find the shortest one, and predict the corresponding channel as the one the input message is supposed to go to.

5. How smart is the bot?

But is 10 words enough to form a representative message? Or is it 30? Or 100? We can find the optimal number of words, n_words, by treating it as a hyperparameter and tuning it on a validation set. Our entire corpus consists of about 5000 comments of varying length from each of the 5 channels, which will be split into a training (70%), validation (20%) and test (10%) set.

The training set will be used to infer the word distributions for each of the channels. The text will be pre-processed in a standard way (by removing punctuation, stop words, etc.), and in addition to that we’ll also use Spacy’s part-of-speech tagging to only leave the nouns, verbs and adjectives, as they are the ones carrying most of the meaning. We will also remove all the words that are not in Spacy’s pre-trained vocabulary, since we won’t have a vector representation for those (luckily, only about 2.5% of the words are not in the vocabulary).

The validation set will be used to determine the best value for n_words, which turned out to be 180.
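The tuning itself is ordinary grid search over candidate values of n_words. The function below is a schematic sketch: `build_representatives` and `classify` stand in for the real pipeline pieces (building per-channel representative messages and WMD-based classification) and are assumed interfaces of this sketch, not the project’s exact ones:

```python
def tune_n_words(build_representatives, classify, validation_set, candidates):
    """Pick the n_words value that maximizes validation accuracy.

    build_representatives(n) -> {channel: representative message}
    classify(message, representatives) -> predicted channel
    validation_set: list of (message, true_channel) pairs
    """
    best_n, best_acc = None, -1.0
    for n in candidates:
        reps = build_representatives(n)
        correct = sum(classify(msg, reps) == ch for msg, ch in validation_set)
        acc = correct / len(validation_set)
        if acc > best_acc:          # keep the first n that achieves the best accuracy
            best_n, best_acc = n, acc
    return best_n, best_acc
```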

Finally, the test set will be used to compute the final accuracy of our model. On the right we see the confusion matrix for the test set, showing the distributions of messages from their true categories over the predicted ones. The accuracy of this model is about 74%, which is pretty good, and a noticeable improvement over the 68% that one gets from the tf-idf approach with cosine similarity as the metric.

5.1 Introducing the threshold

In the confusion matrix we see some expected confusion with the closely related topics: for example, 24% of messages from the machine learning channel got misclassified as data science. But, if we look under the hood, we can see that a lot of these messages are in fact pretty generic (e.g. “thank you”) and could belong to any channel. In fact, our model picked up on that: the distances to all the channels for these messages are pretty similar, and it just happens that the distance to the data science channel was the shortest one.

We can therefore eliminate some of these messages by introducing a threshold: when the difference between the message’s distance to the channel it was posted in and its distance to the predicted channel is smaller than some value epsilon, we’ll ignore the prediction of the model and the bot won’t advise the user. To keep things simple, we will use a fixed, relative threshold for the entire corpus, although this could certainly be improved by, for example, considering different thresholds for different channels.
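The decision rule can be sketched as a small function. The `should_warn` name is hypothetical, and the relative-margin form of the epsilon test is one plausible reading of a fixed, relative threshold:

```python
def should_warn(posted_channel, distances, epsilon=0.05):
    """Decide whether the bot should warn the user.

    distances: {channel: WMD from the message to that channel's
                representative message}
    Returns the suggested channel, or None if the bot should stay quiet.
    """
    posted = distances[posted_channel]
    predicted = min(distances, key=distances.get)
    if predicted == posted_channel:
        return None                 # message is already in the best channel
    # Relative margin: near-ties are treated as generic messages.
    if (posted - distances[predicted]) / posted < epsilon:
        return None
    return predicted

print(should_warn("diving", {"diving": 1.0, "corgi": 0.5}))   # corgi
print(should_warn("diving", {"diving": 1.0, "corgi": 0.99}))  # None
```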

As is usually done, we’ll treat the threshold as a hyperparameter, and tune it on the validation set. Click here for more details.

Now that we have chosen an optimal threshold, we can apply our model to the test set, and then manually check whether all the messages our model flagged as generic are indeed generic, and use that to update the effective accuracy of the model. This results in the final accuracy of about 84%.

6. Summary & what more can be done

In summary, I’ve built a user-friendly and reasonably smart bot for Slack that helps users stay on topic.

This bot prototype can obviously be applied to platforms other than Slack, including Reddit and Stack Overflow. It can also potentially be developed into more advanced applications, including automatic email classification and perhaps even filtering out hate speech on Twitter.

This is a project I built in three weeks at Insight. That’s not a lot of time, but it’s been a lot of fun, and I’ve got a bunch of ideas I would love to implement some time in the near future:

Better estimate of bot’s performance in the real world

Starting the bot on an empty channel

More features

Augment a pre-trained word2vec

Density functions in w2v space

Making the bot more user friendly

Other approaches

7. About me

My name is Andrej Ficnar, I’m a theoretical physicist and an aspiring data scientist. I got my PhD from Columbia University in 2014, where I studied applications of string theory to real-world physical systems in order to better understand their properties. After that, I moved to the University of Oxford for a postdoc, where I got more and more interested in applying my coding and analytical skills to data-rich environments. This eventually led me back to New York in 2016, and then into Insight.

Check out my profile on Linkedin and my code on Github for more info about me.