Written by Mohith Subbarao, Software Engineering Intern – Machine Learning
Every goal you’ve ever had or ever will have in your professional career — building a business, finding a job, learning about an industry — involves using your professional network in a meaningful way. Our mission at 4Degrees is to intelligently help you maximize the opportunity of your network in innovative and genuine ways. To do that, we’ve found that it’s not sufficient to simply know your contact’s work history or what city they live in. We have to know who they really are: their areas of expertise, their interests, their passions. That’s where “tags” come in. Tags allow us to get a multi-dimensional picture of the industries and communities (VC, Sales, Data Science, etc.) that comprise your professional network. We can tell you who you should be getting referrals from, reaching out to for advice, or even partnering up with — all in the hopes of accomplishing the goals and dreams that matter to you.
It is possible to generate these tags manually, for example by having users add tags on the 4Degrees platform. But this can take several minutes per contact; when your network reaches the thousands or tens of thousands, it’s no longer scalable. This is why automating the tagging of contacts is a critical goal for us. By increasing the depth and coverage of tags, we can dramatically expand the avenues through which our platform can be an essential asset to your workflow and professional career.
To figure out a way to predict tags, we looked to Twitter. People often use Twitter as a primary form of communication with their network to post about their interests, passions, and professional work. Our hypothesis is that the content of someone’s tweets can truly help us predict which tags apply to that person.
That is why we built a machine learning (ML) classifier that predicts users’ tags based on their tweets. As our user base continues to grow, our ultimate goal is to automate the prediction of tags. This model is a fundamental part of achieving that goal.
There were 5 steps to this process:
Step 1: Importing Tweets
We started with our training data: a set of roughly 10,000 contacts and the associated tags for each contact. For these contacts, we imported tweets from Twitter using Tweepy, a comprehensive Python interface to the Twitter API. These tweets, paired with the known tags, would become our training data.
One challenge in this step was figuring out the best way to save these tweets locally, since importing ~1,000 tweets for each of 10,000 users took a long time (roughly 15 hours!). We needed this data for all subsequent steps, so saving the tweets was a must; we could not afford to re-download thousands of tweets every time we ran and tested our program. We considered creating one CSV file with all tweets from all users, or using a database, but ultimately decided to create a CSV file per user, with each row holding a single tweet. This made the data easy to both store and access. At scale, though, this approach could run into storage problems, and the pickle files we use later in the pipeline are particularly space-inefficient. Moving these files between computers also proved challenging. In the future, a database may be a better solution.
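The per-user CSV caching described above can be sketched in a few lines. The helper names and cache directory below are illustrative, not our actual code:

```python
import csv
from pathlib import Path

def save_tweets(user_id, tweets, cache_dir="tweet_cache"):
    """Write one CSV file per user, one tweet per row (hypothetical helper)."""
    Path(cache_dir).mkdir(exist_ok=True)
    path = Path(cache_dir) / f"{user_id}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for text in tweets:
            writer.writerow([text])
    return path

def load_tweets(user_id, cache_dir="tweet_cache"):
    """Read a user's cached tweets; a missing or empty file yields an empty list,
    which matches the empty-file check described in the next step."""
    path = Path(cache_dir) / f"{user_id}.csv"
    if not path.exists():
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f) if row]
```

Because each user maps to exactly one file, a run can be resumed by skipping users whose file already exists.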
We requested 200 tweets at a time until there were no more tweets to download or we had hit our per-user limit of 1,000, which we felt was sufficient to capture a user’s word frequencies. The tweet object returned by the API carries a great deal of data: time posted, user, retweet count, location, device, etc. We pulled just the text from each tweet, since we were most concerned with the content of users’ tweets. We ran into two common Twitter errors that were relatively easy to deal with, yet worth noting. The first was a RateLimitError, since the Twitter API limits the number of requests you can make in a 15-minute window. The fix was simple: we told our program to sleep for a few minutes before pulling again. The second was a TweepError, which often signaled that the user had deleted or deactivated their Twitter account. For these users, we simply wrote out an empty file; in the next step, we check whether a file is empty before iterating through it.
Step 2: Creating a Bag of Words
For each user, we now have a list of tweets. From these tweets, our next step was to create a “bag of words” (BoW): a common Natural Language Processing (NLP) representation where we store every word that appears across all tweets from all users.
We iterated through all the tweets, mapping each unique word to its frequency (number of occurrences / total words). We then sorted these words into a list in descending order of frequency, giving us a fixed order for our bag of words. Finally, we removed any words that appeared fewer than 10 times (across roughly 100,000+ tweets), which left us with about 30,000 words in our BoW.
There were a couple of challenges here. First, we needed a good list of “stop words”: words that should not be included in the BoW. Given the nature of tweets, this meant adding punctuation marks, common single characters, and Twitter-isms (rt, http, etc.). Second, we had to figure out the best way to parse the tweets. We used the Natural Language Toolkit (NLTK), a Python library for NLP, relying on two main tools: tokenization, which parses each tweet into a list of words, and lemmatization, which reduces each word to its base form so we avoid adding different forms of essentially the same word (give, giving, given, etc.). NLTK was fundamental in helping us intelligently create our bag of words. Without these tools, manually splitting each tweet into words and finding their roots would have been both difficult and inefficient, with many edge cases to consider; tokenization and lemmatization made these objectives trivial.
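As a rough sketch of the process above, here is a stripped-down bag-of-words builder. It substitutes a simple regex tokenizer and a tiny illustrative stop-word list for NLTK’s tokenization and lemmatization, so the names and cutoffs are illustrative only:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the real one was much larger.
STOP_WORDS = {"the", "a", "to", "and", "is", "rt", "http", "https"}

def tokenize(tweet):
    """Lowercase and extract word-like tokens -- a crude stand-in for
    NLTK's word_tokenize + lemmatization."""
    return [w for w in re.findall(r"[a-z']+", tweet.lower()) if w not in STOP_WORDS]

def build_bag_of_words(all_tweets, min_count=10):
    """Count every word across all tweets, drop rare words, and return the
    vocabulary sorted by descending count so the ordering is fixed."""
    counts = Counter()
    for tweet in all_tweets:
        counts.update(tokenize(tweet))
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    kept.sort(key=lambda wc: (-wc[1], wc[0]))  # count desc, then alphabetical
    return [w for w, _ in kept]
```

The fixed ordering returned here is what makes the per-user vectors in step 3 comparable to one another.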
Step 3: Creating User-Tag Vectorizations
Once we had our bag of words, we could create an array for each user-tag mapping. For each mapping, we stored an array containing the word frequencies for that user along with the associated tag. We now had our training set, i.e. the data (word frequencies for each user) that leads to a prediction (specific tag).
As we calculated these frequencies, we decided to only include users that had at least 100 tweets. We believed that having minimal data for a user could make it difficult to truly capture their word frequencies; we did not want to risk negatively skewing our overall training set.
The logic here closely followed that of step 2, except that we computed word frequencies per user rather than across all users. It was also imperative to record the frequencies in a standardized order; otherwise, the numeric values would have no contextual meaning. To accomplish this, we iterated through our bag of words and checked whether the user had used each word: if not, we stored 0 for that element; if so, we stored the user’s frequency for that word.
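The fixed-order vectorization can be sketched as follows; the function name and inputs are hypothetical:

```python
from collections import Counter

def vectorize_user(user_words, bag_of_words):
    """Map a user's tokenized tweet words onto the fixed bag-of-words order.
    Words the user never used contribute 0; others their relative frequency."""
    counts = Counter(user_words)
    total = len(user_words)
    if total == 0:
        return [0.0] * len(bag_of_words)
    return [counts[w] / total for w in bag_of_words]
```

Note that words a user tweeted that never made it into the bag of words (too rare, or stop words) are simply ignored, since we only iterate over the fixed vocabulary.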
It is worth noting that we stored our bag of words, user frequency vectors, and user tags in three separate pickle files for ease of reuse, modification, and testing.
Step 4: Testing the ML Classifier
We could now create, test, and improve our models. For each of the 12 tags we hoped to predict, we created a separate logistic regression model. As a supervised learning method, logistic regression provides a straightforward way to take in a set of data and make a binary prediction: in our case, whether or not a tag should be applied to a user. We tested the models by splitting our data into training and test subsets.
To create each model, we first took a 50/50 split of our original data, with 50% being user-tag mappings where the tag applies and the other 50% where it does not. We then set aside 85% of this data for training and 15% for testing. Using scikit-learn, we fit a model on the training data, where X_train holds the user frequency vectors and y_train the user tags.
Note: for y_train and y_test, we needed to iterate through them and change the elements to binary outputs for logistic regression to work, i.e. for the VC model, “VC” → 1, “Sales” → 0.
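Putting the 50/50 balancing, label binarization, and 85/15 split together, a simplified, dependency-free version might look like this (the function name and data layout are illustrative):

```python
import random

def balanced_split(samples, tag, train_frac=0.85, seed=0):
    """Build a 50/50 balanced, binarized dataset for one tag, then carve off
    train/test portions. `samples` is a list of (frequency_vector, tag_name)
    pairs -- a hypothetical layout for illustration."""
    pos = [(x, 1) for x, t in samples if t == tag]   # tag applies -> 1
    neg = [(x, 0) for x, t in samples if t != tag]   # tag absent  -> 0
    n = min(len(pos), len(neg))                      # equal positives/negatives
    rng = random.Random(seed)
    data = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(data)
    cut = int(len(data) * train_frac)                # e.g. 85% train, 15% test
    return data[:cut], data[cut:]
```

The balancing matters because most tags apply to only a small fraction of contacts; training on raw proportions would bias each model toward predicting 0.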
To test each model, we predicted tags on the test data and checked the model’s precision. We care about precision (TP / (TP + FP)) because we are trying to minimize false positives: we don’t want to say someone is an Engineer if they are not. At the same time, we don’t want to maximize precision alone. We also care about recall (TP / (TP + FN)), since we want to minimize false negatives as well: we want to correctly predict that someone is an Engineer. For example, predicting 3 TP, 0 FP, and 100 FN gives a precision of 100% but a terrible recall of ~3%. Finding a balance between the two was crucial.
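Both metrics are easy to compute directly from the predictions; a minimal sketch:

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    Labels are the binary 0/1 outputs described above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Running it on the worked example from the text (3 TP, 0 FP, 100 FN) reproduces the 100% precision / ~3% recall trade-off.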
To improve the models, we tried many techniques, such as changing the stop words or adding new features like tweet length. These led to only marginal improvements, and sometimes they made our models worse! The one technique that proved truly beneficial was changing the decision threshold of the classifier. The default is 50%, meaning that if the model says a specific user has a 50.4% chance of being a VC, that user will be classified as a 1. We found that setting each model’s threshold to a multiple of its median predicted probability helped us find a balance between precision and recall. We ultimately settled on ~75% precision and ~60% recall, a good ratio between TP:FP and between TP:FN.
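Thresholding on the predicted probabilities rather than the default 0.5 cutoff can be sketched like this. We interpret the median-based rule above as the median predicted probability scaled by a tuning factor; the helper name is hypothetical:

```python
import statistics

def predict_with_threshold(probabilities, factor=1.0):
    """Classify using a per-model cutoff derived from the score distribution
    (median predicted probability * tuning factor) instead of 0.5."""
    cutoff = statistics.median(probabilities) * factor
    return [1 if p >= cutoff else 0 for p in probabilities]
```

Raising the factor trades recall for precision (fewer, more confident positives); lowering it does the opposite, which is how the precision/recall balance was tuned per tag.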
Step 5: Packaging our ML Classifier
Finally, we packaged our classifier! To train the final models, we used the 50/50 balanced split of our data described earlier.
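Packaging here amounts to serializing the trained per-tag models so the application can load them at prediction time; a minimal pickle-based sketch, with illustrative names and file paths:

```python
import pickle

def package_models(models, path="tag_models.pkl"):
    """Serialize the dict of per-tag models (e.g. {"VC": model, ...})
    to a single file the application can ship with."""
    with open(path, "wb") as f:
        pickle.dump(models, f)

def load_models(path="tag_models.pkl"):
    """Load the packaged models back for use at prediction time."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Since the vectors in step 3 depend on the bag-of-words ordering, the vocabulary would need to be packaged alongside the models so new users are vectorized identically.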
As mentioned earlier, this is part of our ultimate goal of automating tagging in our application. This model is just one piece of the puzzle: tweets are an essential asset for predicting tags, but taken alone they may not be as reliable as desired. We therefore plan to leverage other data sources with similarly rich context about a person’s expertise. We hope that using multiple data sources in conjunction will help us precisely predict tags for many users. Knowing these tags is fundamental to the 4Degrees platform and will allow our users to get the most out of their professional networks.