Baby steps into Binary Text Classification
10 Nov 2015
One of the simpler problems in machine learning is text classification in the English language. Let's face it, English is one of the easiest languages to pick up: only 26 letters, understood by computers and people worldwide. Text classification does get challenging, though, when you have lots of unstructured data at hand.
We all more or less know about the Big Data hype by now. In the simplest of terms, it is the huge volume of data generated every second of every day by your online activities. One such gold mine of data is Twitter, which generates petabytes of content every day, ranging from news to gossip, drama and whatnot. Incidentally, Twitter datasets are of great interest to the research community; these days most data scientists have worked with one at some point, and lots of tools are available for them.
In this post, I will try to explain basic text classification using such a Twitter dataset, so that hopefully you get interested enough to start off with Kaggle competitions (oh btw, if you are new to this, Kaggle is a company that hosts regular machine learning competitions online, where uber nerds fight it out for glory!). Btw, I will be using R here, although everything can be replicated in Python.
Problem : Before diving into the how, let's look at the why. Say we have a dataset of news tweets in CSV format, from accounts such as @ndtv and @timesNow, and we want to classify those tweets into two broad categories: crime and non-crime.
Example of crime news tweet (ouch!) :
“Snatcher slashes elderly citizens in Kolkata’s Golf Green: Elderly residents of the posh Golf Green #KOLKATA #NEWS,http://t.co/FOW4XX9inD”
Example of non crime news tweet (yay!) :
“Kolkata’s truffle rosogolla gets Forbes stamp: For the last 11 years, this entrepreneur has #KOLKATA #NEWS,http://t.co/RyzFQFzOwU”
Our job is to create an automatic classifier, which can easily distinguish between these two.
Labelling : This part requires human intervention and generally takes quite some time (boring as hell!). If you are participating in a Kaggle competition, this will already be done for you.
The thing is, we have lots of Twitter news data, but we have to record somewhere which tweet belongs to which class. In our dataset, we have created a new column named target and stored the class labels in it: 0 corresponds to non-crime, and 1 corresponds to crime. You can get our labelled dataset here.
Prepare the Data : Notice a few quirks of the above tweets. Since a tweet cannot be more than 140 characters, news agencies tend to include a link to the full article. We can also see hashtags embedded in the tweet. Once these are removed, the actual text is well under 140 characters.
First, we load the data into a data frame in R.
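As a rough sketch of this loading step in Python rather than R, the rows can be read with the standard csv module. The column names text and target follow the dataset described above; the two sample rows and the in-memory file are made up for illustration, and load_tweets is a hypothetical helper name.

```python
import csv
import io

# Hypothetical two-row sample standing in for the labelled CSV file.
SAMPLE_CSV = '''text,target
"Snatcher slashes elderly citizens in Kolkata #KOLKATA #NEWS",1
"Truffle rosogolla gets Forbes stamp #KOLKATA #NEWS",0
'''

def load_tweets(handle):
    """Read rows into a list of dicts, converting the label to an int."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({"text": row["text"], "target": int(row["target"])})
    return rows

crime_data = load_tweets(io.StringIO(SAMPLE_CSV))
print(len(crime_data), crime_data[0]["target"])
```

With a real file, the same function would take an open file handle instead of the in-memory sample.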
We define a few regexes for the patterns to be extracted from the data. Regex is a handy way to quickly find text patterns.
Then we strip those from our data!
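To make the stripping step concrete, here is a minimal Python sketch (not the R code this post uses). The two patterns are assumptions about what counts as a link and a hashtag, and clean_tweet is a hypothetical helper name.

```python
import re

# Assumed patterns: links start with http(s)://, hashtags with '#'.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")

def clean_tweet(text):
    """Remove links and hashtags, then tidy leftover punctuation and spaces."""
    text = URL_PATTERN.sub("", text)
    text = HASHTAG_PATTERN.sub("", text)
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text.strip(" ,")            # drop stray commas/spaces at the ends

tweet = ("Snatcher slashes elderly citizens in Kolkata's Golf Green: "
         "Elderly residents of the posh Golf Green #KOLKATA #NEWS,http://t.co/FOW4XX9inD")
print(clean_tweet(tweet))
```

Removing the link first matters here, because the hashtag and the URL are glued together by a comma in the raw tweet.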
The crimeData$text notation refers to the text column of the dataset. Now that we have cleaned data, let's begin to “train” a classifier.
Training : In supervised learning scenarios, we typically train a classifier on a sample dataset and then test its accuracy on another. In unsupervised learning, by contrast, we let the machine figure things out on its own. Here, we will be looking at basic supervised learning via a popular algorithm known as SVM (Support Vector Machines). If you want to learn more about SVM, read this blog and this lecture slide.
Historically, SVM has been one of the de facto choices for text classification, for a lot of reasons. SVM is also a fast algorithm that can be set up with a choice of multiple “kernels” (think of a kernel as something like the graphics card a PC needs to run a game).
Next, we need to split our data into a training set and a test set. The usual convention is somewhere between an 80%-20% and a 70%-30% split. Let's split it in a 70%-30% ratio. Notice we are not actually splitting the data yet, just storing the split indexes.
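The idea of storing split indexes instead of copying rows can be sketched in Python as follows. This is only an illustration: split_indices, the shuffle, and the seed value are assumptions, not the post's R code.

```python
import random

def split_indices(n_rows, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx): shuffled row positions, not the rows themselves."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Keeping only the positions means the same split can later be applied to both the text column and the target column.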
Then we separate the training data (i.e. the text) and the training codes (i.e. the target) from our dataset.
Now, we will instruct RTextTools to create the training model for us. SVM does not understand text data; it only understands numeric weights. Fortunately, RTextTools gives us an option to create a matrix from the text data by weighting word frequencies with the tf-idf (term frequency - inverse document frequency) metric.
RTextTools handled lots of things for us there. The code is mostly self-explanatory. stemWords=TRUE indicates we want to stem words such as “cars” to their root “car”, and removeStopwords=TRUE indicates we want to remove unnecessary stop words like “a”, “the”, “and” etc. (Read the reason here).
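To give a feel for what such a matrix contains, here is a small Python sketch of stop word removal followed by one common tf-idf variant. It is not necessarily the exact weighting RTextTools applies, stemming is left out, and the stop word list and sample sentences are made up.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "and", "in", "of"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tfidf_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] is tf * idf of word j in doc i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([
            (counts[w] / len(doc)) * math.log(n_docs / df[w]) if doc else 0.0
            for w in vocabulary
        ])
    return vocabulary, rows

docs = ["the snatcher slashes a citizen", "the rosogolla gets a stamp", "snatcher gets away"]
vocab, matrix = tfidf_matrix(docs)
print(vocab)
print([round(x, 3) for x in matrix[0]])
```

Words that appear in fewer documents, such as "citizen", end up with a larger weight than words shared across documents, such as "snatcher".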
Next, we will specify the train-test split. Remember we saved our split indices previously?
Finally, we train our model and get the results!
The results vector contains the predicted classes. Suppose our test vector contained [0,1,0,0,1]. Then we would want our results vector to be exactly the same. Let's see the performance analysis of our classifier.
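Let me first define these metrics. Simply put, precision is the fraction of tweets predicted as a class that really belong to that class, and recall is the fraction of tweets of a class that the classifier managed to find.
The higher the precision, the fewer false alarms the classifier raises for the class. The higher the recall, the fewer tweets of that class it misses.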
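As a small worked example in Python, precision and recall for the crime class can be computed directly from the actual and predicted label vectors. The predicted vector below is hypothetical, and precision_recall is an illustrative helper, not an RTextTools function.

```python
def precision_recall(actual, predicted, positive=1):
    """Compute precision and recall for one class from two label lists."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

actual = [0, 1, 0, 0, 1]      # the test vector from the example above
predicted = [0, 1, 0, 0, 0]   # hypothetical output that misses one crime tweet
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

Here every tweet flagged as crime really is crime (precision 1.0), but only one of the two crime tweets was found (recall 0.5).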
In our case, the recall is quite low! How did this happen? Wait a minute, we haven't analysed our data properly yet! Let's check the class distribution in our data.
Oh, so that's the issue. See, in class 1, i.e. in our case the crime data, the proportion (yes, that's what prop.table gives) of data is extremely small (429 vs 5897), under 7%! Our machine wasn't able to train properly for the crime class because of the huge abundance of non-crime examples. Sorry machine, my bad!
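The same proportion check can be sketched in Python using the two class counts quoted above. class_proportions is a hypothetical helper, loosely mirroring what prop.table reports in R.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: share of rows}, similar in spirit to R's prop.table(table(x))."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Label counts quoted in the post: 5897 non-crime rows, 429 crime rows.
labels = [0] * 5897 + [1] * 429
proportions = class_proportions(labels)
print({label: round(share, 3) for label, share in proportions.items()})
```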
Typically, a machine learning model works best when the class distribution is more or less equal. In real-world cases like this one, though, you will hardly ever find equal class distributions (and in a crime scenario an unequal one IS desirable, for obvious reasons). Therefore, we have to manually tweak the data to make things work.
Sampling : To make our classes more or less equal, we have to perform sampling. Sampling comes in three varieties :
- Oversampling -> Increase the minority class count, keeping the majority the same
- Undersampling -> Decrease the majority class count, keeping the minority the same
- Hybrid Sampling -> Increase the minority class and decrease the majority class together
Where do we get this extra data? From the original dataset itself! We will be using a very useful technique called SMOTE, which creates synthetic data points from the existing minority-class rows in the dataset. Read more about how to implement SMOTE in R in this awesome walkthrough.
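The core idea behind SMOTE, interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines of Python. This is a simplified illustration under made-up 2-D features, not the R implementation used in this post; smote_like and its parameters are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # how far to move from base towards the neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D feature vectors for the minority (crime) class.
crime_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
new_points = smote_like(crime_points, n_new=6)
print(len(new_points))
```

Every synthetic point lies on a line segment between two real minority points, which is why the new data stays plausible instead of being random noise.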
Let's get going! We will try to oversample our crime data.
We are calling SMOTE with various parameters; let's break it down :
target: variable name of the class column
perc.under: Percentage of undersampling to be done, measured against the number of synthetic rows generated. With 100, one majority row is kept for every synthetic minority row, which in our case shrinks the majority class to somewhere around half of its population
perc.over: Percentage of oversampling. Here we are increasing the minority by 800%, i.e. generating eight synthetic rows for every existing minority row
Let's see what it results in :
Whoa! The class distribution is more or less similar now! That's great! Our crime class is indeed 8 times what it was before. Now let's run our classifier over this dataset again.
Now let's check the results again!
Wow, now that's a great leap! From 0.59 straight to 0.98!
That said, our classifier could well be overfitted. What is overfitting? It happens when the classifier works extremely well on the training set: if plotted, the decision boundary (the class separator) hugs the training points too tightly. Kind of like this :
What's the problem then? The thing is, overfitted classifiers work well only on the training set. Give them a completely new test set and they will fail spectacularly!
How do you know whether your classifier is overfitted or not? The scientific way is to have three datasets, one for training, one for testing, and another for validating, and then run the model over the validation dataset to get its performance. We only used two datasets, and since we have little data, we do not have the luxury of a third split. No worries, for such situations we have cross-validation.
Cross-validation re-uses the same data we used for training and testing. But wait, isn't that exactly what we want to avoid? It turns out a cross-validator uses a k-fold method, where k is just the number of splits. For example, with 5-fold cross-validation, the data is randomly divided into 5 sets, and each set in turn is validated using a model trained on the remaining 4 sets (k-1 in general).
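The k-fold bookkeeping described above can be sketched in Python like this. It only produces the index splits, with no model training; k_fold_indices and the seed value are illustrative assumptions.

```python
import random

def k_fold_indices(n_rows, k=5, seed=123):
    """Yield (train_idx, validation_idx) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # the seed makes the shuffle repeatable
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(10, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each row lands in the validation set exactly once, so no fold is ever scored by a model that saw it during training.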
As always, RTextTools provides us with this nifty function too.
We specify our earlier container, the algorithm, and a seed value, which is just a number that makes the random division into sets reproducible. It can be anything you like.
So, the results are also quite promising. The mean accuracy reported is 98%. In Kaggle competitions, though, people win or lose by margins as small as 98% versus 99%.
Conclusion : A lot of modifications and tweaks can be made to SVM to make it perform even better. Using a linear kernel and adding regularization can also help improve accuracy. Check these blogs for more info on linear kernels.
These same operations can also be done in Python. I just love those Jupyter notebooks, be sure to check them out!
Happy data hunting!