Understanding the Labour Market of India

In early January we started tackling an interesting problem: understanding the labour market of our country with the help of AMCAT data, for the IKDD CODS Data Challenge 2016. For the uninitiated, AMCAT, or Aspiring Minds Computer Adaptive Test, is a standardized test taken by Indian graduates to get better job opportunities. AMCAT released a subset of their candidate pool dataset for the year 2015 and posed the challenge of building a model to predict the salary obtained by future candidates. The challenge was open till 15th January 2016, the abstract submission date, although surprisingly the leaderboard was kept open till 31st January, the final paper submission deadline. Our submission (SK_Data) was ranked first on the leaderboard till 15th January, but we slipped one position at the very end of the challenge by a single digit!

The dataset contains various information about each candidate: profile information including past academic details, as well as employment outcomes. The specific features of the dataset are explored below.

Target

Our target is to predict the Salary. Let's see how it goes! In my last blog post I used R tools; this time I am using Python tools, especially the scikit-learn library. While I like to keep myself open regarding technology choices, Python and its stack really blew my mind, and Jupyter Notebooks are just awesome! For visualization we used both Seaborn and Matplotlib. Detailed exploration and results are provided in the accompanying IPython Notebook.
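To set the stage, here is a minimal loading sketch; the filename train.csv is an assumption, not necessarily the exact file shipped with the challenge:

```python
import pandas as pd

# "train.csv" is an assumed filename for the released training data.
df = pd.read_csv("train.csv")
print(df.shape)   # expect (3998, 28): 27 features plus the Salary column
df.head()
```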

Analysis of Data

The dataset has 27 features in total and 3,998 rows. Salary is distributed over a wide range of values:
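A minimal sketch of how this distribution can be plotted, assuming the target column is named Salary:

```python
import matplotlib.pyplot as plt

# Histogram of the target; the 'Salary' column name is assumed.
df["Salary"].plot(kind="hist", bins=50)
plt.xlabel("Salary (INR per annum)")
plt.ylabel("Number of candidates")
plt.show()
```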

As you can see, most salaries are concentrated in the 3 to 5 LPA region, which matches the average salary obtained by Indian candidates. A few outliers are present above 10 LPA.

Let's see how the academic scores, such as 10th percentage, 12th percentage, and College GPA, stand with respect to Salary.
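A sketch of one way to draw this comparison; the score column names (10percentage, 12percentage, collegeGPA) are assumptions based on the feature descriptions above:

```python
import matplotlib.pyplot as plt

# Scatter of each academic score against Salary; column names assumed.
scores = ["10percentage", "12percentage", "collegeGPA"]
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, scores):
    ax.scatter(df[col], df["Salary"], alpha=0.3, s=10)
    ax.set_xlabel(col)
axes[0].set_ylabel("Salary")
plt.tight_layout()
plt.show()
```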

The data shows that these scores are slightly correlated with Salary (Pearson correlation coefficient 0.17), but the College GPA scores yield an important observation: all candidates with a reported salary have a GPA above 55%, which indicates that this is the minimum cutoff required!
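The correlation itself is a one-liner in pandas (same assumed column names as above):

```python
# Pearson correlation (the pandas default) of each score with Salary.
for col in ["10percentage", "12percentage", "collegeGPA"]:
    r = df[col].corr(df["Salary"])
    print(f"{col}: r = {r:.2f}")
```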

Okay, how do the AMCAT scores stack up against Salary?

Feature Engineering

After extensive research, we found that recruiters tend to have a cutoff value for each AMCAT section. This cutoff is private to each company, but by searching through AMCAT posts and public blog posts we got to know the range of these cutoffs. Therefore, we built our own cutoff-based scoring system on a scale of 12, as sketched below.
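A sketch of the idea; the section names and cutoff values below are illustrative placeholders, not the actual cutoffs we collected:

```python
# Illustrative cutoff-based scoring: for each AMCAT section, a candidate
# earns a point for every cutoff level their score clears. Section names
# and cutoff values here are placeholders, not the real recruiter cutoffs.
SECTION_CUTOFFS = {
    "English":      [400, 500, 600],
    "Logical":      [400, 500, 600],
    "Quant":        [400, 500, 600],
    "ComputerProg": [350, 450, 550],
}

def amcat_score(row):
    """Return a score out of 12 (4 sections x 3 cutoff levels)."""
    points = 0
    for section, cutoffs in SECTION_CUTOFFS.items():
        points += sum(row[section] >= c for c in cutoffs)
    return points

df["amcat_score"] = df.apply(amcat_score, axis=1)
```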

Similarly, AMCAT also provides personality test scores based on the Big Five traits, normalized to scores between -1 and +1. From AMCAT blogs we learned that +0.44 marks candidates in the top 33% of the pool, and conversely -0.44 marks candidates in the bottom 33%. We created another scoring system for the Big Five traits based on these cutoffs, on a scale of 15 (3 levels x 5 traits).
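A sketch of this trait scoring under the same caveat; the trait column names and the 1/2/3 point assignment are illustrative assumptions:

```python
# Illustrative Big Five scoring: each trait contributes 1 point (bottom
# third, below -0.44), 2 (middle), or 3 (top third, above +0.44), for a
# maximum of 15 (3 levels x 5 traits). Column names are assumed.
TRAITS = ["conscientiousness", "agreeableness", "extraversion",
          "neuroticism", "openness_to_experience"]

def big5_score(row):
    points = 0
    for trait in TRAITS:
        if row[trait] >= 0.44:
            points += 3
        elif row[trait] <= -0.44:
            points += 1
        else:
            points += 2
    return points

df["big5_score"] = df.apply(big5_score, axis=1)
```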

Graduation Time: the time taken by a candidate to graduate, calculated from their DOB to the year they graduated, has a strong positive effect on salary. This feature shows a marked trend: most high-paying jobs were bagged by candidates who graduated between 21 and 24 years of age, with the highest salaries going to those who graduated at 22 to 23. In the Indian context candidates typically complete a B.Tech at 22, so it is evident that the highest salaries are bagged by freshers who have not lost any year due to backlogs at any academic level.
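A sketch of deriving this feature, assuming the columns are named DOB and GraduationYear:

```python
import pandas as pd

# Graduation age in years; DOB and GraduationYear column names assumed.
df["grad_age"] = df["GraduationYear"] - pd.to_datetime(df["DOB"]).dt.year

# Quick look at the trend of salary against graduation age.
print(df.groupby("grad_age")["Salary"].median())
```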

Model Building and Selection

We were primarily interested in the results from linear models and Random Forests; since this is a relatively small dataset, linear models tend to work well with this type of data. After running the models on the training set, these were the results obtained (a sketch of the comparison loop follows the table):

Model                   RMSE
SVM Regression          231166.3
Random Forests          219117.1
Linear SVM Regression   227827.2
Lasso Regression        217693.4
Linear Regression       217737.6
Logistic Regression     269024.1
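A minimal sketch of such a comparison loop using 5-fold cross-validated RMSE; the feature matrix X and target y are assumed to be already prepared, the hyperparameters are placeholders rather than our tuned settings, and the logistic regression entry is omitted here since it needs a discretized target:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR, LinearSVR

models = {
    "SVM Regression": SVR(),
    "Random Forests": RandomForestRegressor(n_estimators=100, random_state=0),
    "Linear SVM Regression": LinearSVR(),
    "Lasso Regression": Lasso(alpha=1.0),
    "Linear Regression": LinearRegression(),
}

for name, model in models.items():
    # cross_val_score returns negated MSE; negate, average, then take root.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error")
    print(f"{name}: RMSE = {np.sqrt(mse.mean()):.1f}")
```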

As is evident from the output, Lasso Regression gives the best results. Lasso, or Least Absolute Shrinkage and Selection Operator, is a regression method that penalizes the absolute size of the regression coefficients. It is especially suitable when we want a bit of automatic feature selection, as it shrinks the coefficients of irrelevant features to zero.
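A minimal Lasso sketch; the alpha value is a placeholder rather than our tuned setting, and X_train / y_train are assumed prepared:

```python
from sklearn.linear_model import Lasso

# alpha is a placeholder; in practice it should be tuned, e.g. by CV.
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)

# Lasso drives irrelevant coefficients exactly to zero, which is the
# automatic feature selection mentioned above.
n_kept = (lasso.coef_ != 0).sum()
print(f"{n_kept} of {len(lasso.coef_)} features kept")
```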

On submitting the test set predictions to the Aspiring Minds leaderboard, we achieved an MSE score of 13183.6.

Insights

These are some of the insights we gleaned from the dataset:

On City Tier, as per the AMCAT annotation, students from Tier 0 city colleges get better pay packages. This corresponds with the ground truth that a metro city has more industries, and hence its colleges enjoy more exposure to recruiters.
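A quick way to check this, assuming a CollegeCityTier column:

```python
# Median salary per college city tier; 'CollegeCityTier' is an
# assumed column name.
print(df.groupby("CollegeCityTier")["Salary"].median())
```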

Final Word

We enjoyed playing with this dataset a lot. We got to learn and sharpen our skills, and observed that the Python tools for exploration and modelling are really cool! The most important takeaway from this exercise is that ground truth and feature engineering matter most; the algorithm is secondary. Special thanks to Hangtwenty's Dive into Machine Learning list for providing an excellent resource of tutorials and best practices. If you want to follow what we have done, you can find the full code and dataset in my repository. Next stop: learning Deep Learning with TensorFlow!

Happy Data Hunting