Understanding Labour Market of India
01 Feb 2016
From early January we started tackling an interesting problem for the IKDD CODS DataChallenge 2016: understanding the labour market of our country with the help of AMCAT data. For the uninitiated, AMCAT (Aspiring Minds Computer Adaptive Test) is a standardized test taken by Indian graduates to get better job opportunities. AMCAT released a subset of their candidate pool dataset for the year 2015 and posed the challenge of building a model to predict the Salary obtained by future candidates. The challenge was open till 15th January 2016, which was the abstract submission date, although surprisingly they kept the leaderboard open till 31st January, which was the final paper submission deadline. Our submission (SK_Data) was ranked first on the leaderboard till 15th Jan, but we slipped one position at the very end of the challenge by a single digit!
The dataset contains various information about each candidate: profile information, including past academic details, as well as employment outcomes. Some specific features of the dataset:
- Scores of AMCAT, containing Quants, English, Logical, and Domain scores
- Personal information like date of birth, gender, etc.
- University information like GPA obtained, College Tier (set by AMCAT), College City Tier (also set by AMCAT)
- Pre-university information like 10th grade marks, 12th grade marks and board information
- Employment outcomes like Job Title, Job Location and Salary
Our target is to predict the Salary. Let's see how it goes! In my last blog post I used R tools; this time I am using Python tools, especially the scikit-learn library. While I like to keep myself open regarding technology usage, Python and its stack really blew my mind! Jupyter Notebooks are just awesome! For visualization we have used both Seaborn and Matplotlib. Detailed exploration and results are provided in the following IPython Notebook.
Analysis of Data
The dataset has 27 features in total and 3,998 rows. The Salary is distributed over a wide range of values:
As you can see, most of the Salary values are concentrated in the 3 to 5 LPA region, which is the average salary obtained by Indian candidates. A few outliers are present above 10 LPA.
Let's see how the academic scores (10th percentage, 12th percentage and College GPA) stand against Salary.
Clearly from the data, the scores are only slightly correlated with Salary (Pearson correlation coefficient 0.17). An important observation comes from the College GPA scores, though: every candidate with a recorded Salary has a GPA above 55%, which suggests that this is the minimum cutoff required!
OK, how do AMCAT scores stand up against Salary?
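A correlation figure like this can be computed with pandas. The following is only a minimal sketch: the column names `collegeGPA` and `Salary` and the five toy rows are assumptions made for illustration, not the actual AMCAT schema or data.

```python
import pandas as pd

# Toy rows for illustration only; column names are assumed, not the AMCAT schema.
toy = pd.DataFrame({
    "collegeGPA": [60.0, 65.0, 70.0, 75.0, 80.0],
    "Salary": [300000, 320000, 310000, 400000, 380000],
})

# Pearson correlation coefficient between one academic score and Salary.
r = toy["collegeGPA"].corr(toy["Salary"], method="pearson")

# Lowest GPA among candidates with a recorded Salary (the apparent cutoff).
min_gpa = toy.loc[toy["Salary"].notna(), "collegeGPA"].min()

print(round(r, 3), min_gpa)
```

The same two lines can be repeated for the 10th and 12th percentage columns to compare how each score relates to Salary.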
After extensive research, we found out that recruiters tend to have a cutoff value for each section. These cutoff values are private to each company, but by searching through some AMCAT posts and public blog posts we got to know the range of these cutoffs. Therefore, we built our own scoring system, out of 12, based on the cutoffs.
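One way such a cutoff-based score could be built is sketched below. Everything specific here is an assumption: the cutoff values are made up (the real recruiter cutoffs are not public), and the reading of "out of 12" as four sections with up to three points each is our guess at a plausible layout, not a record of the exact scheme.

```python
import numpy as np

# Hypothetical cutoffs per section, in ascending order; not the real recruiter values.
CUTOFFS = {
    "Quant": [400, 500, 600],
    "English": [400, 500, 600],
    "Logical": [400, 500, 600],
    "Domain": [400, 500, 600],
}

def cutoff_score(scores):
    """Sum, over all sections, how many cutoffs each section score reaches (0 to 3)."""
    total = 0
    for section, cutoffs in CUTOFFS.items():
        # side="right" counts the cutoffs that are <= the candidate's score.
        total += int(np.searchsorted(cutoffs, scores[section], side="right"))
    return total

candidate = {"Quant": 700, "English": 450, "Logical": 500, "Domain": 100}
print(cutoff_score(candidate))
```

Because only the number of cutoffs cleared matters, a Quant score of 600 and a Quant score of 700 earn the same points, which mirrors how a recruiter with a fixed top cutoff would treat them.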
Similarly, AMCAT also provided personality test scores based on the Big Five traits, each a normalized score between -1 and +1. From AMCAT blogs, we got to know that +0.44 represents candidates in the top 33% pool and, conversely, -0.44 represents candidates in the bottom pool. We created another scoring system for the Big Five traits based on these cutoffs, on a scale of 16 (3 levels X 5 traits).
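The three-level idea can be sketched as follows. The ±0.44 thresholds come from the text above; the trait key names and the choice of 1, 2 and 3 as level values are assumptions for illustration, so the totals here run from 5 to 15 and may not match the exact scale we used.

```python
# Hypothetical trait keys; the dataset's real column names may differ.
TRAITS = ["conscientiousness", "agreeableness", "extraversion",
          "neuroticism", "openness"]

def trait_level(value, low=-0.44, high=0.44):
    """Map one normalized trait score to 1 (bottom pool), 2 (middle) or 3 (top pool)."""
    if value >= high:
        return 3
    if value <= low:
        return 1
    return 2

def big_five_score(row):
    """Add up the levels of all five traits (assumed level values 1 to 3)."""
    return sum(trait_level(row[trait]) for trait in TRAITS)

candidate = {
    "conscientiousness": 0.5,
    "agreeableness": -0.5,
    "extraversion": 0.0,
    "neuroticism": 0.44,
    "openness": -0.44,
}
print(big_five_score(candidate))
```

Whether a trait such as neuroticism should be scored in reverse is a modelling decision this sketch does not take a position on.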
Graduation Time : The age at which a candidate graduates, calculated from their DOB and the year they graduate, has a strong effect on the salary. This feature shows a marked trend: most high-paying jobs were bagged by candidates who graduated at 21 to 24 years of age, with the highest salaries received by those who graduated at 22 to 23. Typically, in the Indian context, candidates finish a B.Tech at 22, so it is evident that the highest salaries are bagged by freshers who have not lost any year to backlogs at any academic level.
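Deriving this feature takes one line of pandas. The sketch below assumes hypothetical column names `DOB` and `GraduationYear` and uses three made-up rows; it approximates the graduation age by subtracting the birth year from the graduation year.

```python
import pandas as pd

# Toy rows for illustration; column names are assumptions.
toy = pd.DataFrame({
    "DOB": pd.to_datetime(["1991-05-10", "1990-11-02", "1988-01-20"]),
    "GraduationYear": [2013, 2013, 2014],
})

# Approximate age at graduation: graduation year minus year of birth.
toy["GraduationAge"] = toy["GraduationYear"] - toy["DOB"].dt.year

print(toy["GraduationAge"].tolist())
```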
Model Building and Selection
We were primarily interested in the results from Linear Models and Random Forests. Since this is a relatively small dataset, Linear Models tend to work well with this type of data. After running the models on the training set, these were the results obtained:
|Linear SVM Regression||227827.2|
As is evident from the output, Lasso Regression gives the best results. Lasso, or Least Absolute Shrinkage and Selection Operator, is a regression method that penalizes the absolute size of the regression coefficients. That is especially suitable when we want a bit of automatic feature selection, as it shrinks the coefficients of irrelevant features to zero.
On submitting the test data output to the Aspiring Minds leaderboard, we achieved an MSE score of 13183.6.
These are some of the insights we gleaned from the dataset:
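The feature-selection behaviour of Lasso is easy to see on synthetic data. The sketch below is not our actual pipeline: the data are randomly generated, the true coefficients (3 and -2) and `alpha=0.5` are arbitrary choices for the demonstration, and only the first two of five features influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Only features 0 and 1 drive the target; features 2 to 4 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# alpha controls the strength of the L1 penalty (arbitrary value for this demo).
model = Lasso(alpha=0.5).fit(X_scaled, y)

print(model.coef_)
```

The coefficients of the noise features come out as exactly zero, while the two informative ones survive in shrunken form. In practice `alpha` would be tuned, for example with `LassoCV`.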
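For reference, the standard definition of mean squared error is shown below on three made-up values. How the leaderboard scaled or transformed Salary before scoring is not something this sketch assumes.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])

# MSE is the average of the squared prediction errors.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```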
College Tier : The dataset contains a College Tier field in which AMCAT annotated each college with a score of 1 or 2. The annotations were computed from the average AMCAT scores obtained by the students of the college/university. Colleges with an average score above a threshold are tagged as 1 and the others as 2.
In our observation, Tier 1 college candidates secured higher salaries on average than Tier 2 college candidates, which coincides with the general trend.
Similarly, for City Tier as per the AMCAT annotation, students from Tier 0 city colleges get better pay packages. This matches the ground truth: a metro city has more industries, and hence its colleges enjoy more exposure to recruiters.
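Comparisons like these boil down to a group-wise mean. A minimal sketch, assuming hypothetical column names `CollegeTier`, `CollegeCityTier` and `Salary` and five made-up rows:

```python
import pandas as pd

# Toy rows for illustration; values and column names are assumptions.
toy = pd.DataFrame({
    "CollegeTier": [1, 1, 2, 2, 2],
    "CollegeCityTier": [0, 1, 0, 1, 1],
    "Salary": [500000, 400000, 350000, 300000, 250000],
})

# Average Salary per college tier and per city tier.
by_college = toy.groupby("CollegeTier")["Salary"].mean()
by_city = toy.groupby("CollegeCityTier")["Salary"].mean()

print(by_college)
print(by_city)
```

On skewed salary data it can be worth checking the median as well, since a handful of very high salaries can pull the mean up.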
- Specialization has a marked effect on the salary, with Computer Science graduates grabbing the most lucrative deals, followed by Electronics & Communication Engineering and Information Technology. The core streams, such as Mechanical, Civil, Electrical and Instrumentation Engineering, lag behind with lower salaries than the rest, which reflects the current trend of the country, where there is a severe shortage
- Top Job Location : Bangalore. Unsurprisingly, India’s first choice of job location is Bangalore, the Silicon Valley of the East, followed by Noida (Gurgaon), Hyderabad, Pune and Chennai. This shows how the Indian labour market currently revolves around these top cities.
- Top Designations : According to the dataset, most freshers land the job title “Software Engineer”, otherwise also known as “Associate Software Engineer” and “Programmer Analyst”, which together account for more than 32% of the pool, followed by “Systems Engineer”, “Java Developer” and “Software Test Engineer”.
- AMCAT scores distribution : Recruiters are found to select candidates based on a cutoff AMCAT score, which we corroborated through our research on the dataset. Thus, if a recruiter sets 600 as the top cutoff for the Quants section, candidates scoring 600 and 700 would both be judged at the same level. We also tried to find out which job posts require the highest mean cutoffs, and found that the “Systems Engineer” and “Senior Software Engineer” posts demand the highest cutoffs for Quant and Logical. Almost every job title demands mean cutoffs in the order Quants > English > Logical, but one particular job post, “Technical Support Engineer”, demands high English scores, which makes sense as these candidates need to be fluent in English to handle overseas clients.
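The per-designation comparison in the last point can be sketched as below. The rows are made up and the column names (`Designation`, `Quant`, `English`, `Logical`) are assumptions; the sketch uses the mean section score of hired candidates as a rough stand-in for a job title's cutoff.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Designation": ["systems engineer", "systems engineer",
                    "technical support engineer", "technical support engineer"],
    "Quant": [650, 610, 480, 500],
    "English": [560, 540, 600, 620],
    "Logical": [600, 580, 470, 490],
})

# Mean section score per job title.
means = toy.groupby("Designation")[["Quant", "English", "Logical"]].mean()

# For each job title, the section with the highest mean score.
top_section = means.idxmax(axis=1)

print(means)
print(top_section)
```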
We really enjoyed playing with this dataset. We got to learn and sharpen our skills, and observed that Python tools for exploration and modelling are really cool! The most important takeaway from this exercise is that ground truth and feature engineering matter most; the algorithm is secondary. Special thanks to Hangtwenty’s Dive into Machine Learning list for providing an excellent resource of tutorials and best practices. If you want to follow what we have done, you can find the full code sample and dataset at my repository. Next stop, learning Deep Learning with TensorFlow!
Happy Data Hunting