EXPLORATION OF ABSENTEEISM WITH MACHINE LEARNING
Abstract
To keep up with intense business competition and rising customer expectations, one factor matters a great deal: employee performance. Labor is a major factor of the economy, because a business progresses only when people put effort into it. An employee who performs poorly disrupts the flow of labor, which disrupts production and, in turn, profit. Every organization wants maximum profit and minimal cost, and ineffective employees hinder both. One of the most controllable factors affecting employee effectiveness is punctuality. Understanding the causes of absenteeism may therefore give an organization the tools to improve profit and customer satisfaction.
The aim of this paper is to help the human resources department of any organization discover the factors and causes of employees’ absence using data analysis and machine learning. We analyze the absenteeism dataset, explore the factors that correlate well with absenteeism, and present the insights drawn from the data together with proposals to address most of them.
This paper is divided into four stages: the introduction, the data exploration, the model development and the model interpretation.
INTRODUCTION
This paper focuses on extracting the relationship between general information about employees (stored in the organization's database) and their reasons for absence, and on predicting their absenteeism rate. Knowing this relationship gives the organization a competitive tool for addressing the consequences of employee absence, and helps human resources management improve recruitment and crisis management.
There are different ways of exploring data, but in this study we used classification both to analyze the data and to make predictions. Classification is a branch of supervised machine learning in which models are trained on labeled data. After training, we test the models on a held-out percentage of the data by predicting the classes of unseen records and measuring how accurate those predictions are.
Four prediction models were built: Naïve Bayes, logistic regression, support vector machine and Random Forest. They are meant to help HR predict how effective an incoming employee could be, and so improve production rate and customer satisfaction. These models predict absenteeism with high accuracy.
Data dictionary
The dataset used is the popular “Absenteeism at work” dataset. It was created from records of absenteeism at work from July 2007 to July 2010 at a courier company in Brazil. It has also been used in academic research at the Universidade Nove de Julho, Postgraduate Program in Informatics and Knowledge Management.
These are the columns in our dataset and their meanings:
1. Individual identification (ID)
2. Reason for absence (ICD).
Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI)
I Certain infectious and parasitic diseases
II Neoplasms
III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
IV Endocrine, nutritional and metabolic diseases
V Mental and behavioral disorders
VI Diseases of the nervous system
VII Diseases of the eye and adnexa
VIII Diseases of the ear and mastoid process
IX Diseases of the circulatory system
X Diseases of the respiratory system
XI Diseases of the digestive system
XII Diseases of the skin and subcutaneous tissue
XIII Diseases of the musculoskeletal system and connective tissue
XIV Diseases of the genitourinary system
XV Pregnancy, childbirth and the puerperium
XVI Certain conditions originating in the perinatal period
XVII Congenital malformations, deformations and chromosomal abnormalities
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
XIX Injury, poisoning and certain other consequences of external causes
XX External causes of morbidity and mortality
XXI Factors influencing health status and contact with health services.
And 7 categories without ICD: patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target).
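The coded columns above can be turned into readable labels before plotting. The following is a minimal sketch, assuming the data sits in a pandas DataFrame whose column names match the list above. The three sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary above.
DAY_OF_WEEK = {2: "Monday", 3: "Tuesday", 4: "Wednesday", 5: "Thursday", 6: "Friday"}
EDUCATION = {1: "high school", 2: "graduate", 3: "postgraduate", 4: "master and doctor"}

# Tiny made-up sample; a real run would load the full dataset instead.
sample = pd.DataFrame({
    "Day of the week": [2, 6, 4],
    "Education": [1, 3, 1],
    "Absenteeism time in hours": [4, 0, 8],
})

# Work on a copy so the original numeric codes stay available for modelling.
decoded = sample.copy()
decoded["Day of the week"] = decoded["Day of the week"].map(DAY_OF_WEEK)
decoded["Education"] = decoded["Education"].map(EDUCATION)

print(decoded)
```

Keeping the numeric codes in the original frame and decoding only a copy means the labels are used for charts while the models still receive numbers.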
Feature Engineering
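We inspected the correlations between the columns with a heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
# data is the pandas DataFrame holding the dataset
corr = data.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True)
From this we worked out which of our attributes correlate with our target, and dropped 'Month of absence', 'Day of the week', 'Seasons', 'Weight', 'Height' and 'Hit target'.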
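The selection step can also be done numerically rather than by eye. This is a minimal sketch, not the notebook's actual code: the data is synthetic (generated so that distance is related to the target and the other columns are not), and only the list of dropped columns comes from the text above.

```python
import numpy as np
import pandas as pd

TARGET = "Absenteeism time in hours"

# Synthetic stand-in for the real table, using a few of its column names.
rng = np.random.default_rng(0)
n = 200
distance = rng.integers(5, 53, size=n)
data = pd.DataFrame({
    "Distance from Residence to Work": distance,
    "Weight": rng.integers(56, 109, size=n),
    "Height": rng.integers(163, 197, size=n),
    TARGET: distance * 0.2 + rng.normal(0, 1, size=n),
})

# Absolute correlation of every column with the target, strongest first.
target_corr = data.corr()[TARGET].drop(TARGET).abs().sort_values(ascending=False)
print(target_corr)

# Columns the article reports dropping; errors="ignore" skips any that are absent here.
to_drop = ["Month of absence", "Day of the week", "Seasons", "Weight", "Height", "Hit target"]
reduced = data.drop(columns=to_drop, errors="ignore")
print(list(reduced.columns))
```

Sorting by absolute correlation puts strongly negative relationships next to strongly positive ones, which is easy to miss when reading a heatmap.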
We then binned the following columns into categories:
- Age: young employee (25-35), mid age employee (36-45), old employee (above 45).
- Transportation expense: cheap (100-200), expensive (200-300), very expensive (above 300).
- Distance: close (0-15), far (15-35), very far (above 35).
- Absenteeism: late (below 0 hours), very late (0-10), extremely late (above 10).
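Binning of this kind can be sketched with `pd.cut`. The rows below are made up, and the handling of values that sit exactly on a boundary (for example a transportation expense of 200) is an assumption, since the ranges above overlap at their edges. Here each boundary value goes to the lower bin.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the data dictionary.
df = pd.DataFrame({
    "Age": [27, 35, 36, 45, 58],
    "Transportation expense": [118, 179, 235, 289, 361],
    "Distance from Residence to Work": [5, 15, 26, 36, 51],
})

# pd.cut uses right-closed intervals by default, so 35 falls in (24, 35].
# Ages outside the bins (24 or younger) would become NaN.
df["Age group"] = pd.cut(
    df["Age"],
    bins=[24, 35, 45, np.inf],
    labels=["young employee", "mid age employee", "old employee"],
)

# include_lowest=True makes the first interval closed on the left as well.
df["Transport group"] = pd.cut(
    df["Transportation expense"],
    bins=[100, 200, 300, np.inf],
    labels=["cheap", "expensive", "very expensive"],
    include_lowest=True,
)

df["Distance group"] = pd.cut(
    df["Distance from Residence to Work"],
    bins=[0, 15, 35, np.inf],
    labels=["close", "far", "very far"],
    include_lowest=True,
)

print(df[["Age group", "Transport group", "Distance group"]])
```

The target column can be binned the same way once its boundaries are fixed.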
Effect of distance and transportation.
Insight: Both transportation expense and distance have a strong effect on hours of absence; they need serious attention.
Proposal: The company should organize a staff bus that picks employees up and gets them to work on time.
Effect Of Disciplinary Action On Hours Absent
Insight: Employees who have been disciplined clearly make an effort to reduce their absence, so disciplinary action is an effective tool for improving attendance.
Proposal: Disciplinary actions should be enforced, and enforced openly, so that they motivate employees further.
Effect Of Drinking And Smoking.
Insight: Smoking has only a small effect on punctuality, though every hour absent counts. Drinking, on the other hand, accounts for a large number of hours absent.
Proposal: There should be impromptu checks for alcohol or any illicit substance, perhaps on a weekly basis, and offenders should be punished.
Effect Of Age On Hours Absent
Insight: Mid age employees are more diligent and punctual, while young and old employees have alarming numbers of hours absent.
Proposal: Occasional seminars should be organized to encourage the young employees, and the mid age range should be a basis for employment.
Effect Of Education On Hours Absent
Insight: From the graph we see that the level of education is reflected in the punctuality of individuals.
Proposal: Employees should be encouraged to further their education, perhaps with a promotion or a salary raise.
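The comparisons behind insights like these can be sketched as a group-by on the factor of interest. The rows below are invented purely to show the mechanics and do not reproduce the dataset or its results. The helper name `mean_hours_by` is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the dataset.
df = pd.DataFrame({
    "Social drinker": [1, 1, 0, 0, 1, 0],
    "Disciplinary failure": [0, 0, 0, 1, 0, 1],
    "Absenteeism time in hours": [8, 16, 4, 0, 8, 0],
})


def mean_hours_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Average absence hours for each value of `column` (hypothetical helper)."""
    return frame.groupby(column)["Absenteeism time in hours"].mean()


by_drinker = mean_hours_by(df, "Social drinker")
by_discipline = mean_hours_by(df, "Disciplinary failure")

print(by_drinker)
print(by_discipline)
```

A bar chart of each resulting Series gives the kind of per-group view the insights above refer to.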
Model Development
We split our data into train and test sets in the ratio 70:30, taking the absenteeism category as our target and choosing features that can be collected from a candidate for employment: Reason for absence (any of the listed ailments the candidate might have), Social smoker, Pet, Transportation expense, Distance from Residence to Work, Age, Social drinker, Body mass index, Education, Son.
We used the train set with these features to train the following models:
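A 70:30 split over these ten features might look like the sketch below, using scikit-learn's `train_test_split`. The feature values are random placeholders, and the target column name `Absenteeism category` and the `random_state` value are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# The ten candidate-level features listed in the text.
FEATURES = [
    "Reason for absence", "Social smoker", "Pet", "Transportation expense",
    "Distance from Residence to Work", "Age", "Social drinker",
    "Body mass index", "Education", "Son",
]
TARGET = "Absenteeism category"  # assumed name for the binned target column

# Random placeholder data standing in for the real table.
rng = np.random.default_rng(42)
n = 100
data = pd.DataFrame(rng.integers(0, 5, size=(n, len(FEATURES))), columns=FEATURES)
data[TARGET] = rng.choice(["late", "very late", "extremely late"], size=n)

# test_size=0.30 keeps 70% of the rows for training.
X_train, X_test, y_train, y_test = train_test_split(
    data[FEATURES], data[TARGET], test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split repeatable, so different models are compared on the same test rows.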
Naïve Bayes
Random forest
Logistic regression
Support vector machine.
After training, we evaluated each model on the test set and compared the resulting confusion matrices.
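Training and scoring the four model families can be sketched as a single loop. This is an illustration under stated assumptions rather than the notebook's code: the data is synthetic three-class data from `make_classification`, Gaussian Naive Bayes is assumed as the Naive Bayes variant, and all hyperparameters are scikit-learn defaults apart from `max_iter` and the random seeds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic three-class data standing in for the ten real features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),  # Gaussian variant is an assumption
    "Random forest": RandomForestClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": SVC(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Rows are actual classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1, 2]),
    }
    print(name, round(results[name]["accuracy"], 3))
```

Passing `labels` to `confusion_matrix` fixes the row and column order, which matters when the matrices of several models are read side by side.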
Model accuracy comparison
Naïve Bayes
[[ 6 0 15]
[ 0 10 0]
[ 10 7 174]]
- The confusion matrix shows that this model did fairly well: it correctly predicted 6 employees to be late and wrongly predicted 15 late employees to be extremely late.
- It correctly predicted 10 employees to be very late.
- It correctly predicted 174 employees to be extremely late, but wrongly predicted 10 extremely late employees to be late and 7 to be very late.
Logistic regression
[[ 0 0 21]
[ 0 3 7]
[ 0 2 189]]
1. The confusion matrix shows that all 21 late employees were wrongly predicted to be extremely late.
2. It correctly predicted 3 employees to be very late and wrongly predicted 7 very late employees to be extremely late.
3. It correctly predicted 189 employees to be extremely late and wrongly predicted 2 extremely late employees to be very late.
Support vector machine
[[ 0 0 21]
[ 0 0 10]
[ 0 0 191]]
1. The confusion matrix shows that the model wrongly predicted all 21 late employees to be extremely late.
2. It wrongly predicted all 10 very late employees to be extremely late.
3. It correctly predicted 191 employees to be extremely late.
Random Forest.
[[ 5 0 16]
[ 0 10 0]
[ 7 0 184]]
- The confusion matrix shows that the model did fairly well: it correctly predicted 5 employees to be late and wrongly predicted 16 late employees to be extremely late.
- It correctly predicted 10 employees to be very late.
- It correctly predicted 184 employees to be extremely late and wrongly predicted 7 extremely late employees to be late.
Random Forest reached an accuracy of 0.900 and a precision of 0.88, which makes it the most suitable model.
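The reported figures can be checked directly from the four confusion matrices above. The sketch below assumes the scikit-learn convention (rows are actual classes, columns are predicted classes) and assumes that the reported precision is the support-weighted average. Under those assumptions the Random Forest matrix gives an accuracy of about 0.90 and a weighted precision of about 0.88, matching the values in the text.

```python
import numpy as np

# Confusion matrices copied from the article (rows: actual, columns: predicted).
matrices = {
    "Naive Bayes": np.array([[6, 0, 15], [0, 10, 0], [10, 7, 174]]),
    "Logistic regression": np.array([[0, 0, 21], [0, 3, 7], [0, 2, 189]]),
    "Support vector machine": np.array([[0, 0, 21], [0, 0, 10], [0, 0, 191]]),
    "Random forest": np.array([[5, 0, 16], [0, 10, 0], [7, 0, 184]]),
}


def accuracy(cm: np.ndarray) -> float:
    """Share of correct predictions: diagonal total over grand total."""
    return float(np.trace(cm) / cm.sum())


def weighted_precision(cm: np.ndarray) -> float:
    """Per-class precision averaged with weights equal to each class's true count."""
    predicted_totals = cm.sum(axis=0)
    # Classes that were never predicted get precision 0 instead of a division by zero.
    per_class = np.divide(
        np.diag(cm), predicted_totals,
        out=np.zeros(len(cm)), where=predicted_totals > 0,
    )
    support = cm.sum(axis=1)
    return float((per_class * support).sum() / support.sum())


scores = {
    name: (accuracy(cm), weighted_precision(cm)) for name, cm in matrices.items()
}
for name, (acc, prec) in scores.items():
    print(f"{name}: accuracy={acc:.3f}, weighted precision={prec:.3f}")

best = max(scores, key=lambda name: scores[name][0])
print("Highest accuracy:", best)
```

Every matrix sums to 222 test records, so the accuracies are directly comparable across the four models.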
In short, with the help of our model, the HR department of a firm could screen out applicants who are likely to hinder the flow of business through high absenteeism, simply by collecting the needed data and running it through the model, and thereby optimize production rate, minimize cost of operation and maximize profit.
This publication is for the project assignment of AI Wednesday cohort, organized by Data Science Nigeria.
Special thanks to everyone who contributed to making the project successful: Abdul Quadri, Ojo Olawale Ogheneovo Idoghor Otulagun Daniel Oluwatosin, Haneefah Abdul-Rahman Lekki, David Ogunleye.
A special thanks to Data Science Nigeria for creating the platform to learn Data Science and Artificial Intelligence through world-class learning.
And to our instructors Daniel Ajisafe, Wunmi Rose and the entire AI Wednesday team, we say a big thank you.
Click here for the link to the notebook.