cwong8.github.io

MySQL Employees Sample Database

Using the MySQL Employees sample database to practice and learn SQL/MySQL before using Python and R to analyze queried data. Information about the database can be found at https://dev.mysql.com/doc/employee/en/. The database can be downloaded at https://github.com/datacharmer/test_db.

Tools used

Package pandas for dataframes and functions such as read_sql_query
Package numpy for general math functions
Package MySQLdb to connect to MySQL
Package scipy.stats for statistical testing
Packages matplotlib and seaborn for plotting

Creating a connection to MySQL, replace “yourusername” and “yourpassword” with the username/password to your MySQL.

import os
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

# Server connection to MySQL:
import MySQLdb
conn = MySQLdb.connect(host= "localhost",
                  user="yourusername",
                  passwd="yourpassword",
                  db="employees")

x = conn.cursor()

Employee promotions

In this section, I answer some questions regarding promotions within this fictional company.

How many promotions were given? To how many employees?
Has any employee been promoted more than once? What is the maximum times an employee has been promoted?
How many employees were never promoted or left before being promoted?
What is the average time before an employee is promoted? Is there noticeable discrimination based on gender or age?

In MySQL I created a view named Promotions that I will reference but views cannot be created in Python:

CREATE OR REPLACE VIEW Promotions AS
SELECT employees.emp_no, birth_date, first_name, last_name, gender, a.from_date AS hire_date, a.title AS hire_title, a.to_date AS promotion_date, b.title AS promoted_to, b.to_date AS end_date
FROM employees JOIN titles a ON employees.emp_no = a.emp_no JOIN titles b ON a.emp_no = b.emp_no
WHERE a.to_date = b.from_date AND a.title != b.title;

1. How many promotions were given? To how many employees?

### Finding all promotions
promotions = pd.read_sql_query("""
SELECT employees.emp_no, birth_date, first_name, last_name, gender, a.from_date AS hire_date, a.title AS hire_title, a.to_date AS promotion_date, b.title AS promoted_to, b.to_date AS end_date
FROM employees JOIN titles a ON employees.emp_no = a.emp_no JOIN titles b ON a.emp_no = b.emp_no
WHERE a.to_date = b.from_date AND a.title != b.title;
""", conn)
promotions.head()

	emp_no	birth_date	first_name	last_name	gender	hire_date	hire_title	promotion_date	promoted_to	end_date
0	10004	1954-05-01	Chirstian	Koblick	M	1986-12-01	Engineer	1995-12-01	Senior Engineer	9999-01-01
1	10005	1955-01-21	Kyoichi	Maliniak	M	1989-09-12	Staff	1996-09-12	Senior Staff	9999-01-01
2	10007	1957-05-23	Tzvetan	Zielinski	F	1989-02-10	Staff	1996-02-11	Senior Staff	9999-01-01
3	10009	1952-04-19	Sumant	Peac	F	1985-02-18	Assistant Engineer	1990-02-18	Engineer	1995-02-18
4	10009	1952-04-19	Sumant	Peac	F	1990-02-18	Engineer	1995-02-18	Senior Engineer	9999-01-01

Notice how some employees have been promoted more than once (emp_no = 10009)

### Number of promotions
pd.read_sql_query("""
SELECT COUNT(*)
FROM Promotions;
""", conn)

	COUNT(*)
0	143284

### Number of employees who received promotions
pd.read_sql_query("""
SELECT COUNT(DISTINCT(emp_no))
FROM Promotions;
""", conn)

	COUNT(DISTINCT(emp_no))
0	140270

We find that 143,294 promotions were given by this company to 140,270 employees. There are more promotions given than employees which leads into the next question.

2. Has any employee been promoted more than once? What is the maximum times an employee has been promoted?

### Finding employees who have been promoted more than once
df = pd.read_sql_query("""
SELECT emp_no, COUNT(emp_no) AS times_promoted
FROM Promotions
GROUP BY emp_no
HAVING COUNT(*) > 1
""", conn)
df.head()

	emp_no	times_promoted
0	10009	2
1	10066	2
2	10258	2
3	10451	2
4	10571	2

### Finding how many employees were promoted more than once
pd.read_sql_query("""
SELECT COUNT(*)
FROM
(SELECT emp_no, COUNT(emp_no) AS times_promoted
FROM Promotions
GROUP BY emp_no
HAVING COUNT(*) > 1) AS a;
""", conn)

	COUNT(*)
0	3014

### Finding the maximum number of promotions by a single employee
pd.read_sql_query("""
SELECT MAX(times_promoted)
FROM 
(SELECT emp_no, COUNT(emp_no) AS times_promoted
FROM Promotions
GROUP BY emp_no
HAVING COUNT(*) > 1) AS a;
""", conn)

	MAX(times_promoted)
0	2

We find that 3,014 employees have been promoted twice, and that no employees have ever been promoted three times.

3. How many employees were never promoted or left before being promoted?

### Finding all employees who have never been promoted or left before a promotion
df = pd.read_sql_query("""
SELECT employees.emp_no, birth_date, first_name, last_name, gender, hire_date, title, from_date, to_date
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no
HAVING COUNT(employees.emp_no) = 1;
""", conn)
df.head()

	emp_no	birth_date	first_name	last_name	gender	hire_date	title	from_date	to_date
0	10001	1953-09-02	Georgi	Facello	M	1986-06-26	Senior Engineer	1986-06-26	9999-01-01
1	10002	1964-06-02	Bezalel	Simmel	F	1985-11-21	Staff	1996-08-03	9999-01-01
2	10003	1959-12-03	Parto	Bamford	M	1986-08-28	Senior Engineer	1995-12-03	9999-01-01
3	10006	1953-04-20	Anneke	Preusig	F	1989-06-02	Senior Engineer	1990-08-05	9999-01-01
4	10008	1958-02-19	Saniya	Kalloufi	M	1994-09-15	Assistant Engineer	1998-03-11	2000-07-31

### Finding how many employees that have never been promoted or left before a promotion
pd.read_sql_query("""
SELECT COUNT(*)
FROM
(SELECT employees.emp_no, birth_date, first_name, last_name, gender, hire_date, title, from_date, to_date
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no
HAVING COUNT(employees.emp_no) = 1) AS a;
""", conn)

	COUNT(*)
0	159754

We find that 159,754 employees have never been promoted or left before a promotion.

4. What is the average time before an employee is promoted? Is there noticeable discrimination based on gender or age?

### Finding the time before an employee is promoted
df = pd.read_sql_query("""
SELECT emp_no, gender, TIMESTAMPDIFF(DAY, hire_date, promotion_date) AS days_to_promotion
FROM promotions;
""", conn)
df.head()

	emp_no	gender	days_to_promotion
0	10004	M	3287
1	10005	M	2557
2	10007	F	2557
3	10009	F	1826
4	10009	F	1826

### Grouping on gender and taking the average
pd.read_sql_query("""
SELECT gender, AVG(days_to_promotion)
FROM
(SELECT emp_no, gender, TIMESTAMPDIFF(DAY, hire_date, promotion_date) AS days_to_promotion
FROM promotions) AS a
GROUP BY gender;
""", conn)

	gender	AVG(days_to_promotion)
0	M	2473.6378
1	F	2469.7197

Let us visualize the distributions of days until promotion for males and females.

# Group by gender
df1 = df.groupby(['gender'])

# Plot density plots
sns.kdeplot(df1.get_group('F')['days_to_promotion'])
sns.kdeplot(df1.get_group('M')['days_to_promotion'])

<matplotlib.axes._subplots.AxesSubplot at 0x22ea5ff19b0>

png

We see they have roughly the same distribution and similar peaks, leading me to conclude that there is no discrimination between genders. Something to note is that some employees have the same number of days before a promotion (e.g. emp_no = 10005 and 10007) and the number of days is very close to a multiple of days in a year (365 days approximately). This leads me to believe that employees are promoted on or near their hire dates, most likely at their yearly reviews.

### Finding the age of employees when they were promoted
df = pd.read_sql_query("""
SELECT emp_no, gender, birth_date, promotion_date, TIMESTAMPDIFF(YEAR, birth_date, promotion_date) AS age_at_promotion
FROM promotions;
""", conn)
df.head()

	emp_no	gender	birth_date	promotion_date	age_at_promotion
0	10004	M	1954-05-01	1995-12-01	41
1	10005	M	1955-01-21	1996-09-12	41
2	10007	F	1957-05-23	1996-02-11	38
3	10009	F	1952-04-19	1990-02-18	37
4	10009	F	1952-04-19	1995-02-18	42

### Grouping on gender and taking the average
pd.read_sql_query("""
SELECT gender, AVG(age_at_promotion)
FROM 
(SELECT emp_no, gender, birth_date, promotion_date, TIMESTAMPDIFF(YEAR, birth_date, promotion_date) AS age_at_promotion
FROM promotions) AS a
GROUP BY gender;
""", conn)

	gender	AVG(age_at_promotion)
0	M	38.1015
1	F	38.0952

Let us visualize the distributions of ages of promoted employees for males and females.

# Group by gender
df2 = df.groupby(['gender'])

# Plot density plots
sns.kdeplot(df2.get_group('F')['age_at_promotion'])
sns.kdeplot(df2.get_group('M')['age_at_promotion'])

<matplotlib.axes._subplots.AxesSubplot at 0x1a2a4e077b8>

png

We see a much nicer and parametric distribution compared to the distribution of days until promotions. This bell curve distribution shape implies a Gaussian/normal distribution which we can test for. Assumption for a normality test such as the Shapiro-Wilk test are that observations are independent and identically distributed (iid) which is a fair assumption given observations are different employees for the most part. Some employees show up multiple times because they were promoted more than once (e.g. emp_no = 10009). We can obtain the first promotion for employees with multiple promotions by grouping on their employee id.

# Group on emp_no to obtain the age at first promotion for all employees
df = pd.read_sql_query("""
SELECT emp_no, gender, birth_date, promotion_date, TIMESTAMPDIFF(YEAR, birth_date, promotion_date) AS age_at_promotion
FROM promotions
GROUP BY emp_no;
""", conn)
df.head()

	emp_no	gender	birth_date	promotion_date	age_at_promotion
0	10004	M	1954-05-01	1995-12-01	41
1	10005	M	1955-01-21	1996-09-12	41
2	10007	F	1957-05-23	1996-02-11	38
3	10009	F	1952-04-19	1990-02-18	37
4	10012	M	1960-10-04	2000-12-18	40

# Testing for normality
stat, pF = stats.normaltest(df.groupby('gender').get_group('F')['age_at_promotion'])
stat, pM = stats.normaltest(df.groupby('gender').get_group('M')['age_at_promotion'])
{'gender': ['M', 'F'], 'p-values': [pM, pF]}

{'gender': ['M', 'F'], 'p-values': [0.0, 0.0]}

# Quantile-quantile (qq) plot of the residuals
res = stats.probplot(df2.get_group('F')['age_at_promotion'], plot = plt)

png

We see that our p-values for both genders is 0 so we reject our null hypothesis that the distribution is normal and conclude that our distribution is not Gaussian/normal. We can see from the above qq-plot that the tails of our distribution are non-existent. This means the company is not promoting (or hiring) anyone who is under a minimum age or above a maximum age (i.e. they do not promote (or hire) teenagers or people close to retirement age).

Past employees

Let’s take a look at the past employees at this company.

How many past employees does this company have?
Find the ages of new hires. What is the average hire age? Plot the distribution.
Find the ages of employees when they left the company. What is the average age of employees leaving the company? Plot the distribution.
Combine the previous two questions to find the time spent at the company by past employees. Any interesting observations?

1. How many past employees does this company have?

### Finding employees who have left the company. Current employees have to_date set to '9999-01-01'
df = pd.read_sql_query("""
SELECT employees.*, to_date
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no
HAVING MAX(to_date) < DATE '9999-01-01';
""", conn)
df.head()

	emp_no	birth_date	first_name	last_name	gender	hire_date	to_date
0	10008	1958-02-19	Saniya	Kalloufi	M	1994-09-15	2000-07-31
1	10011	1953-11-07	Mary	Sluis	F	1990-01-22	1996-11-09
2	10015	1959-08-19	Guoxiang	Nooteboom	M	1987-07-02	1993-08-22
3	10021	1960-02-20	Ramzi	Erde	M	1988-02-10	2002-07-15
4	10025	1958-10-31	Prasadram	Heyers	M	1987-08-17	1997-10-15

### Finding how many past employees there are
pd.read_sql_query("""
SELECT COUNT(DISTINCT(emp_no))
FROM
(SELECT employees.*, to_date
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no
HAVING MAX(to_date) < DATE '9999-01-01') AS a;
""", conn)

	COUNT(DISTINCT(emp_no))
0	59900

We find that 59,900 employees have left this company.

2. Find the ages of new hires. What is the average hire age? Plot the distribution.

### Finding the hire age of all employees
new_hire = pd.read_sql_query("""
SELECT employees.emp_no, birth_date, first_name, last_name, gender, hire_date, TIMESTAMPDIFF(YEAR, birth_date, hire_date) AS hire_age
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no;
""", conn)
df.head()

	emp_no	birth_date	first_name	last_name	gender	hire_date	to_date	leaving_age
0	10008	1958-02-19	Saniya	Kalloufi	M	1994-09-15	2000-07-31	42
1	10011	1953-11-07	Mary	Sluis	F	1990-01-22	1996-11-09	43
2	10015	1959-08-19	Guoxiang	Nooteboom	M	1987-07-02	1993-08-22	34
3	10021	1960-02-20	Ramzi	Erde	M	1988-02-10	2002-07-15	42
4	10025	1958-10-31	Prasadram	Heyers	M	1987-08-17	1997-10-15	38

### Finding the average hire age
pd.read_sql_query("""
SELECT AVG(hire_age)
FROM
(SELECT employees.emp_no, birth_date, first_name, last_name, gender, hire_date, TIMESTAMPDIFF(YEAR, birth_date, hire_date) AS hire_age
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no) AS a;
""", conn)

	AVG(hire_age)
0	30.9972

We find that employees are hired at an average age of approximately 31 years old.

### Plot the distribution of ages for new hires
a = plt.hist(new_hire['hire_age'], bins = (max(new_hire['hire_age']) - min(new_hire['hire_age'])))

png

new_hire

(array([ 1461.,  4410.,  7059.,  9319., 11666., 14004., 15477., 17389.,
        18849., 19995., 20882., 21984., 22357., 21700., 18577., 16190.,
        13483., 11326.,  9188.,  7339.,  5765.,  4323.,  3067.,  2037.,
         1266.,   631.,   280.]),
 array([20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32.,
        33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44., 45.,
        46., 47.]),
 <a list of 27 Patch objects>)

3. Find the ages of employees when they left the company. What is the average age of employees leaving the company? Plot the distribution.

### Finding the age of employees when they left the company
past_employees = pd.read_sql_query("""
SELECT employees.*, to_date, TIMESTAMPDIFF(YEAR, birth_date, to_date) AS leaving_age
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no
HAVING MAX(to_date) < '9999-01-01';
""", conn)
df.head()

	emp_no	birth_date	first_name	last_name	gender	hire_date	to_date	leaving_age
0	10008	1958-02-19	Saniya	Kalloufi	M	1994-09-15	2000-07-31	42
1	10011	1953-11-07	Mary	Sluis	F	1990-01-22	1996-11-09	43
2	10015	1959-08-19	Guoxiang	Nooteboom	M	1987-07-02	1993-08-22	34
3	10021	1960-02-20	Ramzi	Erde	M	1988-02-10	2002-07-15	42
4	10025	1958-10-31	Prasadram	Heyers	M	1987-08-17	1997-10-15	38

### Finding the average age of employees who left the company
pd.read_sql_query("""
SELECT AVG(leaving_age)
FROM
(SELECT employees.*, to_date, TIMESTAMPDIFF(YEAR, birth_date, to_date) AS leaving_age
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no
HAVING MAX(to_date) < '9999-01-01') AS a;
""", conn)

	AVG(leaving_age)
0	37.9825

We find that on average employees leave when they are approximately 38 years old.

### Plot the distribution of ages for employees who have left
a = plt.hist(past_employees['leaving_age'], bins = (max(past_employees['leaving_age']) - min(past_employees['leaving_age'])))

png

Let’s take a look at the distribution of ages for new and past employees.

### Both distributions
a = plt.hist(new_hire['hire_age'], bins = (max(new_hire['hire_age']) - min(new_hire['hire_age'])))
a = plt.hist(past_employees['leaving_age'], bins = (max(past_employees['leaving_age']) - min(past_employees['leaving_age'])))

png

We see that there are a lot more new hires compared to employees leaving the company. This could indicate that the company has grown significantly.

4. Combine the previous two questions to find the time spent at the company by past employees. Any interesting observations?

### Finding past employees of this company and their ages at hire and their last day
df = pd.read_sql_query("""
SELECT a.*, b.to_date, b.leaving_age, CONCAT(TIMESTAMPDIFF(YEAR, a.hire_date, b.to_date), ' years, ', MOD(TIMESTAMPDIFF(MONTH, a.hire_date, b.to_date), 12), ' months') AS time_at_company, TIMESTAMPDIFF(DAY, a.hire_date, b.to_date) AS company_days
FROM
(SELECT employees.emp_no, birth_date, first_name, last_name, gender, hire_date, TIMESTAMPDIFF(YEAR, birth_date, hire_date) AS hire_age
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no) AS a
INNER JOIN
(SELECT employees.*, from_date, to_date, TIMESTAMPDIFF(YEAR, birth_date, to_date) AS leaving_age
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no
HAVING MAX(to_date) < '9999-01-01') AS b ON a.emp_no = b.emp_no;
""", conn)
df.head()

	emp_no	birth_date	first_name	last_name	gender	hire_date	hire_age	to_date	leaving_age	time_at_company	company_days
0	10008	1958-02-19	Saniya	Kalloufi	M	1994-09-15	36	2000-07-31	42	5 years, 10 months	2146
1	10011	1953-11-07	Mary	Sluis	F	1990-01-22	36	1996-11-09	43	6 years, 9 months	2483
2	10015	1959-08-19	Guoxiang	Nooteboom	M	1987-07-02	27	1993-08-22	34	6 years, 1 months	2243
3	10021	1960-02-20	Ramzi	Erde	M	1988-02-10	27	2002-07-15	42	14 years, 5 months	5269
4	10025	1958-10-31	Prasadram	Heyers	M	1987-08-17	28	1997-10-15	38	10 years, 1 months	3712

Let’s take a look at the distribution of time past employees have worked at the company.

### Plotting the distribution with a line at the peak
p = sns.kdeplot(df['company_days'])
x,y = p.get_lines()[0].get_data()
index = np.argmax(y)
plt.vlines(x[index], 0, y[index])
plt.show()

png

# Summary statistics on days spent at the company
pd.read_sql_query("""
SELECT MIN(company_days), MAX(company_days), AVG(company_days)
FROM
(SELECT a.*, b.to_date, b.leaving_age, CONCAT(TIMESTAMPDIFF(YEAR, a.hire_date, b.to_date), ' years, ', MOD(TIMESTAMPDIFF(MONTH, a.hire_date, b.to_date), 12), ' months') AS time_at_company, TIMESTAMPDIFF(DAY, a.hire_date, b.to_date) AS company_days
FROM
(SELECT employees.emp_no, birth_date, first_name, last_name, gender, hire_date, TIMESTAMPDIFF(YEAR, birth_date, hire_date) AS hire_age
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no) AS a
INNER JOIN
(SELECT employees.*, from_date, to_date, TIMESTAMPDIFF(YEAR, birth_date, to_date) AS leaving_age
FROM employees JOIN titles ON employees.emp_no = titles.emp_no
GROUP BY employees.emp_no
HAVING MAX(to_date) < '9999-01-01') AS b ON a.emp_no = b.emp_no) AS c;
""", conn)

	MIN(company_days)	MAX(company_days)	AVG(company_days)
0	0	6388	2555.3828

We see that peak density occurs at approximately 1880 days. On average employees spend approximately 2555 days working at the company with the maximum being 6388 days (about 17 years, 5 months). Some employees have worked less than a full day.

Christopher Wong

Employee promotions and past employees