Prediction of Students' Grades Using Linear Regression Techniques¶

Introduction¶

In this notebook, linear regression is used to predict students' final grades. The workflow covers cleaning the data, fitting the model, and interpreting and presenting the results.

Objectives:¶

To accurately predict students' grades
To read and examine the data and visualize the results
To check the prediction accuracy

Linear Regression¶

import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # needed so sklearn.model_selection.train_test_split resolves
from sklearn import linear_model
from sklearn.utils import shuffle

Data prediction and Prediction Accuracy¶

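data = pd.read_csv("student-mat.csv", sep=";")
data = data[["G1", "G2", "G3", "studytime", "failures", "absences", "freetime", "age"]]
predict = "G3"  # target column: the final grade

# Features are every selected column except the target
x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])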

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
# score() returns the coefficient of determination (R^2) on the test set
acc = linear.score(x_test, y_test)

prediction = linear.predict(x_test)

# Print the predicted grade, the feature vector, and the actual grade for each test student
for i in range(len(prediction)):
    print(prediction[i], x_test[i], y_test[i])

print(acc)

6.260893784147028 [ 8  8  1  3  2  5 17] 10
11.90871667058993 [14 12  1  0  3  3 18] 12
3.197838171469524 [ 3  5  2  1  8  3 18] 5
8.601631556482669 [ 8  9  2  0  8  3 16] 10
7.620669206584592 [ 9  9  2  2 11  5 20] 9
12.392311876683758 [12 13  2  0  0  3 18] 13
18.94421938692939 [19 18  2  0  2  3 15] 18
7.381780254345536 [ 9  7  2  0 18  4 16] 6
6.926733332851889 [ 8  8  2  0  0  3 18] 0
12.964338509643078 [11 13  2  0  2  4 15] 14
15.394385637353194 [14 15  2  0  0  5 15] 15
6.321494795647119 [ 7  7  3  0  6  4 17] 7
13.656547943328508 [12 14  3  0  1  4 17] 15
11.41772128544735 [11 12  1  0  0  4 18] 10
14.07354342315056 [15 14  2  0  8  2 18] 14
8.126292426169435 [11  8  2  0  2  4 15] 8
6.423440138589809 [ 8  7  2  0  0  4 16] 8
10.502491232960324 [11 11  2  0  2  4 18] 11
15.09393410455178 [16 15  3  0  0  3 17] 15
8.24799567934133 [10  9  4  0  0  4 18] 0
14.352776515337933 [14 14  1  0  4  4 16] 14
16.067926612037056 [14 16  1  0  3  4 17] 16
14.879152308877408 [14 15  2  0  0  2 16] 16
4.788017887749154 [ 5  6  2  0  6  3 18] 6
8.07751492355839 [ 7  9  4  0  0  2 15] 0
20.374280467393405 [18 19  1  0 10  5 15] 19
10.46472286954164 [11 11  4  0  0  2 16] 11
9.388580609534642 [ 8 10  2  0  0  3 15] 12
17.31792561260945 [16 17  3  0  0  4 16] 17
11.222929182520827 [11 11  4  0  8  4 15] 10
9.04958094735283 [ 8  9  1  1 38  3 19] 8
6.979112497059594 [ 8  8  3  0  2  3 18] 10
14.82382086255844 [15 15  3  0  0  2 17] 15
12.745956380151783 [12 13  2  0  4  3 17] 13
6.116655282952377 [ 7  7  3  0  0  3 16] 8
4.0855415908414585 [ 6  5  1  1 14  4 18] 5
-1.8496254464063755 [ 4  0  1  2  0  3 16] 0
15.232796812345436 [14 15  2  0  4  2 15] 15
11.901510947013783 [12 12  1  0  2  3 16] 14
7.682749052084088 [ 6  9  1  2 14  2 16] 8
4.811484426092359 [ 8  6  2  2  2  3 15] 5
10.469419953133155 [ 8 11  2  0  0  4 15] 11
13.990660213381533 [14 14  3  0  4  3 17] 14
7.99947619070115 [ 9  9  1  2  8  3 16] 9
15.475524045641968 [17 15  1  0  2  2 16] 15
9.68368700703888 [11 10  2  0  0  3 16] 10
12.153210107539119 [11 12  2  0 12  3 16] 11
14.260910349046888 [15 14  2  1 20  4 19] 13
12.798983596065147 [10 13  4  0  6  3 15] 13
15.389031875729332 [17 15  1  0  4  2 17] 16
13.625581675207448 [14 13  1  0  8  3 15] 13
15.167925383545624 [14 15  2  0  0  3 15] 15
10.626303068222711 [ 9 11  2  0  0  4 15] 12
8.185674232153994 [ 8  9  2  0  4  4 18] 10
4.171718915617754 [ 6  5  1  0  7  4 18] 6
12.938295001927733 [10 13  1  0 12  4 17] 12
9.7540779522158 [12 10  2  0  2  3 17] 11
5.769983527699345 [ 6  7  2  0  0  1 16] 0
15.464776582530114 [15 15  2  0  2  5 16] 16
16.606183945509024 [17 16  2  0  0  3 15] 17
12.955866711154703 [11 13  4  0  6  3 15] 14
8.638303170126761 [ 8  9  1  0  8  3 16] 10
6.080645126464393 [ 9  8  1  3  6  2 18] 10
4.063783956056616 [ 6  5  2  0  4  3 17] 6
4.895859387587815 [ 7  7  2  2  4  2 19] 9
17.17905427848911 [16 17  2  0  0  4 17] 17
7.7540998533635825 [ 7  9  1  1  2  4 17] 8
9.390491165768143 [10 10  1  0  4  3 18] 10
9.382891768019896 [ 9 10  3  0  9  3 18] 9
15.707749567033513 [15 15  2  0 10  4 16] 15
7.696335160648589 [ 8  8  2  0  8  2 15] 6
9.634514502475996 [ 9 10  2  0  2  3 15] 10
13.376919849189296 [15 13  2  0  9  4 18] 15
20.196178911689813 [18 19  1  0  6  5 15] 19
10.51940626415495 [12 11  2  0  0  2 17] 12
5.91974464382894 [ 7  6  2  0 10  4 15] 6
9.04728774624044 [ 9  9  2  0  8  4 15] 9
14.296348319244375 [15 14  3  0  2  2 15] 15
3.6676553488463046 [ 6  5  1  3 16  5 17] 5
0.7829729137859951
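
The last value printed above (about 0.78) is what `LinearRegression.score` returns. For a regressor this is the coefficient of determination R², not a classification accuracy. The following is a minimal sketch on made-up numbers (the arrays are illustrative and are not taken from student-mat.csv) showing that `score` agrees with the manual formula 1 - SS_res / SS_tot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical, not taken from student-mat.csv)
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(50, 2))
y = 0.3 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(0, 1.0, size=50)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The same quantity as reported by scikit-learn
r2_sklearn = model.score(X, y)

print(r2_manual, r2_sklearn)
```

Because the train/test split in the cell above is random and no `random_state` is passed, the R² value will differ somewhat from run to run.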

import pandas as pd
import numpy as np
np.random.seed(42)

 
# Matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
%matplotlib inline

import matplotlib
matplotlib.rcParams['font.size'] = 16
matplotlib.rcParams['figure.figsize'] = (9, 9)

import seaborn as sns

from IPython.core.pylabtools import figsize

# Scipy helper functions
from scipy.stats import percentileofscore
from scipy import stats

# Standard ML Models for comparison
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

# Splitting data into training/testing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error

# Distributions
import scipy
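
As a rough illustration of how the error metrics imported above could be applied to grade predictions, here is a small sketch. The `y_true` and `y_pred` arrays are invented for the example and do not come from the model fitted earlier:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             median_absolute_error)

# Hypothetical actual and predicted grades (illustrative values only)
y_true = np.array([10, 12, 5, 10])
y_pred = np.array([8, 12, 4, 13])

mae = mean_absolute_error(y_true, y_pred)             # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # root of mean squared error
medae = median_absolute_error(y_true, y_pred)         # median of |error|

print(mae, rmse, medae)
```

All three are expressed in grade points, which makes them easier to interpret than R² when asking how far off a typical prediction is.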

Read and Examine Data¶

# Read in class scores
df = pd.read_csv("student-mat.csv",sep=";")

# Filter out rows whose final grade (G3) is 0 or 1
df = df[~df['G3'].isin([0, 1])]

df = df.rename(columns={'G3': 'Grade'})

df.head()

df.shape

(357, 33)
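
The filter-and-rename step above can be checked on a tiny frame. This is a sketch with invented values (the `toy` DataFrame is hypothetical), showing how `~isin([0, 1])` keeps only rows whose grade is neither 0 nor 1:

```python
import pandas as pd

# Hypothetical final grades, including the values the filter removes
toy = pd.DataFrame({"G3": [0, 1, 4, 10, 0, 15]})

# isin() marks rows equal to 0 or 1; ~ inverts the mask so those rows are dropped
kept = toy[~toy["G3"].isin([0, 1])]

# Rename the target column, as done for the real data
kept = kept.rename(columns={"G3": "Grade"})

print(kept)
```

Note that the original row index is preserved after filtering; `reset_index(drop=True)` could be used if a contiguous index is preferred.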

Description of Numerical Columns¶

df.describe()

Value Counts for Categorical Columns¶

# Print the value counts for categorical columns
for col in df.columns:
    if df[col].dtype == 'object':
        print('\nColumn Name:', col)
        print(df[col].value_counts())

Column Name: school
GP    315
MS     42
Name: school, dtype: int64

Column Name: sex
F    185
M    172
Name: sex, dtype: int64

Column Name: address
U    279
R     78
Name: address, dtype: int64

Column Name: famsize
GT3    250
LE3    107
Name: famsize, dtype: int64

Column Name: Pstatus
T    318
A     39
Name: Pstatus, dtype: int64

Column Name: Mjob
other       127
services     94
teacher      54
at_home      50
health       32
Name: Mjob, dtype: int64

Column Name: Fjob
other       196
services    100
teacher      26
health       18
at_home      17
Name: Fjob, dtype: int64

Column Name: reason
course        126
reputation     99
home           97
other          35
Name: reason, dtype: int64

Column Name: guardian
mother    248
father     82
other      27
Name: guardian, dtype: int64

Column Name: schoolsup
no     307
yes     50
Name: schoolsup, dtype: int64

Column Name: famsup
yes    219
no     138
Name: famsup, dtype: int64

Column Name: paid
no     184
yes    173
Name: paid, dtype: int64

Column Name: activities
yes    180
no     177
Name: activities, dtype: int64

Column Name: nursery
yes    286
no      71
Name: nursery, dtype: int64

Column Name: higher
yes    343
no      14
Name: higher, dtype: int64

Column Name: internet
yes    299
no      58
Name: internet, dtype: int64

Column Name: romantic
no     245
yes    112
Name: romantic, dtype: int64
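
An alternative to checking each column's dtype in a loop is to let pandas pick out the non-numeric columns first. The sketch below uses a small invented frame (`toy` is hypothetical) and `select_dtypes(exclude="number")`, which is one possible way to do this:

```python
import pandas as pd

# Hypothetical mix of categorical and numeric columns
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [15, 16, 17],
})

# Keep only the non-numeric columns
cat_cols = list(toy.select_dtypes(exclude="number").columns)

# One value_counts() Series per categorical column
counts = {col: toy[col].value_counts() for col in cat_cols}

for col in cat_cols:
    print("\nColumn Name:", col)
    print(counts[col])
```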

Grades Distribution¶

df['Grade'].describe()

count    357.000000
mean      11.523810
std        3.227797
min        4.000000
25%        9.000000
50%       11.000000
75%       14.000000
max       20.000000
Name: Grade, dtype: float64

df['Grade'].value_counts()

10    56
11    47
15    33
8     32
12    31
13    31
9     28
14    27
16    16
6     15
18    12
7      9
5      7
17     6
19     5
20     1
4      1
Name: Grade, dtype: int64
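
`value_counts()` orders its result by frequency, which is why the grades above are not in numeric order. When the counts are meant to be read as a distribution, sorting by the index can help. A minimal sketch with invented grades (the `grades` Series is hypothetical):

```python
import pandas as pd

# Hypothetical final grades
grades = pd.Series([10, 11, 10, 15, 8, 11, 10], name="Grade")

# Count each grade, then order the result by grade rather than by frequency
by_grade = grades.value_counts().sort_index()

print(by_grade)
```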

Visualization: Distribution of Final Grades¶

# Bar plot of grades
grade_counts = df['Grade'].value_counts()
# Use color= for the bar color; fill= expects a boolean, not a color name
plt.bar(grade_counts.index, grade_counts.values,
        color='navy', edgecolor='k', width=1)
plt.xlabel('Grade'); plt.ylabel('Count'); plt.title('Distribution of Final Grades');
plt.xticks(list(range(5, 20)));

Output of df.head():

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	...	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	Grade
0	GP	F	18	U	GT3	A	4	4	at_home	teacher	...	4	3	4	1	1	3	6	5	6	6
1	GP	F	17	U	GT3	T	1	1	at_home	other	...	5	3	3	1	1	3	4	5	5	6
2	GP	F	15	U	LE3	T	1	1	at_home	other	...	4	3	2	2	3	3	10	7	8	10
3	GP	F	15	U	GT3	T	4	2	health	services	...	3	2	2	1	1	5	2	15	14	15
4	GP	F	16	U	GT3	T	3	3	other	other	...	4	3	2	1	2	5	4	6	10	10

Output of df.describe():

	age	Medu	Fedu	traveltime	studytime	failures	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	Grade
count	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000	357.000000
mean	16.655462	2.795518	2.546218	1.431373	2.042017	0.271709	3.955182	3.246499	3.098039	1.495798	2.330532	3.549020	6.316527	11.268908	11.358543	11.523810
std	1.268262	1.093999	1.084217	0.686075	0.831895	0.671750	0.885721	1.011601	1.090779	0.919886	1.294974	1.402638	8.187623	3.240450	3.147188	3.227797
min	15.000000	0.000000	0.000000	1.000000	1.000000	0.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	0.000000	3.000000	5.000000	4.000000
25%	16.000000	2.000000	2.000000	1.000000	1.000000	0.000000	4.000000	3.000000	2.000000	1.000000	1.000000	3.000000	2.000000	9.000000	9.000000	9.000000
50%	17.000000	3.000000	3.000000	1.000000	2.000000	0.000000	4.000000	3.000000	3.000000	1.000000	2.000000	4.000000	4.000000	11.000000	11.000000	11.000000
75%	18.000000	4.000000	3.000000	2.000000	2.000000	0.000000	5.000000	4.000000	4.000000	2.000000	3.000000	5.000000	8.000000	14.000000	14.000000	14.000000
max	22.000000	4.000000	4.000000	4.000000	4.000000	3.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	75.000000	19.000000	19.000000	20.000000