Python Projects for Beginners | Python Projects | Intellipaat

Hey guys, welcome to the session by Intellipaat. So, Python is a versatile language which is used for multiple
purposes such as web development, machine learning, and deep learning, and the best
way to get expertise in Python is to work on hands-on projects. So today, we
have come up with this capstone project on Python so that you get a complete
understanding of the data science life cycle.
So, before we start off with the class, do subscribe to our channel so that you get
a notification of our next video. So, let’s go through the agenda. We’ll
start off by implementing some data manipulation operations with the pandas
library, and then we’ll use matplotlib to visualize the underlying data, after
that we’ll implement some machine learning algorithms using Scikit-learn.
Finally, there’ll be a quiz to recap what we have learnt in today’s session.
So, do put down your answers in the chat box so that you know if you have answered it
correctly. Also, you can put down all of your queries in the comment section, we’d love to help you out. So without much delay, let’s start off with the
class. Consider yourself to be a Data Scientist at a prestigious telecom
company, and the name of that company is ‘Neo,’ and they’re facing a major problem
and that basically is that their customers are churning out to other competitors. Now,
you as a Data Scientist at that particular company, have to make sure to
stop this churning out and also find out the reasons why customers are
churning out to other companies. So, this is the problem statement. What
you’re basically going to do is a bit of data manipulation, data visualization
operations, and then you’ll go ahead and build the ML algorithms on top of this
dataset. You will start off with the linear regression
algorithm and then you’ll build logistic regression, and random
forest algorithms. We will be working with the customer_churn dataset. So this is our dataset, which comprises all of these columns. So
we’ve got customer ID, gender, senior citizen, partner, dependents, and so on.
So, this is our first task, data manipulation. To do this, let’s
go ahead and actually import all of our libraries. So I’ll just type in import pandas as pd, and then I’ll also load the
NumPy library, so I’ll type import numpy as np. I’d also need the matplotlib
library. So I’ll type in from matplotlib import pyplot as plt.
So, these are all my required libraries. I’ll just wait till these
libraries are loaded. This is done. I’ll also go ahead and load up my
customer_churn data frame, and I’ll store it into an object and name that
object customer_churn. I’ll just use pd.read_csv, and inside this I will give the name of
the file, which is customer_churn.csv. So I have successfully loaded
the file and I’ve stored it into a new object and also have named that new
object customer_churn. Now I’ll go ahead and have a glance at the
head of this. So customer_churn.head(), and this is the dataset on which we’ll be implementing all of the operations. So
this is just the unique customer ID. This column gender tells us about the gender
of the customer whether the customer is male or female.
The senior citizen column tells us whether the customer is a senior citizen
or not. So if it’s zero then the customer is not a senior citizen if it’s one then
the customer is a senior citizen. And this tells us if the customer has a
partner or not. This column tells us if the customer has dependents or not. This is
the tenure of the customer in months. It’s 1 month, 34 months, 2
months, and so on. This tells us if the customer has phone service or not. Does
the customer have multiple lines, and this is the type of Internet service
used, and then whether the customer has online security, whether he has device
protection, tech support, streaming TV, streaming movies, and so on, and then this
column is for contract. So this just tells you the contract type of the
customer. So, the contract type of the customer could either be month-to-month,
one year, or two years and there’s a type of billing whether it is paperless or
not. After this, we have the type of payment method. So the type of payment
method could be electronic check, mailed check, bank transfer, and so on. These are the monthly charges and total charges incurred by the customer. So now let’s start off with our data
manipulation tasks. So this is our very simple task. We just have to extract some
individual columns from the entire data frame. We will have to extract the fifth column
and store it in customer_5. So let me do that. I’ll type in the name of the data frame first, which is customer_churn, and I would use .iloc. I would need all of
the rows, and I would need the fifth column. Since
the indexing starts from 0, we count 0, 1, 2, 3, and 4, so index 4
would be my fifth column over here, and I will store this in,
let’s say, c_5, and I’ll have a glance at the head of this with c_5.head(). So we have successfully extracted
the fifth column from this entire data frame and this is the head of it.
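The loading and column-extraction steps above can be sketched roughly as follows. This is only a minimal sketch: customer_churn.csv itself is not part of this transcript, so a two-row, made-up CSV string stands in for it, and the exact column spellings (customerID, SeniorCitizen, and so on) are assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for customer_churn.csv: two made-up rows with the
# first six columns in the order described in the session.
csv_text = """customerID,gender,SeniorCitizen,Partner,Dependents,tenure
0001-A,Female,0,Yes,No,1
0002-B,Male,1,No,Yes,34
"""

# With the real file this would be pd.read_csv("customer_churn.csv").
customer_churn = pd.read_csv(io.StringIO(csv_text))
print(customer_churn.head())

# All rows (:) and the fifth column, which sits at position 4
# because positions start at 0.
c_5 = customer_churn.iloc[:, 4]
print(c_5.head())
```

With the full dataset, the same iloc call with position 14 would pick out the fifteenth column.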
Similarly, we will have to extract the fifteenth column and store it in
customer_15. So if it’s the fifteenth column, then the index number
would be 14, because again the indexing starts from zero. Right, I’ll store this in
c_15, and I’ll also make this c_15.head() and click on
run. Right, so this is the streaming movies column, and I get the same thing
over here. This is column number 15, and I’ve extracted only this particular
column from the entire data frame. All right, this was a basic extraction of
columns from the entire data frame. After that, we’d have to do data extraction on
the basis of a condition. We’ll have to extract all the male senior citizens whose
payment method is electronic check. Right, so there are three conditions over here.
The first condition is that the gender of the customer needs to be male. The second
condition is on senior citizen: the value of senior citizen needs to be equal to 1.
And the third condition is that the payment method needs to be equal to electronic
check. So I’ll give in all of these three conditions over here. I will start off by
giving the first condition. So it’s customer_churn, I’ll type in gender, and the gender
needs to be equal to male. Right, I will cut this and paste this inside parentheses. After
this, I’ll use the and operator (&) and then give in the second condition. So the second
condition would go something like this: customer_churn, and after this we’d have
to set the senior citizen value. Right, so senior citizen, and this needs to be
equal to one. And then I’ll go ahead and give in the third condition.
So over here, the third condition is for the payment method column. So the
payment method needs to be equal to electronic check. All right, so I have all of my three
conditions over here. I’ll just insert a cell below. Now what I’ll do is I will
copy all of these three conditions, and I will paste them inside the square brackets of customer_churn, and I will
store it into a new object and name the object c_random. Right, now I will
print the head of this: c_random.head().
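As a rough illustration of combining three conditions with the and operator, here is a small sketch. The four rows are made up, and the column names and value spellings (PaymentMethod, "Electronic check") are assumptions about how the real file spells them.

```python
import pandas as pd

# Made-up rows; only row 0 meets all three conditions.
customer_churn = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Male"],
    "SeniorCitizen": [1, 0, 1, 1],
    "PaymentMethod": ["Electronic check", "Electronic check",
                      "Electronic check", "Mailed check"],
})

# Each condition goes in its own parentheses because & binds more
# tightly than ==. A row is kept only if all three are true.
c_random = customer_churn[
    (customer_churn["gender"] == "Male")
    & (customer_churn["SeniorCitizen"] == 1)
    & (customer_churn["PaymentMethod"] == "Electronic check")
]
print(c_random.head())
```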
So if you have a glance at this gender column, you’ll notice
that all of the values are male. Similarly, if you have a glance at the senior
citizen column, then all of these values are one. Similarly, if I go to the
payment method column, then you’ll notice that all of these values are electronic
check. So I’ve given three conditions over here, and all of these conditions
are being satisfied. So next up, we have to extract all of
those customers whose tenure is greater than 70 months or whose monthly charges are
greater than 100 dollars. Okay, again we’ll do the same thing. The customer_churn tenure needs to be
greater than 70. I’ll put this inside parentheses. Now what
you need to note over here is that we are using the or operator (|), so it’s
either the first condition or the second condition. Right, so if one of these
conditions is true, then we’ll get that particular record.
So again, customer_churn, and this time it’s the monthly charges. So the
monthly charges have to be greater than 100. These are the two conditions: so
either the tenure of the customer needs to be greater than 70 months, or the
monthly charges of the customer need to be greater than $100.
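The either/or filter can be sketched in the same way. Again the rows are made up, and the column names tenure and MonthlyCharges are assumed spellings.

```python
import pandas as pd

# Made-up rows: row 0 passes on tenure only, row 1 on charges only,
# row 2 on neither, row 3 on both.
customer_churn = pd.DataFrame({
    "tenure": [72, 5, 10, 71],
    "MonthlyCharges": [20.0, 105.5, 50.0, 110.0],
})

# With | a row is kept as soon as at least one condition is true.
c_random = customer_churn[
    (customer_churn["tenure"] > 70)
    | (customer_churn["MonthlyCharges"] > 100)
]
print(c_random.head())
```

This is why a record with a short tenure can still show up in the result, as long as its monthly charges are above 100.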
Again, I’ll insert a cell below, and all I’ll do is cut this and paste it inside
this. Right, and I will store this again in
c_random. Now let me again print the head of this,
c_random.head(). Right, now I’ll head on to the tenure
column. Now you see that none of the values of the
tenure over here is greater than 70, but if I go to the monthly charges column,
then you’ll notice that the monthly charges are greater than 100. So it’s
either/or: one of these conditions has to be true, and in this case we see that
the second condition is true over here. Right, so if either of the conditions is
true, then we’ll get that entire record. After this, we’d have to extract all those
customers whose contract is of two years, payment method is mailed check, and value of
churn is yes. All right, I’ll again copy this and paste this over here. Let me
delete this from here. Right, so we have these conditions over here. So we’ll have
to extract those customers whose contract is of two years, payment method
is mailed check, and churn is equal to yes. So let me put in all of these
conditions over here. First, this is contract, and this needs to be equal to two year.
So let me just check how it is written: it’s two, space, year. After
that, it’s the second condition, and over here the payment method needs to be
equal to mailed check. Again, I’ll put in the double equal to operator (==), I’ll give
in the value, and the value is equal to mailed check. And then I’ll give in the final
condition over here. So customer_churn, and this time the
churn needs to be equal to yes.
Right, so I’ve given all of these three conditions, and I’ve separated them using
the and operator. Again, I will store this in c_random. Let me print this out: c_random.
So we see that there are just three records, or there are just three
customers, who satisfy all of these three conditions. So let me have a look at the
contract of these. Right, so the contract is of two years for all of these three
customers. Next is the payment method, and again the payment method is mailed check. And for
churn, all of these values are yes. Right, so out of all of those seven
thousand rows, there are only three rows which satisfy these three conditions. Next, this is a question on random
sampling. So we just have to extract 333 random records from the entire data
frame, and to do this we’ll be using the sample function. So I’ll type in
customer_churn, and I’ll use the sample method over here, and inside this all I’ll do is
give the value of the number of records I want to sample. So I want 333
records. I’ll store this in c_333.
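A minimal sketch of the sampling step, using a made-up 1,000-row data frame in place of the real one. The random_state argument in the second call is not something the session uses; it is shown only as an optional way to get the same sample on every run.

```python
import pandas as pd

# Hypothetical data frame with 1,000 made-up rows.
customer_churn = pd.DataFrame({
    "customerID": [f"id-{i}" for i in range(1000)],
    "tenure": [i % 72 for i in range(1000)],
})

# Draw 333 rows at random, without replacement. Each run of this line
# will generally pick a different set of rows.
c_333 = customer_churn.sample(n=333)
print(c_333.head())

# Optional: fixing random_state makes the draw repeatable.
c_fixed = customer_churn.sample(n=333, random_state=1)
```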
Let me print out the head of it. Right, so now notice that whenever I
run this, every time I get a different sample of 333 records. I’ll run this, and I’ll
run this a second time. Right, so if you have a glance over here, all of these values
would be changing. The customer IDs would be changing, the indexes would be
changing. So keep a glance at these row IDs, or indexes, over
here. Right, so again, if you have a glance at this, these are changing. So
this is random sampling. I’m randomly sampling 333 records from this entire
data frame over here, and I’m doing that with the help of the sample method. Right, and then this is the final
operation when it comes to data manipulation. So I’d have to get the
count of different levels from the churn column. So if I want to get the
count of the different levels present in a categorical column, I have the value_counts
method. So first I will give in the name of the data frame, which is
customer_churn, after that I will give in the name of the column, which is churn,
and then I’ll just type in value_counts(). Right, so this is it.
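Counting the levels of a categorical column might look like this. The five rows are made up, and Churn and Contract are assumed column spellings.

```python
import pandas as pd

# Made-up rows standing in for the real data.
customer_churn = pd.DataFrame({
    "Churn": ["No", "No", "Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "Two year", "Month-to-month",
                 "One year", "Month-to-month"],
})

# value_counts returns one count per distinct level, largest first.
churn_counts = customer_churn["Churn"].value_counts()
print(churn_counts)

# The same call works for any other categorical column.
contract_counts = customer_churn["Contract"].value_counts()
print(contract_counts)
```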
So let me just wait till I get the result. Right, so we see that the
number of customers who will not be churning out is 5,174,
and the number of customers who will be churning out is
1,869. So you can do the same thing for other
categorical columns as well. So let’s say I want to get the
counts of the different levels
for, let’s say, the contract column. I’ll
just change the name of the column over here, so I’ll put it to be equal to
contract. Right, so there are 3,875 customers whose contract
type is month-to-month, there are 1,695
customers whose contract is of two years, and there are 1,473
customers whose contract is of one year. So these were some
basic data manipulation operations. After this, we’ll head on to data
visualization. Right, so here we’ll have to create a
simple bar plot for the internet service column, and we’ll have to set the
x-axis label to “Categories of Internet Service”, the y-axis label to “Count of
Categories”, the title of the plot to “Distribution of Internet Service”, and the
color of the bars needs to be orange. Right, so I’ll just type in plt.bar
over here. Now I’ll insert another cell over here. Now,
plt.bar basically takes in two parameters. The first
parameter is the names of all of the bars, and the second parameter is the
values for those bars. Right, so the names of the bars would be coming from the
internet service column. So internet service, and I actually want the value
counts of this, so value_counts(), and from this I don’t want the values, I just need
the keys, so keys(), and again I’d have to convert them into
a list, so I’ll use tolist() over here. I’ll click on run. Let’s see what we
get. Right, so these are the three levels present in the internet service column.
So from this internet service column, what I’ve done is I’ve used the value_counts
method. So this value_counts result has two things, keys and values.
Now I don’t want the values as such, I just want the keys, and I’ll take these keys
and I’ll convert these keys into a list. Right, so this is the list of the names
present in this internet service column. Now I’ll cut this and paste it over here,
and this would be my first parameter, and my second parameter would be all of the
values. And if I want all of the values, I’ll just remove the keys method from
there. Right, so this over here would give me all of the values present with
respect to this internet service column. Right, so fiber optic, or in other words
the number of customers whose internet service is fiber optic, is 3,096, the number of
customers whose internet service is DSL is 2,421, and the number of customers who
don’t use the internet service is 1,526. Right, now again
I’ll cut this and I’ll paste it over here. Now let me print this out. So this
is my basic bar plot over here. Right, so this is my bar plot. On the x-axis I have
the names representing these bars over here. Right, so this bar is for all the
customers whose internet service is fiber optic, this is for those customers whose
internet service is DSL, and this is for those customers who are not availing the
internet service, and these are their counts over here, present on the y-axis.
Now I have to do some other things over here. I have to change the color of
the bars. So the color of the bars was supposed to be set to orange, and for
this I have the color parameter, and I’ll just set it to be equal to orange. I’ll
run this again. Right, so we have successfully changed the color of these
bars over here. Now I’d have to set the x-axis label and the y-axis label. So
I’ll just type in plt.xlabel, and the x label needs to be equal to
“Categories of Internet Service”. Let me type that down:
categories of internet service. After that, I would need to put in the
label for the y-axis. This would be plt.ylabel, and
this is supposed to be equal to count. And then finally I’d have to give in
the title, so plt.title, and I will set the title over here, and the title needs
to be equal to “Distribution of Internet Service”.
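Put together, the bar plot steps could be sketched as below. The six rows are made up, InternetService is an assumed column spelling, and the Agg backend line is an assumption added only so the sketch runs without a display; in a notebook it would not be needed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not part of the session
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the real InternetService column.
customer_churn = pd.DataFrame({
    "InternetService": ["Fiber optic"] * 3 + ["DSL"] * 2 + ["No"],
})

counts = customer_churn["InternetService"].value_counts()
names = counts.keys().tolist()   # bar names: the distinct levels
values = counts.tolist()         # bar heights: the count of each level

plt.figure()
bars = plt.bar(names, values, color="orange")
plt.xlabel("Categories of Internet Service")
plt.ylabel("Count")
plt.title("Distribution of Internet Service")
```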
Let me type it out: distribution of internet service, and run it.
Correct, so this is the final bar plot, which gives us the distribution of
internet service, and the x-axis label is categories of internet service and
the y-axis label is count. Right, so this is how we can create a simple bar plot.
And you know, all of these are the basic steps before
you go ahead and build all of your machine learning algorithms. So the
data pre-processing part is always the main part of your
data science life cycle. This is where you properly comprehend your dataset, this
is where you understand the structure of the dataset, you understand the
correlation between all of the columns, you know the correlation between the
dependent variable and the independent variable. So by manipulating the dataset
and visualizing the structure of the dataset, this is where you understand all the
patterns in the dataset and you get insights from the dataset. All right, so next up we have to build a
histogram for the tenure column. So again, it’ll be a similar operation:
plt.hist, and I’d have to build a histogram for
the tenure column, so it’ll be customer_churn tenure, and I’d have to set the
number of bins to be equal to 30, and I’d have to set the color to be equal to
green. So I’ll type in color equals green over here.
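The histogram call described here might look like the following sketch. The tenure values are made up (every value from 0 to 72, twice), and the Agg backend line is again only an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not part of the session
import matplotlib.pyplot as plt
import pandas as pd

# Made-up tenure values in months.
customer_churn = pd.DataFrame({"tenure": list(range(73)) * 2})

plt.figure()
# bins=30 splits the tenure range into 30 equal-width intervals.
n, bins, patches = plt.hist(customer_churn["tenure"], bins=30, color="green")
```

Here n holds the count in each bin, so the tall bars at either end of the real plot correspond to large values at the start and end of n.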
So this is our histogram, and this gives us the distribution of the tenure of the
customers. So if you look at it closely, the y-axis is basically the count
over here. Right, so there are around 800-odd
customers whose tenure is not even one month, so they are churning out
before they even complete one month. And again, there’s a huge peak over here, so there
are more than 600 customers whose tenure is 70 months or more than 70
months. In between, it is pretty much the same range. So
the normal range of the customer count is between 200 and 400, and the
average tenure of the customer, you can say, would be somewhere between 20
or 30 months and 60 months. Right, and this is where you have the peaks: you have
a peak at the start and you have a peak at the end.
Again, let’s go ahead and add the title. So it’ll be just plt.title over here,
and the title of the plot needs to be equal to “Distribution of Tenure”. I’ll
run it. All right, so I have created this plot,
and this is the title of the plot, which is distribution of tenure. So we’ve made
a bar plot and a histogram. Now you guys also need to understand the
difference between a bar plot and a histogram. So a bar plot is normally used
for categorical columns. So whenever you want to understand the
distribution of a categorical column, that is when you go with a bar plot, and when
you want to understand the distribution of a continuous numerical column,
that is when you go with a histogram. Right, so next up we’d have to create a
scatter plot between monthly charges and tenure. So tenure is on the x-axis and
monthly charges is on the y-axis. So plt.scatter, and the
x-axis would be tenure, so customer_churn tenure, and then I’ll set
the column for the y-axis. This will be customer_churn, and this will be equal to
monthly charges. Let me just run this. Right, so this is what we get over here.
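A rough sketch of the scatter plot, with tenure on the x-axis and monthly charges on the y-axis, plus the axis labels and title that this task asks for. The five data points are made up, the column spellings are assumptions, and the Agg backend line is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not part of the session
import matplotlib.pyplot as plt
import pandas as pd

# Made-up points standing in for the real columns.
customer_churn = pd.DataFrame({
    "tenure": [1, 12, 24, 48, 72],
    "MonthlyCharges": [29.85, 56.95, 70.70, 99.65, 104.80],
})

plt.figure()
# First argument goes on the x-axis, second on the y-axis.
points = plt.scatter(customer_churn["tenure"], customer_churn["MonthlyCharges"])
plt.xlabel("Tenure")
plt.ylabel("Monthly Charges")
plt.title("Monthly Charges vs Tenure")
```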
Now let me also set in the labels. So plt.xlabel, and this would be
equal to tenure. Let me type in tenure over here.
Similarly, I’ll also set the y label over here. So this will be plt.ylabel, and
this would be equal to monthly charges. Right, so now we also get the
corresponding x- and y-axis labels. After this, I’ll also go ahead and set the
title, so it’ll be plt.title, and this would be “Monthly Charges vs
Tenure”. Right, so
this is our final scatter plot, where we have the x-axis and y-axis labels, and
this is the title, which is monthly charges versus tenure. And finally, we’d have to also build a
box plot between the tenure column and the contract column. So tenure needs to be
on the y-axis and contract needs to be on the x-axis.
For this, I’ll just type in customer_churn.boxplot, and I’ll set by to be equal to
customer_churn contract,
and after this I have the column to be equal to customer_churn, and this needs
to be equal to tenure. Let’s see. Well, there’s an error here:
columns not found. So let me actually remove customer_churn from over here, and let’s see
what happens. All right, so now we get the result. So we
had actually already given the name of the data frame at the beginning itself, so this is
customer_churn.boxplot, and now all you have to do is assign the contract on the x-axis.
So now when I set by equal to contract, what is happening is I have one box
plot each for the different levels of the contract column. So I have one box plot
for the month-to-month level, I have another box plot for the one-year level,
and another box plot for the two-year level. And over here, the y-axis
is being determined by the tenure column. Right, so this over here, 0 to 70,
this is the tenure of the customer. And what we understand from this box plot
over here is that if the contract of the customer is of two years, then most
probably the median tenure of the customer is very high. So if the contract
of the customer is of two years, then his median tenure would be around
65 months. Similarly, if the contract of the customer is of one year, then the median
tenure of the customer would be around 45 months, and then if the contract of
the customer is month-to-month, then the median tenure of the customer would be
around 10-odd months. Right, so these were all of the examples
of visualization. Now it’s finally time to head on to machine learning. Right, so
this was your data pre-processing part, where you understood the structure
of the data, you learned how to extract individual columns, and after
that you learned how to visualize the data and get some interesting
insights from the structure of the data.
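The box plot of tenure by contract type discussed above can be sketched like this. The nine rows are made up (chosen so the group medians are easy to read), the column spellings are assumptions, and the Agg backend line is only there so the code runs without a display. The groupby line is an extra, not from the session, that prints the medians the box plot shows as the middle line of each box.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not part of the session
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows: three customers per contract type.
customer_churn = pd.DataFrame({
    "Contract": ["Month-to-month"] * 3 + ["One year"] * 3 + ["Two year"] * 3,
    "tenure": [2, 10, 15, 30, 45, 50, 60, 65, 72],
})

# Column names are passed as plain strings, because the data frame is
# already named before .boxplot. One box is drawn per level of Contract.
customer_churn.boxplot(column="tenure", by="Contract")

# Extra check (not in the session): the median tenure of each group.
medians = customer_churn.groupby("Contract")["tenure"].median()
print(medians)
```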
insights from the structure of the data we’ll start off with our first machine
learning algorithm which would be linear regression over here and linear
regression as you already know so over here though or dependent variable would
be a numerical column and a basically trying to understand how does one
variable change with respect to another variable and over here we’ll have to
build a simple linear model where our dependent variable is monthly charges
and the independent variable is equal to ten you’re right so or in other words we
are basically trying to understand how does monthly charges vary with respect
to any are so monthly charges dependent variable then your is the independent
variable and these are all of the subsets when it comes to this linear
model so we’ll start off by dividing the data set into 70/30 split and then we’ll
build the model on the train set break the values on the test set after that
we’d have to find out the root mean square error and he will have to print
out the true mean square error so let me go ahead and import the linear
regression model from ASCII loan so I’ll type in from a scale on import linear
model after this all happened from a scale on dot linear model import I need
linear regression right so these are my two basic like B so I need a linear
model and use linear regression now I would also require the Train test split
so I’ll happen from a scalar n– dot model selection import train test split
Right, so this train_test_split method would help me to divide my dataset
into training and testing sets. So now it’s time to divide my data into training
and testing sets. So before that, I’d have to get my target and the features, or in
other words, I have to separate my dependent variable and the independent
variable. So monthly charges is the dependent variable, so what I’ll do is type
y equals, and I’ll extract only the monthly charges column,
and I will store it in a new variable and name that variable y.
Similarly, I’ll only extract the tenure column. Right, so I’m extracting the monthly
charges column, and I am storing it into a new object, naming the object
y. Similarly, I am extracting only the tenure column, and I am storing
that column in x, from customer_churn. What seems to be the problem over here?
Monthly charges: let me put it to be a capital C over here.
Right, now let me print out the head of these
two: y.head() and x.head().
Right, so these are the values from the monthly charges column, and these are the
values from the tenure column. Now let me go ahead and divide these two into
training and testing sets. So I’ll use train_test_split. I’ll pass in x as
the first parameter, so all the features, which are stored in x, go in as
the first parameter. After that, I’d have to give in the target labels, and the
target labels, which are basically my monthly charges, are stored in y. And
then finally I’d have to give in the test size.
So let me check what the test size was. So the test size was supposed to be
0.70, and then I’ll also set a random state. So if I want to use these values
again, I can just set the random state to be equal to the same value which I’m
giving over here. Right, so this test size of 0.70
basically means that 70% of the records would go into the testing set. Oh,
this has to be 0.30, sorry for that.
Yeah, so 30% of the records would go into the test set, and the rest, 70
percent of the records, would go into the training set. Now I’ll be getting four
results over here, and those four results are x_train, y_train, x_test,
and y_test. These are actually the names which we conventionally use. I’ll
explain what these are exactly. So x_train, y_train, and then, actually,
this will be x_test first: x_train, x_test, y_train, and y_test. So
your x_train represents all of those values of your
features which are present in the training set. x_test represents all of
the features which are present in the test set. y_train represents all of the
dependent values which are present in the train set, and y_test represents
all of the dependent values which are present in the test set. And whenever we
are building a model, we’ll build that model on top of the train set. Right, so
we’ll build the model on top of x_train and y_train. Let me also show you the
shape of all of these. So x_train.shape.
Let me do the same for the rest: instead of x_train, I’ll make this
y_train, I’ll make this x_test, and I’ll make this y_test. Right, so x_train and y_train: the
training set has these many records, and the testing set has these many records over
here. Right, so these are all of the features which are present in the
training set, and these are all of the features which are present in the test
set. These are all of the target values of the dependent variable when it
comes to the training set, and these are all of the target values when it comes to
the test set. Correct, so now that we have
x_train, y_train, x_test, and y_test, we’ll build the model on top of the
training data.
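The split described above can be sketched as follows. The 20 rows are made up, the double square brackets (which keep x and y as one-column data frames) and random_state=0 are assumptions, since the transcript does not show exactly what was typed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 customers with tenure 1..20 months.
customer_churn = pd.DataFrame({
    "tenure": list(range(1, 21)),
    "MonthlyCharges": [20.0 + 2.5 * t for t in range(1, 21)],
})

y = customer_churn[["MonthlyCharges"]]  # target (dependent variable)
x = customer_churn[["tenure"]]          # feature (independent variable)

# test_size=0.30 sends 30% of the rows to the test set and the other
# 70% to the training set. Note the order of the four results.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=0
)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```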
So normally your training data would be bigger, because, let’s say, your split is
either 70/30, 65/35, or 80/20, because the more data you give for training, that
is better. But then again, you can’t give out your entire data for the training
set. Right, so the purpose of training your model is to make sure that your
model learns the underlying patterns of that data, and once the learning is done,
you’d have to also test how well the learning is done. Right, and to test,
you’ll also need a sample space for that test set. So consider this simply:
let’s say you’re giving an exam, and for that exam, let’s say you’ve got a
hundred exercises. So your syllabus comprises all of the hundred
exercises, and you’d have to learn all of those 100 exercises, but when it comes to
your test, it will have only ten exercises from all of those hundred
exercises. Right, so the training needs to be done, but then again, the test space
needs to be completely new, which is not learned by the model
during the training phase. Right, that is why the training set and the test set have to be completely
different. And this
division into training and testing sets is done to make sure that overfitting
doesn’t go unnoticed. And when overfitting happens, the problem is this model will
perform well on this particular dataset, but when a new dataset comes in, it’ll
fail miserably, which is the reason why we divide the data into training and
testing sets. Right, so now let me go ahead and create
an instance of the linear regression model, and I’ll name that
regressor. So I’ve created an instance of linear
regression over here, and I’ll go ahead and fit the model on
top of the training set, so it’ll be x_train
and y_train. Right, so I fit the model on top of the
training set. Now it’s time to predict the values, so it’ll be regressor.predict,
and I’ll be predicting the values on top of x_test, and I will store this in,
let’s say, y_pred. Now I’ve fit the model on top of the
training set, and I have also predicted the values on top of the test set. Now
I’d have to know how well the prediction has been done, and for this, when it
comes to linear regression, we have something known as the root mean square
error. So the lower the value of the root mean square error, the better your model.
And again, we have an inbuilt method to calculate the mean squared error, so I
just have to import sklearn.metrics, and from sklearn.metrics
I’ll be importing mean_squared_error. After this, what we
actually want is the root mean squared error, so I would need np.sqrt,
and let me use mean_squared_error inside it. This takes in two parameters.
The first parameter comprises the actual values, which are present in y_test, and the
second parameter is the predicted values, which are present in y_pred. So
I will cut this and let me paste this in over here. Right, so we get a root
mean squared error value of 29.39. Right, so now let’s say
you build some other model with some other independent variables. Now
we’ll go ahead and build that model and we’ll also predict the values. Now after
predicting the values, we also have to calculate the root mean square error. The
lower the root mean square error, the better. So
let’s say for some other model, model 2, the root mean square error is 39; then this model would be better than the
second model. Similarly, if there is, let’s say, a model 3 whose root mean square
error is 19, then that model 3 would be better than this model and model
2. So here we are predicting the monthly charges values, that is exactly right. So
let me actually show you that. So let’s say I put in y_pred, and let’s say
I’ll have a glance at the first five values. Right, so these are the monthly
charges predicted. Let me also show you the first five values of y_test. Okay.
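The fit, predict, and root mean square error steps can be sketched end to end like this. The data is made up, so the error printed here will not match the 29.39 from the session, and the variable names, double brackets, and random_state=0 are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data: charges rise with tenure, plus a small repeating wiggle.
tenure = list(range(1, 21))
customer_churn = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": [30 + 0.5 * t + (t % 3) for t in tenure],
})

x = customer_churn[["tenure"]]
y = customer_churn[["MonthlyCharges"]]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=0
)

regressor = LinearRegression()
regressor.fit(x_train, y_train)      # learn only from the training set
y_pred = regressor.predict(x_test)   # predict on rows the model has not seen

# Root mean square error: square root of the mean squared difference
# between the actual values (y_test) and the predictions (y_pred).
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)

print(y_pred[:5])     # first five predicted monthly charges
print(y_test.head())  # first five actual monthly charges
```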
These are the actual values, and these are the predicted values over here. Right,
so this is not exactly dependent on the churn of the customer. So what we are
doing is we are building an entire data science life cycle process. So here
we are trying to understand the relationship between the tenure of the
customer and the monthly charges of the customer. Right, so here what we’re
basically trying to understand is, let’s say, if the tenure of the customer is
around ten months, what would be his monthly charges? Again, if the tenure of
the customer is thirty months, what would be his monthly charges? Similarly, if the
tenure of the customer is 70 months, then what would be his monthly charges? So this
is what we are trying to understand over here. Right, so this time we’d have to
build a logistic regression model where the dependent variable is churn and the
independent variables are tenure and monthly charges.
Again, I’ll do the same thing. Let me actually copy these two. So this is what we did over here in
detail. Right, so this is where we went ahead and predicted the values on the test set.
Right, so I have divided my dataset into training and testing sets, and over here I
am fitting the model only on the train set. Right, so training is happening on
the training set, or the model is learning from the train set, and we are
predicting the values on the test set. So this model, this linear regression model,
has not yet seen any of the records present in the test set; it has only
learned from the values which are present in the training set. So we’ll go
ahead and build a logistic regression model. So customer_churn, and let me also put in y.
I’ll do the same thing, and in the logistic regression model
our independent variable is monthly charges.
Let me put monthly charges over here, and in
logistic regression the dependent variable is churn. So I have extracted, or I have got, my
features and my target over here. So the feature is obviously my monthly charges,
and my target is present in the churn column, and I’d want to understand if the
customer would churn or not on the basis of the monthly charges of the customer.
right so I’ve done this and the rest of the process would be the same for you I
will go ahead and divide this data frame in dude train and dust all right so this
is 6535 is show this time so I’ll say to be thirty five so this nine thirty four
thirty five percent of the records would be present in the test set and the rest
sixty five percent of the records would be present in the train set I’m storing
all of those into extreme extras white rain and whitest red Alice I’d have to
import the lowest regression model so from SK learn dot linear model
I’d have to import the law district regression model and I’d have to create
an instance of this so I’ll name this to be equal to log model
Right, so I have created an instance of this and I'll go ahead and fit the model on
top of the training set. The entire process, when it comes to all of these models,
is pretty much the same. Python makes it extremely easy, the scikit-learn library
makes it extremely easy. All you have to do is take in the data, find out your
independent variables and your dependent variable, and then go ahead and divide
those target and features into training and testing sets, build the model on top of
the train set and then predict the values on top of the test set, and then you will
find out the metrics. So for classification it's the confusion matrix and your
normal accuracy score, and for linear regression it could be the root mean squared
error, mean squared error or mean absolute error. So log_model.fit, and I will fit
this model on top of the training set again, so x_train and y_train.
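
The split-and-fit step for the single-feature logistic regression can be sketched
roughly as follows. The data is synthetic and the column names `MonthlyCharges` and
`Churn` are assumptions; `max_iter=1000` is added here only to give the solver
plenty of room on unscaled values and is not something mentioned in the session.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer_churn data (column names are assumed).
rng = np.random.default_rng(1)
n = 1000
monthly_charges = rng.uniform(20, 120, size=n)
# Hypothetical rule for the fake data: higher charges make churn more likely.
churn_prob = 1 / (1 + np.exp(-(monthly_charges - 70) / 15))
churn = np.where(rng.random(n) < churn_prob, "Yes", "No")
customer_churn = pd.DataFrame({"MonthlyCharges": monthly_charges, "Churn": churn})

x = customer_churn[["MonthlyCharges"]]  # feature
y = customer_churn["Churn"]             # target

# 65:35 split, so test_size is 0.35.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.35, random_state=0
)

log_model = LogisticRegression(max_iter=1000)
log_model.fit(x_train, y_train)  # the model learns only from the training set

print("train rows:", len(x_train), "test rows:", len(x_test))
```
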
Right, so I have fit the model. Yes, we can also have multiple independent
variables, sure. If we have a single independent variable, then that's a simple
model: when it comes to simple linear regression you have a single independent
variable, and when it comes to multiple linear regression you'll have multiple
independent variables. This basically means that we are trying to understand how
our dependent variable changes with respect to multiple independent variables over
there.
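
Written out, the difference between the two cases looks like this. The coefficient
names m_1 to m_4 follow the "m" wording used in the session; the intercept c is an
assumption added here because it is part of the standard form, although it is not
mentioned in the session.

```latex
\begin{aligned}
\text{simple:}   \quad & y = m_1 x_1 + c \\
\text{multiple:} \quad & y = m_1 x_1 + m_2 x_2 + m_3 x_3 + m_4 x_4 + \dots + c
\end{aligned}
```
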
It again comes down to this equation; linear regression is based on this equation.
Let me actually open WordPad, wait. So y equals, let's say, x1 plus x2 plus x3 plus
x4. What happens in simple linear regression is you just have one independent
variable, which would be x1. What happens in multiple linear regression is you have
multiple independent variables, which are x2, x3, x4 and so on, and you're trying
to understand how y varies with respect to all of these independent variables.
That's pretty much it. So here we take the value of x, and we are not changing the
value of x; we have multiple x values over here. So this becomes m1 x1 plus m2 x2
plus m3 x3 plus m4 x4 and so on. You have multiple independent variables and you
are trying to understand how y varies with those multiple independent variables.
Right, so let's proceed with this. We have built the model on top of the training
set; now let's go ahead and predict the values on top of the test set. So it'll be
log_model.predict, and I want to predict the values on top of the test set, and
I'll store the result in y_pred. So we have predicted the values now.
Right, so when it comes to a classification model we can use the confusion matrix.
So let me import the confusion matrix: from sklearn.metrics I will import
confusion_matrix and also accuracy_score. Now let me find out both of these. I'll
pass in the actual values and the predicted values; the actual values are present
in y_test and the predicted values are present in y_pred. Similarly, for
accuracy_score I'll do the same thing, y_test and y_pred. Right, so this is our
confusion matrix.
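
To make the argument order concrete, here is a tiny hand-made example of the two
metric calls. The eight labels below are invented for illustration and do not come
from the customer_churn data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: actual values first, predicted values second.
y_test = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
y_pred = ["No", "No", "Yes", "Yes", "No", "No", "Yes", "No"]

# Rows are actual classes, columns are predicted classes, in sorted label order.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print(cm)
print(acc)
```

With these labels, six of the eight predictions match, so the diagonal of the
matrix adds up to 6 and the accuracy is 0.75.
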
And this is the accuracy. So to get the accuracy, what you basically do is divide
the left diagonal by all of the values. The left diagonal represents all of your
correctly predicted classifications; it comprises all of your true positives and
true negatives. Right, so this part over here is all of your true positives and
this is all of your true negatives. When you divide this by the entire sample
space, that is when you will get the accuracy. So let me do that: that will be 1815
divided by 1815 plus 651, and I get the same accuracy, which is 73.60. Right, fine,
an accuracy of around seventy-three percent for the model which I've built. Now I
have to build a multiple logistic regression model where the dependent variable is
the same and the independent variables are tenure and monthly charges. So now we
have two independent variables, and I'll make the changes here itself. Right, so
Charles, the question which you were asking: what if we had multiple independent
variables?
Over here we have two independent variables. So this time the independent variables
are monthly charges and tenure, and I am trying to understand whether the customer
would churn or not on the basis of these two columns, which are monthly charges and
tenure. Right, so x now would comprise these features and y is the same, and the
ratio is 80:20, so I'll change the test size to be equal to 0.20 over here. I'll go
ahead and fit the model, it's the same again, and I'll predict the values, it's the
same. Right, so I built the model and I predicted the values; the only difference
which I made over here is I have two independent variables this time instead of a
single independent variable. After predicting the values I'll import
confusion_matrix and accuracy_score. Now over here again I'm calculating the
confusion matrix, and this time I get an accuracy of 77.50. Right, so this time it
will be 935 plus 157 divided by 935 plus 157 plus 106 plus 211, so we get an
accuracy of 77.50. So this left diagonal has all of the values which have been
correctly classified, and this right diagonal represents all of those values which
have been misclassified. So this was logistic regression, and then we've got two
more machine learning algorithms left, which are decision tree and random forest.
So let's build these two.
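
As a rough sketch of the two-feature logistic regression and the diagonal-based
accuracy described above: the data is synthetic, the column names are assumptions,
and `max_iter=1000` is an addition for this sketch. The last lines simply redo the
arithmetic quoted in the session for the two logistic regression models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer_churn data (column names are assumed).
rng = np.random.default_rng(2)
n = 1000
tenure = rng.integers(1, 73, size=n)
monthly_charges = rng.uniform(20, 120, size=n)
# Hypothetical rule for the fake data: high charges and short tenure raise churn.
score = (monthly_charges - 70) / 20 - (tenure - 36) / 18
churn = np.where(rng.random(n) < 1 / (1 + np.exp(-score)), "Yes", "No")
customer_churn = pd.DataFrame(
    {"tenure": tenure, "MonthlyCharges": monthly_charges, "Churn": churn}
)

x = customer_churn[["MonthlyCharges", "tenure"]]  # two independent variables
y = customer_churn["Churn"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=0
)

log_model = LogisticRegression(max_iter=1000)
log_model.fit(x_train, y_train)
y_pred = log_model.predict(x_test)

# Accuracy is the diagonal (correct predictions) divided by all predictions.
cm = confusion_matrix(y_test, y_pred)
diag_accuracy = np.trace(cm) / cm.sum()
print(cm)
print(diag_accuracy, accuracy_score(y_test, y_pred))

# The same arithmetic with the counts quoted in the session.
single_feature = 1815 / (1815 + 651)
two_features = (935 + 157) / (935 + 157 + 106 + 211)
print(round(single_feature * 100, 2), round(two_features * 100, 2))
```
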
So for the decision tree, the dependent variable is the same, which is churn, and
the independent variable is tenure. Right, so let me manually do this: x comes from
customer_churn and the independent variable is tenure. Let me also extract the
dependent variable, so the dependent variable would be churn. You have to make sure
that you get the spellings correct and also take care of the small and capital
letters over here. Right, now I will go ahead and also import the decision tree
classifier, so from sklearn.tree I'll be importing DecisionTreeClassifier. Now I'll
go ahead and divide this data frame into train and test, so it'll be the same
process; let me copy this and paste it over here.
So the split is 80:20, which is what we are doing over here. Right, so this is our
feature, these are our target labels, and we have divided the data set into
training and testing sets. Now let me go ahead and create an instance of this
decision tree classifier; I'll name this instance, let's say, my_tree. Right, and
I'll go ahead and fit the model on top of the training set, so my_tree.fit, and
this takes in two parameters, which are x_train and y_train. I fit the model; now
it's time to predict the values: my_tree.predict, and I'll be predicting on top of
x_test. Now I'll import the metrics. Right, so let me calculate the confusion
matrix and also the accuracy score. For the confusion matrix, I'll actually have to
store the predictions in an object first, so again I'll be storing them in y_pred;
let me run the cell again. And this again takes in two parameters: the first
parameter is all of the actual values, which are present in y_test, and the second
is all the predicted values, which are present in y_pred. So this is our confusion
matrix. Now let me calculate the accuracy, which would be 965 plus 87 divided by
965 plus 87 plus 281 plus 76.
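
The decision tree steps just walked through might look like this in code. This is a
sketch on synthetic data with assumed column names `tenure` and `Churn`;
`random_state=0` is added only so the run is repeatable. The final line redoes the
accuracy arithmetic with the counts read out in the session.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer_churn data (column names are assumed).
rng = np.random.default_rng(3)
n = 1000
tenure = rng.integers(1, 73, size=n)
# Hypothetical rule for the fake data: customers with short tenure churn more.
churn_prob = 1 / (1 + np.exp((tenure - 20) / 12))
churn = np.where(rng.random(n) < churn_prob, "Yes", "No")
customer_churn = pd.DataFrame({"tenure": tenure, "Churn": churn})

x = customer_churn[["tenure"]]  # independent variable
y = customer_churn["Churn"]     # dependent variable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=0
)

my_tree = DecisionTreeClassifier(random_state=0)
my_tree.fit(x_train, y_train)
y_pred = my_tree.predict(x_test)  # store predictions before computing metrics

cm = confusion_matrix(y_test, y_pred)   # actual values first, predictions second
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)

# The accuracy arithmetic with the counts quoted in the session.
quoted_accuracy = (965 + 87) / (965 + 87 + 281 + 76)
print(round(quoted_accuracy * 100, 2))
```
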
So for this decision tree model we get an accuracy of 74 percent. Right, so the
process is entirely the same, guys. What we are doing, again at the risk of being
redundant, is basically finding out our features and target variable; after that we
are dividing the features and target into a training and testing split, and then
we'll go ahead and build the model on top of the train data and predict the values
on top of the test data. Once the prediction is done, we'll calculate the accuracy
to find out how well our model has learned. Right, so in the entire data science
lifecycle the most important part is the data processing part. Coming to the model
building, it depends on your problem statement and what exactly you are trying to
find out. So let's say your dependent variable is continuous, if it's a continuous
numerical variable, then you will go with linear regression. And if you have
multiple categories, then you will go with either decision tree or random forest.
When it comes to logistic regression, in its basic form it is a binary classifier,
so let's say you have just two labels in your dependent variable, like this case
over here; this is where you will go with logistic regression. But then again, if
you have multiple categories, or if it's a multi-class classification problem, then
you will most probably go with decision tree and random forest. And again, when you
compare decision tree and random forest, random forest is generally better than a
decision tree, because random forest is an ensemble model. Through your course you
would have learned that a random forest is nothing but an ensemble of decision
trees, and the prediction given by a random forest is usually better than that of a
single decision tree, so that is what you can normally expect. Whatever accuracy
the decision tree gives, the accuracy given by a random forest is usually at least
as good. So just to verify that, let me actually go ahead and build a random forest
model on this same data over here, so we take the same x and y values. No, I think
the problem statement was different out there. Linear regression we used to
understand how the monthly charges vary with tenure, but the other algorithms which
we've built were to understand how churn varies with other factors. Over there
monthly charges was a numerical column, but churn is a categorical column, so we
can compare logistic regression and decision tree. Right, so if we compare logistic
regression and decision tree, logistic regression gave us an accuracy of 77%. Yes,
so logistic regression till now has given us the best accuracy. Right, so now
finally we'll also go ahead and build an ensemble model, which is random forest.
So let's actually compare the accuracy given by this decision tree and the random
forest. From sklearn.ensemble I'll import RandomForestClassifier and I'll create an
instance of this; maybe I'll name this rf. Right, and I'll go ahead and fit the
model on top of x_train and y_train. I fit the model; now it's finally time to
predict the values, so rf.predict, and I am predicting the values on top of x_test.
Let me build the confusion matrix over here. Right, now let me take the accuracy,
so I'll just type in accuracy_score and I'll pass in y_test and y_pred. So you get
an accuracy of 74.66 here, and the decision tree also gave 74.66.
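
A side-by-side run of the two models on one shared split can be sketched as below.
The data is synthetic with assumed column names, and `n_estimators=100` and
`random_state=0` are choices made for this sketch, so the accuracies printed will
not match the 74.66 from the session.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer_churn data (column names are assumed).
rng = np.random.default_rng(4)
n = 1000
tenure = rng.integers(1, 73, size=n)
churn_prob = 1 / (1 + np.exp((tenure - 20) / 12))
churn = np.where(rng.random(n) < churn_prob, "Yes", "No")
customer_churn = pd.DataFrame({"tenure": tenure, "Churn": churn})

x = customer_churn[["tenure"]]
y = customer_churn["Churn"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=0
)

# Single decision tree.
my_tree = DecisionTreeClassifier(random_state=0)
my_tree.fit(x_train, y_train)
tree_acc = accuracy_score(y_test, my_tree.predict(x_test))

# Random forest: an ensemble of decision trees trained on the same split.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x_train, y_train)
rf_pred = rf.predict(x_test)
rf_acc = accuracy_score(y_test, rf_pred)

print(confusion_matrix(y_test, rf_pred))
print("decision tree:", tree_acc, "random forest:", rf_acc)
```

With only one feature there is little for the ensemble to average over, which is
one plausible reason the two scores end up close to each other.
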
That's not much of a difference when you compare the decision tree and the random
forest over here; we've got the same values. But normally, in general, if you are
building a random forest model you'll either get a better accuracy or you'll get
the same accuracy. Right, so this is your normal, entire data science lifecycle.
You'll start off with the data pre-processing part and the data exploration part,
where you will understand the structure of the data set; you will visualize the
data set and understand whatever is happening underneath it. After that, once you
understand and comprehend your data properly, that is when you will go ahead and
build your model. Again, when it comes to building your model, you will sort of
follow the same procedure: you'll find out your independent variables and the
dependent variable, then you'll go ahead and divide those two into your training
and testing sets, build the model on top of the training set, then predict the
values on top of the test set, and finally you will calculate the accuracy.
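
Since the procedure is the same for every classifier, it can be wrapped in one
small helper. The function name `fit_and_score`, the synthetic data and the column
names are all hypothetical and are only meant to illustrate the repeated steps.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(model, x, y, test_size=0.20, random_state=0):
    """Split the data, fit on the train set, predict on the test set, return accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return accuracy_score(y_test, y_pred)


# Synthetic stand-in for the customer_churn data (column names are assumed).
rng = np.random.default_rng(5)
n = 1000
tenure = rng.integers(1, 73, size=n)
monthly_charges = rng.uniform(20, 120, size=n)
score = (monthly_charges - 70) / 20 - (tenure - 36) / 18
churn = np.where(rng.random(n) < 1 / (1 + np.exp(-score)), "Yes", "No")
customer_churn = pd.DataFrame(
    {"tenure": tenure, "MonthlyCharges": monthly_charges, "Churn": churn}
)

x = customer_churn[["MonthlyCharges", "tenure"]]
y = customer_churn["Churn"]

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: fit_and_score(model, x, y) for name, model in models.items()}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Using the same `random_state` for every call keeps the train and test rows
identical across models, so the accuracies are directly comparable.
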
So this is pretty much it for today's session. This is your entire end-to-end
project on the customer churn data set, which comprised data manipulation, data
visualization and implementing all the machine learning algorithms. So this brings
us to the end of the session. If you have any queries, do comment below and we'll
reach out immediately. Also, do subscribe to our channel so that you don't miss out
on any of the upcoming videos.

37 thoughts to “Python Projects for Beginners | Python Projects | Intellipaat”

  1. Guys, what else do you want to learn from Intellipaat? Comment down below and let us know so we can create more such tutorials for you.

  2. Automation with Python plse

    You are awesome plse make more video like this 😍😍😍😍

  3. Really appreciate it. Amazing work not only work it just a combination of smart+hard work. Intellipat team is just amazing.

  4. i think you should make videos on Projects of ML, DS, AI which will boast your channel popularity and people can understand the idea behind the concepts.
    these videos contains all the concepts which we have learned in python language learning if you are uploading about 10 different videos on 10 different projects these series will help a lot where everyone can get the ideas

  5. Team , excellent video.
    Can you please help me with the following?
    1. Where we can find the dataset used in this case study?
    2., how do we get the toll tip, when we type some keywords like read_ drop down list appears. how to get this.

  6. 👋 Guys everyday we upload high quality in depth tutorial on your requested topic/technology so kindly SUBSCRIBE to our channel 👉 ( http://bit.ly/Intellipaat ) & also share with your connections on social media to help them grow in their career. 🙂

  7. Sir i am facing issue on logistics regression, the model can not fit., i run the same code but it shows the error. (value error)
