Hey guys, welcome to the session by Intellipaat. So, Python is a versatile language which is used for multiple

purposes such as web development, machine learning, and deep learning, and the best

way to get expertise in Python is to work on hands-on projects. So today, we

have come up with this capstone project on Python so that you get a complete

understanding of the data science life cycle.

So, before we start off with the class, do subscribe to our channel so that you get

a notification of our next video. So, let’s go through the agenda. We’ll

start off by implementing some data manipulation operations with the pandas

library, and then we’ll use matplotlib to visualize the underlying data, after

that we’ll implement some machine learning algorithms using Scikit-learn.

Finally, there’ll be a quiz to recap what we have learnt in today’s session.

So, do put down your answers in the chat box so that you know if you have answered it

correctly. Also, you can put down all of your queries in the comment section; we’d love to help you out. So, without much delay, let’s start off with the

class. Consider yourself to be a Data Scientist at a prestigious telecom

company, and the name of that company is ‘Neo,’ and they’re facing a major problem

and that problem is that their customers are churning out to other competitors. Now,

you as a Data Scientist at that particular company, have to make sure to

stop this churning out and also find out the reasons why customers are

churning out to other companies. So, this is the problem statement. What

you’re basically going to do is a bit of data manipulation, data visualization

operations, and then you’ll go ahead and build the ML algorithms on top of this

dataset. You will start off with the linear regression

algorithm and then you’ll build logistic regression, and random

forest algorithms. We will be working with the customer_churn dataset. So this is our dataset, which comprises all of these columns. So

we’ve got customer ID, gender, senior citizen, partner, dependents, and so on.

So, this is our first task, data manipulation. To do this, let’s

go ahead and actually import all of our libraries. So I’ll just type in import pandas as pd and then I’ll also load the

NumPy library, so I’ll type import numpy as np. I’d also need the matplotlib

library. So I’ll type in ‘from matplotlib import pyplot as plt’.

So, these are all my required libraries. I’ll just wait till these

libraries are loaded. This is done. I’ll also go ahead and load up my

customer_churn data frame. I’ll store it into an object named customer_churn, using pd.read_csv and giving it the name of the file, which is customer_churn.csv. So I have successfully loaded the file and stored it into the new customer_churn object. Now I’ll go ahead and have a glance at the head of it with customer_churn.head(). This is the dataset on which we’ll be implementing all of the operations.
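The setup just described can be sketched as follows. The pd.read_csv call is exactly what the session types; the small inline frame below is only a stand-in with hypothetical values so the snippet runs without the actual customer_churn.csv file:

```python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# In the session the frame comes straight from the file:
# customer_churn = pd.read_csv("customer_churn.csv")
# Stand-in rows (hypothetical values) so this sketch runs without the CSV:
customer_churn = pd.DataFrame({
    "customerID": ["7590-A", "5575-B", "3668-C", "7795-D"],
    "gender": ["Female", "Male", "Male", "Male"],
    "SeniorCitizen": [0, 0, 0, 1],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Churn": ["No", "No", "Yes", "No"],
})
print(customer_churn.head())
```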

This is just the unique customer ID. This column, gender, tells us about the gender

of the customer whether the customer is male or female.

The senior citizen column tells us whether the customer is a senior citizen

or not. So if it’s zero then the customer is not a senior citizen if it’s one then

the customer is a senior citizen. And this tells us if the customer has a

partner or not. This column tells us if the customer has dependents or not. This is

the tenure of the customer in months. It’s 1 month, 34 months, 2

months, and so on. This tells us if the customer has phone service or not. Does

the customer have multiple lines, and this is the type of Internet service

used, and then whether the customer has online security, whether he has device

protection, tech support, streaming TV, streaming movies, and so on, and then this

column is for contract. So this just tells you the contract type of the

customer. So, the contract type of the customer could either be month-to-month,

one year, or two years and there’s a type of billing whether it is paperless or

not. After this, we have the type of payment method. So the type of payment

method could be electronic check, mailed check, bank transfer, and so on. These are the monthly charges and total charges incurred by the customer. So now let’s start off with our data

manipulation tasks. So this is our very simple task. We just have to extract some

individual columns from the entire data frame. We will have to extract the fifth column

and store it in customer_5. So let me do that. I’ll type in the name of the data frame first, which is customer_churn and I would use .iloc. I would need all of

the rows, and I would need the fifth column; since

the indexing starts from 0. So, 0, 1, 2, 3, and 4. So this

would be my fifth column over here, and I will store this in

let’s say C_5, and I’ll have a glance at the head of it with C_5.head(). So we have successfully extracted

the fifth column from this entire data frame and this is the head of it.
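The positional extraction works because .iloc indexes by integer position rather than by label. A minimal sketch, using a small stand-in frame since the real CSV isn’t bundled here:

```python
import pandas as pd

# Stand-in columns in the same order the session describes
customer_churn = pd.DataFrame({
    "customerID": ["0001-A", "0002-B"],
    "gender": ["Male", "Female"],
    "SeniorCitizen": [0, 1],
    "Partner": ["Yes", "No"],
    "Dependents": ["No", "No"],   # fifth column -> position 4
})
# All rows, column position 4 (the fifth column, since indexing starts at 0)
C_5 = customer_churn.iloc[:, 4]
print(C_5.head())
```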

Similarly, we will have to extract the fifteenth column and store it in customer_15. If it’s the fifteenth column, then the index number would be 14, because again the indexing starts from zero. I’ll store this in C_15 and click on run. So this is the StreamingMovies column, and I’ve extracted only this particular column from the entire data frame. All right, that was a basic extraction of columns. After that, we’d have to do data extraction on the basis of a condition: extract all the male senior citizens whose payment method is electronic check. So there are three conditions over here: the first condition is that the gender of the customer needs to be male, the second condition is that the value of SeniorCitizen needs to be equal to 1, and the third condition is that the payment method needs to be equal to electronic check. I’ll give in all of these three conditions. I will start off with the first condition: customer_churn, then the gender column, and the gender needs to be equal to ‘Male’. I’ll wrap this in parentheses, then use the & operator and give in the second condition, which goes like this: customer_churn again, and this time we set the SeniorCitizen value, which needs to be equal to 1. Then I’ll go ahead and give in the third condition, which is for the PaymentMethod column: the payment method needs to be equal to ‘Electronic check’. All right, so I have all of my three conditions. In the cell below, I will copy all of these three conditions, paste them inside the data frame’s square brackets, store the result in a new object named c_random, and print the head of it with c_random.head(). So if you have a glance at the gender column, you’ll notice that all of the values are ‘Male’; similarly, in the SeniorCitizen column all of the values are 1; and in the PaymentMethod column all of the values are ‘Electronic check’. So I’ve given three conditions over here, and all of these conditions are being satisfied.
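The three-condition filter just described boils down to a boolean mask combined with & (each comparison wrapped in parentheses); a sketch with stand-in rows:

```python
import pandas as pd

customer_churn = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Male"],
    "SeniorCitizen": [1, 1, 1, 0],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Electronic check", "Electronic check"],
})
# All three conditions must hold at once, hence & between the masks
c_random = customer_churn[
    (customer_churn["gender"] == "Male")
    & (customer_churn["SeniorCitizen"] == 1)
    & (customer_churn["PaymentMethod"] == "Electronic check")
]
print(c_random.head())
```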

Next up, we have to extract all those customers whose tenure is greater than 70 months or whose monthly charges are greater than $100. Again, we’ll do the same thing: the tenure needs to be

greater than 70; I’ll put this up inside the brackets. Now, what’s different over here is that we are using the or operator: it’s either the first condition or the second condition, so if one of these conditions is true, then we’ll get that particular record. So again customer_churn, and this time it’s the monthly charges: the monthly charges have to be greater than 100. These are the two conditions: either the tenure of the customer needs to be greater than 70 months, or the monthly charges of the customer need to be greater than $100. Again I’ll insert a cell below, cut this, paste it inside the brackets, and store the result in c_random again. Now let me again print the head with c_random.head(). If I head onto the tenure column, you see that none of the tenure values over here is greater than 70, but if I go to the monthly charges column, then you’ll notice that the monthly charges are greater than 100. So it’s either/or: one of these conditions has to be true, and in this case we see that the second condition is true over here. So if either of the conditions is true, then we’ll get that entire record.
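The either/or logic maps to the | operator; a sketch with stand-in values:

```python
import pandas as pd

customer_churn = pd.DataFrame({
    "tenure": [2, 71, 10],
    "MonthlyCharges": [105.50, 20.20, 45.00],
})
# A row is kept when EITHER condition holds, hence | between the masks
c_random = customer_churn[
    (customer_churn["tenure"] > 70)
    | (customer_churn["MonthlyCharges"] > 100)
]
print(c_random)
```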

After this, we’d have to extract all those customers whose contract is of two years, whose payment method is mailed check, and whose value of Churn is ‘Yes’. All right, I’ll again copy this, paste it over here, and let me

delete what we don’t need. So we have these three conditions over here; let me put them in. First, the Contract column needs to be equal to ‘Two year’; let me just check how it’s written, right, ‘Two year’ with a space. After that, it’s the second condition: over here the PaymentMethod needs to be equal to ‘Mailed check’, so again I’ll put in the double equals operator and give in the value. And then I’ll give in the final condition: customer_churn, and this time the Churn needs to be equal to ‘Yes’. So I’ve given all of these three conditions and separated them using the & operator. Again I will store this in c_random, and let me print out c_random. So we see that there are just three records, or just three customers, who satisfy all of these three conditions. Let me have a look at the contract: the contract is of two years for all of these three customers; next is the payment method, and again the payment method is mailed check; and the Churn values are all ‘Yes’. So out of all of those seven thousand rows, there are only three which satisfy these three conditions.
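Those three conditions, chained with &, can be sketched like this (stand-in rows):

```python
import pandas as pd

customer_churn = pd.DataFrame({
    "Contract": ["Two year", "Two year", "Month-to-month"],
    "PaymentMethod": ["Mailed check", "Mailed check", "Electronic check"],
    "Churn": ["Yes", "No", "Yes"],
})
# Only rows satisfying all three conditions survive the mask
c_random = customer_churn[
    (customer_churn["Contract"] == "Two year")
    & (customer_churn["PaymentMethod"] == "Mailed check")
    & (customer_churn["Churn"] == "Yes")
]
print(c_random)
```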

Next, this is a question on random sampling: we just have to extract 333 random records from the entire data frame, and to do this we’ll be using the sample method. So I’ll type in customer_churn.sample() and, inside it, give the number of records I want in the sample, which is 333. I’ll store this in c_333 and bring up the head of it. Now, whenever I run this, every time I get a different sample of 333 records: if you have a glance at the row indexes and the customer IDs over here, you’ll see they change on every run. So this is random sampling: I’m randomly sampling 333 records from the entire data frame with the help of the sample method.
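The sampling step is a single call; a sketch on a stand-in frame:

```python
import pandas as pd

customer_churn = pd.DataFrame({"customerID": range(1000)})
# sample(n) draws n rows at random; without a fixed random_state,
# each run returns a different subset (which is what the session shows)
c_333 = customer_churn.sample(n=333)
print(c_333.head())
```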

This is the final operation when it comes to data manipulation: I’d have to get the count of the different levels from the Churn column. If I want the count of the different levels present in a categorical column, I have the value_counts method: first I give in the name of the data frame, which is customer_churn, then the name of the column, which is Churn, and then I just type in .value_counts(). So this is it.

Let me just wait till I get the result. So we see that the number of customers who will not be churning out is 5,174, and the number of customers who will be churning out is 1,869.
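The level counts come from value_counts; a sketch on stand-in labels:

```python
import pandas as pd

customer_churn = pd.DataFrame({
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})
# value_counts tallies how often each level occurs in a categorical column
counts = customer_churn["Churn"].value_counts()
print(counts)
```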

You can do the same thing for other categorical columns as well. Let’s say I want the counts of the different levels for the Contract column; I’ll just change the name of the column over here to Contract. So there are 3,875 customers whose contract type is month-to-month, there are 1,695 customers whose contract is of two years, and there are 1,473 customers whose contract is of one year. So these were some basic data manipulation operations. After this, we’ll head on to data

visualization. So here we’ll have to create a simple bar plot for the InternetService column, and we’ll have to set the x-axis label to ‘Categories of Internet Service’, the y-axis label to ‘Count of categories’, the title of the plot to ‘Distribution of Internet Service’, and the color of the bars needs to be orange. So I’ll just type in plt.bar over here and insert another cell. Now, plt.bar basically takes in two parameters: the first parameter is the names of all of the bars, and the second parameter is the values for those bars. The names of the bars would be coming from the InternetService column, so I actually want the value_counts of it, and from this I don’t want the values, I just need the keys, and I’d have to convert them into a list, so I’ll use list() over here. I’ll click on run and see what we get. So these are the three levels present in the InternetService column. From this column I’ve used the value_counts method, and this value_counts result has two things: keys and values. I don’t want the values as such, I just want the keys, so I’ll take these keys and convert them into a list. This is the list of the names present in the InternetService column; I’ll cut this and paste it over here, and this would be my first parameter. My second parameter would be all of the values, and if I want all of the values I’ll just remove the keys() method from there. So this over here gives me all of the values present with respect to the InternetService column: the number of customers whose internet service is fiber optic is 3,096, the number whose internet service is DSL is 2,421, and the number who don’t use the internet service is 1,526. Now again I’ll cut this and paste it over here, and let me bring this out. So this is my basic bar plot: on the x-axis I have the names representing these bars, so this bar is for all the customers whose internet service is fiber optic, this one is for those whose internet service is DSL, and this one is for those who are not availing the internet service, and their counts are on the y-axis. Now we have to do the other things. I had to change the color of the bars to orange, and for this I have the color parameter, which I’ll just set to be equal to ‘orange’; I’ll run this again, and we have successfully changed the color of these bars. Now I’d have to set the x-axis and y-axis labels, so I’ll type in plt.xlabel, which needs to be equal to ‘Categories of Internet Service’; after that, the label for the y-axis, plt.ylabel, set to ‘Count of categories’; and then finally the title, plt.title, which needs to be equal to ‘Distribution of Internet Service’. Correct, so this is the final bar plot, which gives us the distribution of internet service, with the x-axis label ‘Categories of Internet Service’ and the y-axis label ‘Count of categories’. So this is how we can create a simple bar plot.
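Putting the whole bar-plot recipe together (stand-in counts; saving to a file stands in for plt.show() in a notebook):

```python
from matplotlib import pyplot as plt
import pandas as pd

customer_churn = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "Fiber optic",
                        "DSL", "DSL", "No"],
})
counts = customer_churn["InternetService"].value_counts()
# First argument: bar names (the keys); second argument: bar heights (the values)
plt.bar(list(counts.keys()), list(counts), color="orange")
plt.xlabel("Categories of Internet Service")
plt.ylabel("Count of categories")
plt.title("Distribution of Internet Service")
plt.savefig("internet_service_bar.png")   # plt.show() in a notebook
```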

And all of this, these are the basic steps before you go ahead and build all of your machine learning algorithms: the data pre-processing part is always the main part of your data science lifecycle. This is where you properly comprehend your dataset, where you understand its structure, the correlation between all of the columns, and the correlation between the dependent variable and the independent variables. By manipulating the dataset and visualizing its structure, this is where you understand all the patterns in the data and get insights from it. All right, so next up we have to build a

histogram for the tenure column. So again, it’s a similar operation: plt.hist, and I’d have to build the histogram for the tenure column, so it’ll be customer_churn[‘tenure’]. I’d have to set the number of bins to be equal to 30 and the color to be equal to green, so I’ll type in color equals green over here. So this is our histogram, and this gives us the distribution of the tenure of the customers. If you look at it closely, there are around 800-odd customers whose tenure is not even one month, so they are churning out before they even complete one month, and again there’s a huge peak at the other end: there are more than 600 customers whose tenure is 70 months or more. In between, the counts are pretty much steady, with the normal range between 200 and 400 customers per bin. So you have a peak at the start and a peak at the end. Again, let’s go ahead and add the title: it’ll be just plt.title over here, and the title of the plot needs to be equal to ‘Distribution of tenure’. All right, so I have created this plot, and this is the title of the plot, which is ‘Distribution of tenure’.
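A sketch of the same histogram call, with synthetic tenure values standing in for customer_churn["tenure"]:

```python
from matplotlib import pyplot as plt
import numpy as np

# Synthetic stand-in for the tenure column (months between 0 and 72)
tenure = np.random.default_rng(0).integers(0, 73, size=500)
plt.figure()
plt.hist(tenure, bins=30, color="green")
plt.title("Distribution of tenure")
plt.savefig("tenure_hist.png")
```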

So we’ve made a bar plot and a histogram. Now, you also need to understand the difference between a bar plot and a histogram: a bar plot is normally used for categorical columns, so whenever you want to understand the distribution of a categorical column, that is when you go with a bar plot; and when you want to understand the distribution of a continuous numerical column, that is when you go with a histogram. So next up, we’d have to create a

scatter plot between monthly charges and tenure, with tenure on the x-axis and monthly charges on the y-axis. So plt.scatter: the x-axis would be tenure, so customer_churn[‘tenure’], and then I’ll set in the column for the y-axis, which will be customer_churn[‘MonthlyCharges’]. Let me just run this; so this is what we get over here. Now let me also set in the labels: plt.xlabel, and this would be equal to ‘Tenure’; similarly I’ll also set the y label, so this will be plt.ylabel, and this would be equal to ‘Monthly Charges’. So now we also get the corresponding x- and y-axis labels. After this, I’ll also go ahead and set the title: plt.title, and this would be ‘Monthly Charges vs Tenure’. So this is our final scatter plot, where we have the x-axis and y-axis labels.
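The scatter plot is the same three-call pattern; a sketch with synthetic stand-in values:

```python
from matplotlib import pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
tenure = rng.integers(0, 73, size=200)                  # stand-in tenure
monthly = 20 + 0.6 * tenure + rng.normal(0, 10, 200)    # stand-in monthly charges
plt.figure()
plt.scatter(tenure, monthly)
plt.xlabel("Tenure")
plt.ylabel("Monthly Charges")
plt.title("Monthly Charges vs Tenure")
plt.savefig("tenure_scatter.png")
```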

This is the title, which is ‘Monthly Charges vs Tenure’. And finally, we’d have to also build a box plot between the tenure column and the contract column, so tenure needs to be on the y-axis and contract needs to be on the x-axis.

For this, I’ll type customer_churn.boxplot, and now I’ll set by to be equal to the contract column; after this I have the column parameter, which needs to be equal to tenure. Let’s see; well, we get an error here, ‘column not found’: we had actually passed the column from the data frame itself instead of just its name, so let me fix that and see what happens. All right, so now we get the result. So it’s customer_churn.boxplot, and all you have to do is assign the contract column to the x-axis: when I set by equals contract, what happens is I get one box plot each for the different levels of the Contract column. I have one box plot for the month-to-month level, another box plot for the one-year level, and another box plot for the two-year level, and over here the y-axis is being determined by the tenure column, so this range from 0 to 70 is the tenure of the customer. What we understand from this box plot is: if the contract of the customer is of two years, then most probably the median tenure of the customer is very high, around 65 months; similarly, if the contract of the customer is one year, then the median tenure of the customer would be around 45 months; and if the contract of the customer is month-to-month, then the median tenure of the customer would be much lower.
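The grouped box plot is one call on the data frame itself; a sketch with stand-in rows (note column takes the column name as a string, which was the source of the error above):

```python
from matplotlib import pyplot as plt
import pandas as pd

customer_churn = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"] * 4,
    "tenure": [2, 40, 65, 5, 45, 70, 1, 50, 60, 12, 44, 68],
})
# DataFrame.boxplot draws one box per level of the `by` column
customer_churn.boxplot(column="tenure", by="Contract")
plt.savefig("tenure_by_contract_box.png")
```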

So these were all of the examples of visualization; now it’s finally time to head on to machine learning. That was your data pre-processing part, where you understood the structure of the data, learned how to extract individual columns, and after that learned how to visualize the data and get some interesting insights from its structure. We’ll start off with our first machine

learning algorithm, which would be linear regression. In linear regression, as you already know, the dependent variable is a numerical column, and we’re basically trying to understand how one variable changes with respect to another. Over here we’ll have to build a simple linear model where our dependent variable is MonthlyCharges and the independent variable is tenure; in other words, we are trying to understand how the monthly charges vary with respect to the tenure. So MonthlyCharges is the dependent variable, tenure is the independent variable, and these are all of the subtasks when it comes to this linear model: we’ll start off by dividing the dataset into a 70/30 split, then we’ll build the model on the train set and predict the values on the test set, and after that we’d have to find out and print the root mean squared error. So let me go ahead and import the linear regression model from sklearn. I’ll type in ‘from sklearn import linear_model’, and after this, ‘from sklearn.linear_model import LinearRegression’. These are my two basic imports. I would also require the train-test split, so I’ll type in ‘from sklearn.model_selection import train_test_split’. The train_test_split method will help me divide my dataset into training and testing sets. So now it’s time to divide my data into training

and testing sets. Before that, I’d have to get my target and the features, or in other words separate my dependent variable and the independent variable. MonthlyCharges is the dependent variable, so I’ll extract only the MonthlyCharges column, store it in a new variable, and name that variable y. Similarly, I’ll extract only the tenure column and store it in x. What seems to be the problem over here? Ah, MonthlyCharges: let me put the capital C over here. Now let me bring out the heads of these two: y.head() and x.head(). So these are the values from the MonthlyCharges column, and these are the values from the tenure column. Now let me go ahead and divide these two into

training and testing sets. I’ll use train_test_split: I’ll pass in x as the first parameter, so all the features, which are stored in x, go first; after that I’d have to give in the target labels, which are basically my monthly charges stored in y; and then finally I’d have to give in the test size. Let me check what the test size was supposed to be. I typed 0.70, and I’ll also set a random state: if I want to get these same splits again, I can just set random_state to the same value I’m giving over here. Now, this test size of 0.70 basically means that 70% of the records would go into the testing set; oh, this has to be 0.30, sorry for that. So 30% of the records go into the test set, and the rest, 70% of the records, go into the training set. Now I’ll be getting four results over here, and those four results are x_train, x_test, y_train, and y_test; these are the names which we conventionally use. Your x_train represents all of those values of your features which are present in the training set, x_test represents all of the features which are present in the test set, y_train represents all of the dependent values which are present in the train set, and y_test represents all of the dependent values which are present in the test set. Whenever we are building a model, we’ll build that model on top of the train set, so we’ll build the model on top of x_train and y_train. Let me also show you the shapes of all of these: x_train.shape, and the same for the rest: y_train, x_test, and y_test. So the training set has this many records and the testing set has this many records: these are the features present in the training set, these are the features present in the test set, these are the target values of the training set, and these are the target values of the test set. Correct. So now that we have x_train, y_train, x_test, and y_test, we’ll have to build the model on top of the

training data. Now, normally your training data would be bigger: your split is usually 70/30, 65/35, or 80/20, because the more data you give for training, the better, but then again you can’t give out your entire data to the training set. The purpose of training your model is to make sure that your model learns the underlying patterns of the data, and once the learning is done, you’d have to also test how well the learning was done, and for that test you’ll need a sample space as well: the test set. Consider it simply: let’s say you’re giving an exam, and for that exam you’ve got a hundred exercises, so your syllabus comprises all of the hundred exercises and you’d have to learn all of them, but the test will have only ten exercises from all of those hundred. The training needs to be done, but the test has to be on material that wasn’t simply memorized during the training phase. That is why the training set and the test set have to be completely different, and this division into training and testing sets is done to make sure that overfitting doesn’t happen. When overfitting happens, the problem is that the model performs well on this particular dataset, but when a new dataset comes in, it fails miserably. This is the reason why we divide the data into training and testing sets. So now let me go ahead and create

an instance of the linear regression model, and I’ll name it regressor. So I’ve created an instance of LinearRegression over here, and I’ll go ahead and fit the model on top of the training set: regressor.fit(x_train, y_train). So I’ve fit the model on top of the training set; now it’s time to predict the values, so it’ll be regressor.predict(), I’ll be predicting the values on top of x_test, and I will store this in, let’s say, y_pred. Now I’ve fit the model on top of the training set, and I have also predicted the values on top of the test set. Now

I’d have to know how well the prediction has been done, and for this, when it comes to linear regression, we have something known as the root mean squared error: the lower the value of the root mean squared error, the better your model. And again we have an inbuilt method to help calculate it: from sklearn.metrics I’ll be importing mean_squared_error. After this, since what we actually want is the root mean squared error, I would need np.sqrt, and inside it I’ll use mean_squared_error, which takes in two parameters: the first parameter comprises the actual values, which are present in y_test, and the second parameter is the predicted values, which are present in y_pred. So we get a root mean squared error value of 29.39.
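End to end, the pipeline just walked through looks like this. The 29.39 RMSE was on the real data; with the synthetic stand-in numbers below, the value will differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.integers(0, 73, size=500).reshape(-1, 1)     # tenure; sklearn wants 2-D features
y = 20 + 0.6 * x.ravel() + rng.normal(0, 10, 500)    # stand-in monthly charges

# 70/30 split; random_state pins the split for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=0)

regressor = LinearRegression()
regressor.fit(x_train, y_train)       # learn on the train set only
y_pred = regressor.predict(x_test)    # predict on unseen rows

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(round(rmse, 2))
```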

Now let’s say you build some other model with some different independent variables: you’d again go ahead and build the model, predict the values, and calculate the root mean squared error for it. If the root mean squared error of that model is, let’s say, 39, then this model would be better than that second model; similarly, if there is, let’s say, a model three whose root mean squared error is 19, then that model three would be better than the model we’ve built over here. So, are we predicting the monthly charges values? That is exactly right; let me actually show you that. When I put in y_pred and have a glance at the first five values, these are the monthly charges predicted, and let me also show you y_test’s first five: these are the actual values, and those are the predicted values. And no, this is not exactly predicting the churn of the customer; what we are doing is building an entire data science lifecycle process, and here we are trying to understand the relationship between the tenure of the customer and the monthly charges of the customer. What we’re basically trying to understand is: if the tenure of the customer is around ten months, what would be his monthly charges; if the tenure of the customer is thirty months, what would be his monthly charges; similarly, if the tenure of the customer is 70 months, then what would be his monthly charges. This is what we are trying to understand over here. So this time we’d have to

build a logic regression model where dependent feasible is shown and

independent variables our tenure and monthly charges

Again, I'll do the same thing; let me actually copy these two lines, because this is what we did over here earlier, right, where we went ahead and predicted the values. So, I have divided my dataset into a training and a testing set, and over here I am fitting the model only on the train set. Training is happening on the training set, or in other words the model is learning from the train set, and we are predicting the values on the test set. This model has not yet seen any of the records present in the test set; it has only learned from the values which are present in the training set.

So, we'll go ahead and build a logistic regression model on customer churn. In this logistic regression model, our independent variable is monthly charges, so let me put monthly charges over here, and the dependent variable is churn. So, I have extracted, or I have got, my features and my target over here: the feature is obviously my monthly charges, my target is present in the churn column, and I'd want to understand if the customer would churn or not on the basis of the monthly charges of the customer.

Right, so I've done this, and the rest of the process would be the same. I will go ahead and divide this data frame into train and test. The split is 65/35 this time, so I'll set the test size to be 0.35; this means 35 percent of the records would be present in the test set and the remaining 65 percent of the records would be present in the train set. I'm storing all of those into X_train, X_test, y_train, and y_test. Next, I'd have to import the logistic regression model, so from sklearn.linear_model I'd have to import LogisticRegression, and I'd have to create an instance of this, which I'll name log_model.

Right, so I have created an instance of this, and I'll go ahead and fit the model on top of the training set. Whatever model it comes to, the entire process is pretty much the same, and Python, or rather the scikit-learn library, makes it extremely easy. All you have to do is take in the data, find out your independent variables and your dependent variable, divide those features and target into a training and a testing set, build the model on top of the train set, predict the values on top of the test set, and then find out the metrics. For classification, it's the confusion matrix and your normal accuracy score, and for linear regression it could be the root mean squared error, the mean squared error, or the mean absolute error. So, log_model.fit, and I will fit this model on top of the training set by passing in X_train and y_train.
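The whole fit step can be sketched end to end; a minimal sketch on a tiny invented table whose column names mirror the dataset (the values themselves are made up, and the 65/35 split matches the walkthrough):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Tiny made-up stand-in for the customer_churn data
df = pd.DataFrame({
    "MonthlyCharges": [29.85, 89.10, 56.95, 99.65, 42.30, 70.70, 53.85, 95.00],
    "Churn":          ["No",  "Yes", "No",  "Yes", "No",  "Yes", "No",  "Yes"],
})

x = df[["MonthlyCharges"]]   # feature must be 2-D for scikit-learn
y = df["Churn"]              # target

# 65/35 train/test split, as in the session
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.35, random_state=0
)

log_model = LogisticRegression()
log_model.fit(x_train, y_train)   # the model learns only from the train set
y_pred = log_model.predict(x_test)
print(list(y_pred))
```

On the real dataset, x and y would come from the loaded customer_churn data frame instead of this toy table.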

Right, so I have fit the model. Yes, we can also have multiple independent variables, sure. If we have a single independent variable, then that's a simple model, so it comes to simple linear regression; when it comes to multiple linear regression, you'll have multiple independent variables. This basically means that we are trying to understand how our dependent variable changes with respect to multiple independent variables.

When it comes to the equation, linear regression is based on this; let me actually open up Notepad here. So, y equals, let's say, x1 plus x2 plus x3 plus x4. In simple linear regression, you just have one independent variable, which would be x1; in multiple linear regression, you have multiple independent variables, which are x2, x3, x4, and so on, and you're trying to understand how y varies with respect to all of these independent variables. That's pretty much it.

Right, so let's proceed with this. We have built the model on top of the training set; now let's go ahead and predict the values on top of the test set, so it'll be the log_model.predict method. Here, we take the values of x, and we are not changing them; when we have multiple x values over here, the equation becomes y = m1·x1 + m2·x2 + m3·x3 + m4·x4, and so on. You have multiple independent variables, and you are trying to understand how y varies with those multiple independent variables. So, log_model.predict, and I want to predict the values on top of the test set, and I'll store the result in y_pred. So, we have predicted the values now.

Right, so when it comes to a classification model, we can use the confusion matrix. Let me import it: from sklearn.metrics, I will import confusion_matrix and also accuracy_score. Now, let me find out both of these. I'll pass in the actual values and the predicted values; the actual values are present in y_test, and the predicted values are present in y_pred. Similarly, for the accuracy score I'll do the same thing: y_test and y_pred.

Right, so this is our confusion matrix, and this is the accuracy. To get the accuracy, what you basically do is divide the left diagonal by the sum of all of the values. The left diagonal represents all of your correctly predicted classifications; this part over here comprises all of your true positives, and this is all of your true negatives. So, when you divide this by the entire sample space, that is when you will get the accuracy. Let me do that: that will be 1815 divided by 1815 plus 651, and I get the same accuracy, which is 73.60. Fine, so the accuracy is around 73 percent

for the model which I've built. Next, I have to build a multiple logistic regression model where the dependent variable is the same and the independent variables are tenure and monthly charges. So now we have two independent variables, and I'll make the changes here itself. Right, so Charles, to the question which you were asking, what if we had multiple independent variables: over here we have two independent variables. This time, the independent variables are monthly charges and tenure, and I am trying to understand whether the customer would churn or not on the basis of these two columns. So, x now comprises these two features, y is the same, and the ratio is 80/20, so I'll change the test size to be equal to 0.20 over here.

I'll go ahead and fit the model; it's the same again. I'll predict the values; that's the same as well. Right, so I've built the model and predicted the values; the only difference I made over here is that I have two independent variables this time instead of a single independent variable. After predicting the values, I'll import confusion_matrix and accuracy_score. Over here, again, I'm calculating the confusion matrix, and this time I get an accuracy of 77.50. So this time it will be 935 plus 157 divided by 935 plus 157 plus 106 plus 211, and we get an accuracy of 77.50. This left diagonal holds all of the values which have been correctly classified, and the right diagonal represents all of those values which have been misclassified. So, this was the logistic regression.

And then, we've got two more machine learning algorithms left, which are decision tree and random forest, so let's build these two. For the decision tree, the dependent variable is the same, which is churn, and the independent variable is tenure. Right, so let me manually do this: x is the tenure column of customer_churn, and let me also extract the dependent variable, which would be churn. I have to make sure that I get the spelling correct and also take care of the lowercase and capital letters over here. Now, I will go ahead and also import the decision tree classifier,

so from sklearn.tree, I'll be importing DecisionTreeClassifier. Now, I'll go ahead and divide this data frame into train and test; it'll be the same process, so let me copy this and paste it over here. The split is 80/20, which is what we are doing over here. Right, so this is our feature, these are our target labels, and we have divided the dataset into a training and a testing set. Now, let me go ahead and create an instance of this decision tree classifier; I'll name this instance, let's say, my_tree.

Right, and I'll go ahead and fit the model on top of the training set, so my_tree.fit, which takes in two parameters: X_train and y_train. I've fit the model; now it's time to predict the values: my_tree.predict, and I'll be predicting on top of X_test. Now, I'll import the metrics.

So, let me calculate the confusion matrix and also the accuracy score. For the confusion matrix, I'll actually have to store the predictions in an object first, so again I'll be storing them into y_pred; let me run the cell again. This again takes in two parameters: the first parameter is all of the actual values, which are present in y_test, and the second is all of the predicted values, which are present in y_pred. So, this is our confusion matrix. Now, let me calculate the accuracy, which would be 965 plus 87 divided by 965 plus 87 plus 281 plus 76.
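The decision-tree steps follow the same fit/predict pattern; a minimal sketch on a small invented stand-in table (the column names mirror the dataset, the values are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Invented rows standing in for customer_churn's tenure and Churn columns
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13],
    "Churn":  ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"],
})

x = df[["tenure"]]
y = df["Churn"]

# 80/20 split, as in the walkthrough
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=0
)

my_tree = DecisionTreeClassifier()
my_tree.fit(x_train, y_train)      # learn splits from the train set only
y_pred = my_tree.predict(x_test)

cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```

On the real dataset, the same code produces the 2×2 matrix whose diagonal sum over 1409 gives the roughly 74 percent figure.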

So, for this decision tree model, we get an accuracy of around 74 percent. The process is entirely the same, guys. What we are doing, again at the risk of being redundant, is basically this: we find out our features and our target variable first, we divide the features and target into a training and a testing split, then we go ahead and build the model on top of the train data, then we predict the values on top of the test data, and once the prediction is done, we calculate the accuracy to find out how well our model has learned. Right, and in the entire data science lifecycle, the most important part is the data pre-processing part, which comes even before the

model building. So, which algorithm to use depends on your problem statement and on what exactly you are trying to find out. Let's say your dependent variable is continuous; if it's a continuous numerical variable, then you will go with linear regression. When it comes to logistic regression, it is a binary classifier, so if you have just two labels in your dependent variable, like the case over here, this is where you will go with logistic regression. But then again, if you have multiple categories, or if it's a multi-class classification problem, then you will most probably go with a decision tree or a random forest.

And again, when you compare a decision tree and a random forest, the random forest is generally better than the decision tree, because the random forest is an ensemble model. Through your course, you would have learned that a random forest is nothing but an ensemble of decision trees, and the accuracy, or the prediction, given by a random forest is usually better than that of a single decision tree. So, whatever accuracy the decision tree gives, the accuracy given by a random forest tends to be better. Just to verify that, let me actually go ahead and build a random

forest model on this same data over here. Right, so we take the same x and y values. Now, I think the problem statement was different out there: with linear regression, we were trying to understand how the monthly charges vary with tenure, but the other algorithms which we've built were to understand how churn varies with other factors. So, over there, monthly charges was a numerical column, but churn is a categorical column. That means we can compare logistic regression and the decision tree. Right, so if we compare logistic regression and the decision tree, logistic regression gives us an accuracy of 77 percent. Yes, that's right: logistic regression, till now, has given us the best accuracy. So now, finally, we'll also go ahead

and build an ensemble model, which is random forest. So, let's actually compare the accuracy given by the decision tree and the random forest. From sklearn.ensemble, I'll import RandomForestClassifier, and I'll create an instance of this; maybe I'll name it rf. Correct, and I'll go ahead and fit the model on top of the training set, passing in X_train and y_train. I've fit the model; now it's finally time to predict the values, so rf.predict, and I am predicting the values on top of X_test. Let me build the confusion matrix over here. Now, let me get the accuracy: I'll just type in accuracy_score, and I'll pass in y_test and y_pred. So, you get an accuracy of 74.66.
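The random-forest step just described is the same fit/predict pattern with a different estimator; a minimal sketch on a small invented stand-in table (column names mirror the dataset, values are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Invented stand-in rows for customer_churn's tenure and Churn columns
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13],
    "Churn":  ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"],
})

x = df[["tenure"]]
y = df["Churn"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=0
)

# An ensemble of decision trees; n_estimators controls how many trees are grown
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
acc = accuracy_score(y_test, y_pred)
print(acc)
```

Swapping DecisionTreeClassifier for RandomForestClassifier is the only change from the previous model, which is why the two accuracies are directly comparable.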

That's not much of a difference when you compare the decision tree and the random forest over here; we've got nearly the same values. But normally, in general, if you are building a random forest model, you'll either get a better accuracy or you'll get about the same accuracy.

So, this is your typical end-to-end data science lifecycle. You'll start off with the data pre-processing and data exploration part, where you will understand the structure of the dataset; you will visualize the dataset and understand whatever is happening underneath it. After that, once you understand and comprehend your data properly, that is when you will go ahead and build your model. Again, when it comes to building your model, you will follow more or less the same procedure: you'll find out your independent variables and your dependent variable, then you'll divide those two into your training and testing sets, you'll build the model on top of the training set, then you'll predict the values on top of the test set, and finally you will calculate the accuracy.

So, this is pretty much it for today's session. This is your entire end-to-end project on the customer churn dataset, which comprised data manipulation, data visualization, and implementing all of the machine learning algorithms. This brings us to the end of the session. If you have any queries, do comment below; we'll reach out immediately. Also, do subscribe to our channel so that you don't miss out on any of the upcoming videos.
