  1. Decision Trees: A survey was sent to the employees of a large company, asking them the following questions:

  • Do you work in the data analytics department? (Y or N)
  • Are you above the age of 30? (Y or N)
  • Have you spent more than 5 years in this company? (Y or N)
  • Is your current gross income more than USD 50,000 per year? (Y or N)

The following table summarizes the responses to the survey. For each entry, “Number of Instances” represents the number of respondents having the corresponding values for the attributes Analytics Department, Age>30, and Tenure>5.

Analytics Department   Age>30   Tenure>5   Number of Instances   Number of Instances
                                           of Income > 50K       of Income ≤ 50K
Y                      Y        Y          25                    0
N                      Y        Y          15                    0
Y                      N        Y          10                    5
Y                      Y        N          0                     0
N                      N        Y          0                     0
N                      Y        N          25                    15
Y                      N        N          0                     10
N                      N        N          0                     20

Given the data above, answer the following questions:

(a) Find support and confidence for the rule:

if Analytics Department = Y Then Income > 50K

(b) Find support and confidence for the rule:

if Analytics Department = Y and Tenure > 5 Then Income > 50K

(c) Using the 1-rule method discussed in class, find the relevant sets of classification rules for the target variable by testing each of the input attributes Analytics Department, Age > 30, and Tenure > 5. Which of these three sets of rules has the lowest misclassification rate?

(d) Considering Income>50K as the target variable, which of the attributes would you select as the root in a decision tree that is constructed using the information gain impurity measure?

(e) Use the Gini index impurity measure and construct the full decision tree for this data set.
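The quantities asked for in this question can be sketched in R on a small count table. The counts below are hypothetical, not the survey data, and the definitions are assumptions based on one common convention (support as the share of all instances that match both sides of the rule, confidence as that count divided by the instances matching the "if" part); the definitions given in class take precedence if they differ.

```r
# Hypothetical class counts for one binary attribute A (not the survey data).
# Rows are attribute values, columns are classes of the target variable.
counts <- matrix(c(30, 10,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(A = c("Y", "N"), class = c("pos", "neg")))

# Rule "if A = Y then pos" (assumed definitions, see the text above):
support    <- counts["Y", "pos"] / sum(counts)        # both sides hold / all instances
confidence <- counts["Y", "pos"] / sum(counts["Y", ]) # both sides hold / antecedent holds

# 1-rule on attribute A: predict the majority class for each value of A,
# so the minority count in each row is misclassified.
one_rule_error <- sum(apply(counts, 1, min)) / sum(counts)

# Impurity of a vector of class counts.
entropy <- function(n) {
  p <- n[n > 0] / sum(n)   # drop empty classes so that 0 * log2(0) counts as 0
  -sum(p * log2(p))
}
gini <- function(n) {
  p <- n / sum(n)          # assumes the node contains at least one instance
  1 - sum(p^2)
}

# Impurity reduction from splitting the parent node on the attribute:
# parent impurity minus the size-weighted impurity of the child nodes.
split_gain <- function(counts, impurity) {
  parent   <- impurity(colSums(counts))
  weights  <- rowSums(counts) / sum(counts)
  children <- apply(counts, 1, impurity)
  parent - sum(weights * children)
}

info_gain <- split_gain(counts, entropy)
gini_gain <- split_gain(counts, gini)
```

Repeating `split_gain` for each candidate attribute and picking the largest value is one way to choose a root; combinations with zero respondents need to be left out before calling `gini`.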

  2. Exploratory Data Analysis: This exercise relates to the household income and expense dataset available on Blackboard as “Inc Exp Data.csv”. The data was taken from Kaggle and has 7 variables related to the income and expense details of households. The following table defines the variables in the data:

Variable Name              Description
Mthly HH Income            Monthly household income
Mthly HH Expense           Monthly household expenses
No of Fly Members          Number of family members
Emi or Rent Amt            Rent or mortgage installment amount
Annual HH Income           Annual household income
Highest Qualified Member   Academic qualification of highest qualified family member
No of Earning Members      Number of earning family members

Load the dataset into R and answer the following questions:

(a) How many rows and columns are in the dataset?

(b) Convert the variable “Highest Qualified Member” to a factor variable. Print the summary of the dataset and explain the key points of the summary for “Mthly HH Income” and “Highest Qualified Member”.

(c) Calculate the mean and standard deviation of all numeric columns.

Hint: Use the dplyr package to keep only the numeric columns using the is.numeric filter and then generate summary statistics.

(d) Calculate the disposable income of households as the difference between monthly income and expenses. Plot a histogram of disposable income with 10 breaks.

Hint: Use the hist function and look at the help file for the “breaks” argument.

(e) Construct a boxplot for monthly household income against the highest qualified member in a household. Your boxplots should be in the sequence illiterate, undergraduate, professional, graduate, post-graduate.

Hint: You may need to redefine the levels of the factor variable “Highest Qualified Member”. Use the levels argument in the factor command. Use the boxplot function. You should get 5 box plots in the same chart.

(f) For families with no more than 4 family members, calculate the average monthly household income by highest qualified member using dplyr. Then, create a bar chart using ggplot2 demonstrating the same information.

Hint: Use chaining for dplyr filter, group_by and summarize and pass the result to the ggplot function.
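The hints in this question can be illustrated on a toy data frame. This is a minimal sketch: the column names, values, and qualification labels below are hypothetical stand-ins and will not match the Blackboard file. The dplyr step runs only if the package is installed.

```r
# Toy stand-in for the household data; names and values are hypothetical.
households <- data.frame(
  income        = c(12000, 30000, 45000, 18000, 60000, 25000),
  expense       = c(10000, 18000, 20000, 15000, 25000, 20000),
  members       = c(3, 4, 6, 2, 5, 4),
  qualification = c("Graduate", "Illiterate", "Post-Graduate",
                    "Under-Graduate", "Professional", "Graduate")
)

# An explicit level order fixes the left-to-right order of the boxes.
households$qualification <- factor(
  households$qualification,
  levels = c("Illiterate", "Under-Graduate", "Professional",
             "Graduate", "Post-Graduate")
)

# Base R route to the mean and sd of the numeric columns only.
is_num    <- sapply(households, is.numeric)
col_means <- sapply(households[is_num], mean)
col_sds   <- sapply(households[is_num], sd)

households$disposable <- households$income - households$expense

# A single number for breaks is only a suggestion: hist() rounds to "pretty"
# cut points, so the number of bars can differ from the value requested.
h <- hist(households$disposable, breaks = 10,
          main = "Disposable income", xlab = "Income minus expense")

# One box per factor level, drawn in level order.
boxplot(income ~ qualification, data = households,
        xlab = "Highest qualification", ylab = "Monthly income")

# Chained dplyr version of a filtered group mean (skipped if dplyr is missing).
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  small_households <- households %>%
    filter(members <= 4) %>%
    group_by(qualification) %>%
    summarise(mean_income = mean(income))
  print(small_households)
}
```

If exactly 10 bars are required, passing a vector of cut points to `breaks` (for example from `seq`) is more predictable than a single number. The summarized data frame from the last step can be piped on to `ggplot()` with `geom_col()` for the bar chart.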

  3. Logistic Regression: This exercise relates to the diabetes dataset available on Blackboard as diabetes.csv. It contains demographic and medical data for 768 females over the age of 21. The variables are defined below:

Variable Name              Description
Pregnancies                Number of times pregnant
Glucose                    Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure              Diastolic blood pressure (mm Hg)
SkinThickness              Triceps skin fold thickness (mm)
Insulin                    2-hour serum insulin (mu U/ml)
BMI                        Body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction   Diabetes pedigree function
Age                        Age (in years)
Outcome                    Class variable (0 if no diabetes, 1 if individual has diabetes)

Please answer the following questions:

(a) Load the data into R. Print the structure of the dataset and explain the output.

Hint: Use the read.csv and str commands. This can be done in 2 lines of code.

(b) Convert the variable Outcome into a factor variable. Print the frequency distribution of the Outcome variable using the table command and explain what it means.

Hint: Use the as.factor and table commands. You only need two lines of code for this.

(c) Create your training set with a random selection of 70% of the rows in the dataset and your testing set with the other 30%. Use seed value 123 for this randomization. Print the frequency distribution of the outcome variable in both train and test data. Are the two datasets similar in terms of the distribution of the outcome variable? Explain.

Hint: You can use the sample command for the split. You will also need the set.seed command.

(d) Train a logistic regression model on the training dataset. How many of the variables are significant?

Hint: Use the glm and summary commands for this part.

(e) Generate predictions on the testing dataset using the model produced through logistic regression in part (d). Report the confusion matrix of your logistic regression model on the train set when the threshold is set to 0.25. Compute the accuracy, true positive rate, and false positive rate for the model.

Hint: You can use the predict function for generating testing predictions, an ifelse command to create binary predictions, and table to create a confusion matrix. This should take only 3 lines of code.

(f) Generate ROC plots and precision-recall plots for both the training and the testing datasets. Report the area under the curve and attach the plots in your final submission. Provide brief explanations of what each curve and its respective AUC represent.

Hint: Use the ROCR library.

(g) An individual displays the following traits: pregnancies = 1, glucose = 130, blood pressure = 80, skin thickness = 22, insulin = 100, BMI = 25, diabetes pedigree function = 0.5, age = 50. According to your final model, what is the probability that the individual has diabetes? Show your working.

Note: This is a manual calculation. Do not do this part with R. You can round the coefficient estimates to 2 decimal places for ease of work.
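The workflow in parts (c) to (e) and (g) can be sketched on simulated data. This is a minimal illustration, not the diabetes analysis: the variables `x1`, `x2`, `y` and their effect sizes are hypothetical, and the confusion matrix is built on the held-out rows (the same steps apply to whichever partition is required). The last lines show the logistic formula behind the manual calculation, p = 1 / (1 + exp(-z)), where z is the intercept plus the sum of each coefficient times its value.

```r
set.seed(123)

# Simulated stand-in data; variable names and effect sizes are hypothetical.
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1)))

# 70/30 split by sampling row indices without replacement.
train_idx <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Logistic regression: with a factor response, glm models the second level ("1").
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Predicted probabilities, a 0.25 cutoff, and the confusion matrix.
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.25, 1, 0), levels = c(0, 1))
cm   <- table(actual = test$y, predicted = pred)

accuracy <- sum(diag(cm)) / sum(cm)
tpr <- cm["1", "1"] / sum(cm["1", ])   # true positives / actual positives
fpr <- cm["0", "1"] / sum(cm["0", ])   # false positives / actual negatives

# Probability for one new case from the coefficients alone.
new_case <- c(1, 0.3, -1.0)            # 1 for the intercept, then x1 and x2
z        <- sum(coef(fit) * new_case)  # linear predictor
p_manual <- 1 / (1 + exp(-z))          # logistic function
```

Lowering the cutoff from 0.5 to 0.25 labels more cases as positive, which usually raises both the true positive rate and the false positive rate.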

  4. Decision Trees: Following the steps defined below, create a decision tree model to predict whether an individual has diabetes:

(a) Using the training and testing partitions created in question 3, create a decision tree model on your training data to predict the “Outcome” variable using the rpart function. In your console, print the decision tree model just made and explain how to read the output and what each value means. You don’t have to explain every node. Just a few terminal nodes to show you understand how to interpret the output.

Hint: Use the rpart library for this part.

(b) There are some parameters that control how the decision tree model works. These can be accessed in the help file of rpart. Create a decision tree model where every terminal node has at least 25 observations. Do you notice any difference between this model and the model created in part (a) above? Explain.

Hint: Type “?rpart” to bring up the help file and scroll down to controls. You will see a hyperlink titled “rpart.control”. Click on the hyperlink and read the help file.

(c) Plot the decision tree model obtained in part (b) of this question using rpart.plot.

(d) Predict the probability of having diabetes for each observation in both training and test data. Create the ROC plot and precision-recall curves and report the area under the curve for all curves.

Hint: You can use the predict function and ROCR library as in question 3.

(e) Compare the output of part (d) of this question to part (f) of question 3. Which model is better? Why?

(f) An individual displays the following traits: pregnancies = 1, glucose = 130, blood pressure = 80, skin thickness = 22, insulin = 100, BMI = 25, diabetes pedigree function = 0.5, age = 50. According to your final model from part (4) above, what is the probability that the individual has diabetes? Explain.
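The rpart calls involved in this question can be sketched on `kyphosis`, a small example dataset that ships with the rpart package. It is used here only as a stand-in for the diabetes data, and the object names are illustrative.

```r
library(rpart)

# Classification tree with default settings on the example data.
default_tree <- rpart(Kyphosis ~ Age + Number + Start,
                      data = kyphosis, method = "class")

# minbucket is the minimum number of observations allowed in a terminal node.
constrained_tree <- rpart(Kyphosis ~ Age + Number + Start,
                          data = kyphosis, method = "class",
                          control = rpart.control(minbucket = 25))

# Each printed line shows: node number, split rule, n, loss, predicted class,
# and class probabilities; lines ending in * are terminal nodes.
print(constrained_tree)

# Class probabilities: one row per observation, one column per class.
probs <- predict(constrained_tree, newdata = kyphosis, type = "prob")

# Sizes of the terminal nodes, read from the fitted tree's frame.
leaf_sizes <- constrained_tree$frame$n[constrained_tree$frame$var == "<leaf>"]
```

A tree predicts the same probability for every observation that falls into the same terminal node, so the probability for a single individual is read off the leaf that the individual's values lead to.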
