# Is job satisfaction in the description?

Analyzing job description text data to infer job satisfaction and stress

# Background

This is a homework assignment I completed for a graduate course in Data Science and Predictive Analysis at the University of Michigan in Fall 2018.

Problem 4.1

Construct an R protocol to examine the job-stress level (Stress_Level) and hiring-potential (Hiring_Potential) using the job description (JD) (Description) text.

# Set-Up Workspace

## Dataset Description

• The data we are using for homework 4 is the SOCR Dataset Ranking the Best and Worst USA Jobs for 2011.
• The variables in this data are
• Index: Ranks of the 200 most common jobs based on the Overall_Score variable (a lower rank/index indicates a better job, and a higher rank indicates a less desirable job for 2011). The Overall Rank Index represents the sum of the rankings in each of the five variables (below), where each variable is assumed (though this may not necessarily be true) to be equally important.
• Job Title: title of the profession
• Overall_Score: A Dependent variable corresponding to the linearly predicted overall rank of this job using the remaining explanatory variables (below).
• Average_Income (USD): An estimate of the mid-level income for this profession.
• Work_Environment: Work environment for each job reflects 2 basic factors: physical and emotional components. Points are assigned for potential for adverse working conditions (larger scores). In other words, fewer work-environment points would yield a lower (better) job rank and higher points reflect lower quality environments.
• Stress_Level: Expected level of job-related stress, dependent on 11 potential stress factors. A high stress score implies that stress is a major part of the job, and a lower stress score indicates lower job-related stress.
• Stress_Category: Categorical data where lower numbers appear to correspond to less stress and higher numbers to more stress.
• Physical_Demand: The total physical demand of a job. A higher number means the job is more physically demanding.
• Hiring_Potential: The hiring-potential rank assigns higher scores to jobs with promising future outlook and lower scores for less future potential.
• Description: A brief job description for each profession.

## Download the data

``````# install.packages('rvest')
library(rvest)  #web scraping package

wiki_url <- read_html('http://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking#2011_Ranking_of_the_200_most_common_Jobs_in_the_US')

job_data <- html_table(html_nodes(wiki_url, "table")[[1]])  # first table on the page
``````

## View the data

``````# review data
library(DT)
``````
``````## Warning: package 'DT' was built under R version 3.4.4
``````
``````datatable(job_data, options = list(
autoWidth = TRUE,
columnDefs = list(list(width = '100%', targets = c(1, 3)))
))
``````

• We have 200 observations (200 jobs) and 10 variables (information about the jobs).
``````str(job_data)
``````
``````## 'data.frame':	200 obs. of  10 variables:
##  $ Index              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Job_Title          : chr  "Software_Engineer" "Mathematician" "Actuary" "Statistician" ...
##  $ Overall_Score      : int  60 73 123 129 147 175 182 192 195 197 ...
##  $ Average_Income(USD): int  87140 94178 87204 73208 77153 85210 74278 63208 63144 67107 ...
##  $ Work_Environment   : num  150 89.7 179.4 89.5 90.8 ...
##  $ Stress_Level       : num  10.4 12.8 16 14.1 16.5 ...
##  $ Stress_Category    : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Physical_Demand    : num  5 3.97 3.97 3.95 5.08 6.98 4.98 5.09 7.43 7 ...
##  $ Hiring_Potential   : num  27.4 19.8 17 11.1 15.5 ...
##  $ Description        : chr  "Researches_designs_develops_and_maintains_software_systems_along_with_hardware_development_for_medical_scientif"| __truncated__ "Applies_mathematical_theories_and_formulas_to_teach_or_solve_problems_in_a_business_educational_or_industrial_climate" "Interprets_statistics_to_determine_probabilities_of_accidents_sickness_and_death_and_loss_of_property_from_thef"| __truncated__ "Tabulates_analyzes_and_interprets_the_numeric_results_of_experiments_and_surveys" ...
``````

## Prepare the data

• Remove Index variable, which contains the rank-ordering of the jobs
``````job_data$Index <- NULL
``````
• Turn Stress_Category into a factor
``````# turn Stress_Category into a factor
job_data$Stress_Category <- factor(job_data$Stress_Category)
``````
• Remove underscores, apostrophes, and a few common stop words from the job description variable using `gsub`
``````# remove the underscores in the job descriptions using the gsub() function
job_data$Description <- gsub("_", " ", job_data$Description)

# remove the apostrophes
job_data$Description <- gsub("'", "", job_data$Description)

# remove "and", "the", "for", "from", and "with" because they carry little meaning
job_data$Description <- gsub(" and ", " ", job_data$Description)
job_data$Description <- gsub(" the ", " ", job_data$Description)
job_data$Description <- gsub(" for ", " ", job_data$Description)
job_data$Description <- gsub(" from ", " ", job_data$Description)
job_data$Description <- gsub(" with ", " ", job_data$Description)
``````
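The five stop-word substitutions can also be collapsed into a single pattern. The following is a minimal sketch in base R, not the code used for the results in this post; the sample string and the `stop_words` vector are made up for illustration.

```r
# Hypothetical sample description in the same underscore-delimited format
desc <- "Researches_designs_and_maintains_software_for_the_medical_field"

# Replace underscores with spaces, as in the step above
clean <- gsub("_", " ", desc, fixed = TRUE)

# Assumed stop-word list (the same five words removed above)
stop_words <- c("and", "the", "for", "from", "with")

# One pattern with word boundaries, so only whole words are matched
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", clean, perl = TRUE)

# Collapse the repeated spaces left behind by the removals
clean <- gsub("[[:space:]]+", " ", trimws(clean))
print(clean)
```

Unlike the `" and "` style patterns, the word-boundary version would also catch a stop word at the very start or end of a description.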

# Process text data for analysis

• Install `tm` package
``````# install.packages("tm", repos = "http://cran.us.r-project.org")
# requires R V.3.3.1 +
library(tm)
``````
• The first step in text mining is to convert the text features (text elements) into a corpus object, which is a collection of text documents.
• From homework: Convert the textual JD meta-data into a corpus object.
``````job_data_corpus <- Corpus(VectorSource(job_data$Description))
## VectorSource() interprets each element of the vector x as a document
## A Corpus is unique because Corpus objects store text alongside metadata, like author, datetimestamp, id, language, etc.

print(job_data_corpus)
``````
``````## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 200
``````
• Each document represents a single description for a job. We have 200 jobs in our dataset, thus 200 documents corresponding to 200 job descriptions.
``````inspect(job_data_corpus[1:3])
``````
``````## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
##
##  Researches designs develops maintains software systems along hardware development medical scientific industrial purposes
##  Applies mathematical theories formulas to teach or solve problems in a business educational or industrial climate
##  Interprets statistics to determine probabilities of accidents sickness death loss of property theft natural disasters
``````
• Use `tm_map()` function for cleaning the corpus document.
• From homework: Triage some of the irrelevant punctuation and other symbols in the corpus document, change all text to lower case, etc.
``````corpus_clean <- tm_map(job_data_corpus, tolower)  ## changed all the characters to lower case

corpus_clean <- tm_map(corpus_clean, removePunctuation) ## removed all punctuations

corpus_clean <- tm_map(corpus_clean, removeNumbers) ## remove all numbers

corpus_clean <- tm_map(corpus_clean, stripWhitespace) ## remove all extra white spaces  (typically created by deleting punctuations)
``````
• Inspect the corpus to make sure everything looks alright
``````inspect(corpus_clean[21:27])
``````
``````## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 7
##
##  assesses hearing speech language disabilities provides treatment assists individuals communication disorders through diagnostic techniques
##  uses principles of physics mathematics to understand workings of universe
##  loan officers help people apply loans this lets people do things like buy a house or a car or pay college
##  plans drilling locations effective production methods optimal access to oil natural gas
##  assesses patients dietary needs plans menus instructs patients their families about proper nutritional care
##  transforms scientific technical information into readily understandable language
##  diagnoses visual disorders prescribes administers corrective rehabilitative treatments
``````
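For intuition, the four `tm_map()` steps roughly correspond to the base R operations below. This is a sketch on a made-up string (`raw` is hypothetical), not a replacement for the `tm` pipeline, and `tm` may differ in details such as how it treats leading whitespace.

```r
# Hypothetical raw text with mixed case, punctuation, numbers, extra spaces
raw <- "Plans  drilling-locations, 24 hours; Optimal access!"

clean <- tolower(raw)                        # like tolower
clean <- gsub("[[:punct:]]+", "", clean)     # like removePunctuation
clean <- gsub("[[:digit:]]+", "", clean)     # like removeNumbers
clean <- gsub("[[:space:]]+", " ", trimws(clean))  # like stripWhitespace
print(clean)
```

Note that deleting punctuation joins hyphenated words ("drilling-locations" becomes "drillinglocations"), which is worth checking for when inspecting the cleaned corpus.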
• The `DocumentTermMatrix()` function tokenizes each job description into words and counts how often each term occurs in each document of the corpus.
• From homework: Tokenize the job descriptions into words.
``````job_data_dtm <- DocumentTermMatrix(corpus_clean)
``````
• Inspect to make sure DocumentTermMatrix looks correct
``````head(job_data_dtm$dimnames$Terms)
``````
``````## [1] "along"       "designs"     "development" "develops"    "hardware"
## [6] "industrial"
``````
``````tail(job_data_dtm$dimnames$Terms)
``````
``````## [1] "framework" "raises"    "steel"     "pipelines" "rigs"      "shore"
``````
``````inspect(job_data_dtm[,1:5])
``````
``````## <<DocumentTermMatrix (documents: 200, terms: 5)>>
## Non-/sparse entries: 25/975
## Sparsity           : 98%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs along designs development develops hardware
##   1      1       1           1        1        1
##   14     0       0           0        1        0
##   17     0       0           0        1        0
##   19     0       1           0        1        0
##   49     0       1           0        0        0
##   5      0       0           0        1        0
##   51     0       1           0        1        0
##   52     0       0           1        0        0
##   62     0       0           0        1        0
##   69     0       0           0        1        0
``````
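To make the structure of a document-term matrix concrete, here is a small base R sketch that builds one by hand for three made-up documents. The `docs` vector and all object names are hypothetical; `tm` stores the real matrix in a sparse format rather than as a dense matrix like this.

```r
# Three hypothetical, already-cleaned documents
docs <- c("designs develops software",
          "develops hardware",
          "designs hardware hardware")

# Tokenize on single spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document (terms x documents)
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = vocab))),
                 integer(length(vocab)))

# Transpose so rows are documents and columns are terms
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)

# Total term frequencies and the share of zero cells (sparsity)
term_freq <- sort(colSums(toy_dtm), decreasing = TRUE)
sparsity <- mean(toy_dtm == 0)
print(toy_dtm)
```

Each row is a document, each column a term, and most cells are zero, which is why the real matrix above reports 98% sparsity.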

## Visualize Job Description Text

``````library(plotly)
``````
``````# Frequency
freq <- sort(colSums(as.matrix(job_data_dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)

# Plot Histogram
h1 <- subset(wf, freq>8)    %>%
ggplot(aes(word, freq)) +
geom_bar(stat="identity", fill="darkred", colour="darkblue") +
theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("Frequency Histogram")
ggplotly(h1)
``````

• From homework: Examine the distributions of Stress_Category and Hiring_Potential.
``````psych::describe(job_data$Hiring_Potential)
``````
``````##    vars   n mean    sd median trimmed   mad    min   max range  skew
## X1    1 200 3.84 12.15   4.59    4.01 11.13 -40.76 37.05 77.81 -0.26
##    kurtosis   se
## X1     0.68 0.86
``````
``````table(job_data$Stress_Category)
``````
``````##
##  0  1  2  3  4  5
## 12 85 64 24 13  2
``````
``````plot(job_data$Stress_Category, main = "Stress Category Frequency")
``````
``````q <- ggplot(job_data, aes(x=Hiring_Potential)) +
geom_histogram(binwidth=.5, colour="black", fill="white") +
geom_vline(aes(xintercept=mean(Hiring_Potential, na.rm=T)),   # Ignore NA values for mean
color="red", linetype="dashed", size=1) + ggtitle("Hiring Potential Frequency")
ggplotly(q)
``````

``````p <- ggplot(data=job_data, aes(x=Stress_Category, y = Hiring_Potential)) +
geom_bar(stat="identity") + ggtitle("Hiring Potential by Stress Category")

ggplotly(p)
``````

``````h2 <- ggplot(data=job_data, aes(x=Stress_Category, y = Hiring_Potential, fill = Stress_Category)) + geom_boxplot() + ggtitle("Boxplot of Hiring Potential by Stress Category") + guides(fill=FALSE)

ggplotly(h2)
``````

``````library(ggplot2)
g <- ggplot(job_data) + geom_point(aes(x=Hiring_Potential, y=Work_Environment, colour=Stress_Category))
library(plotly)
ggplotly(g)
``````

# Split the data 90:10 training:testing

``````set.seed(12345)
subset_int <- sample(nrow(job_data),floor(nrow(job_data)*0.9))  # 90% training + 10% testing
``````
``````job_data_train<-job_data[subset_int, ]
job_data_test<-job_data[-subset_int, ]

job_data_dtm_train<-job_data_dtm[subset_int, ]
job_data_dtm_test<-job_data_dtm[-subset_int, ]

corpus_train <- corpus_clean[subset_int]
corpus_test <- corpus_clean[-subset_int]
``````
``````round(prop.table(table(job_data_train$Stress_Category)), digits = 2)
``````
``````##
##    0    1    2    3    4    5
## 0.07 0.43 0.31 0.12 0.07 0.01
``````
``````round(prop.table(table(job_data_test$Stress_Category)), digits = 2)
``````
``````##
##    0    1    2    3    4    5
## 0.00 0.40 0.45 0.15 0.00 0.00
``````
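The split relies on `sample()` drawing 180 row indices and on negative indexing selecting the remaining 20. A minimal sketch on a stand-in data frame (`toy` is hypothetical) shows that the two parts are disjoint and together cover all rows:

```r
set.seed(12345)
n <- 200                                   # number of jobs in the dataset
train_idx <- sample(n, floor(n * 0.9))     # 90% of the row indices

# Stand-in data frame with one row per job
toy <- data.frame(id = seq_len(n))
toy_train <- toy[train_idx, , drop = FALSE]
toy_test <- toy[-train_idx, , drop = FALSE]   # complement of the training rows

c(train = nrow(toy_train), test = nrow(toy_test))
```

With only 20 test rows, rare classes can be missing from the test set entirely, as the proportions above show for categories 0, 4, and 5.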

# Binarize Job Stress

## Binarize the Job Stress for training

• Binarize the Job Stress into two categories (low/high stress levels)
• `a %in% b` is an intuitive interface to `match()`: a binary operator that returns a logical vector (TRUE or FALSE) indicating whether each element of the left operand has a match in the right operand.
``````plot(job_data_train$Stress_Category)
``````
• Categories 3 to 5 are coded as high stress, and categories 0 to 2 as low stress
``````job_data_train$stress_binary <- job_data_train$Stress_Category %in% c(3:5)
job_data_train$stress_binary <- factor(job_data_train$stress_binary, levels=c(F, T), labels = c("low_stress", "high_stress"))
``````
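As a small illustration of the binarization step, the sketch below applies `%in%` and `factor()` to a made-up vector of stress categories (`stress_category` is hypothetical data, not taken from the job table):

```r
# Hypothetical stress categories, stored as a factor like Stress_Category
stress_category <- factor(c(0, 1, 2, 3, 4, 5, 1, 3))

# TRUE where the category is 3, 4, or 5
is_high <- stress_category %in% 3:5

# Relabel the logical vector as a two-level factor
stress_binary <- factor(is_high, levels = c(FALSE, TRUE),
                        labels = c("low_stress", "high_stress"))
table(stress_binary)
```

The comparison works even though the left side is a factor and the right side is numeric, because `%in%` compares factor labels as character values.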

## Binarize the Job Stress for testing

``````plot(job_data_test$Stress_Category)
``````
• Binarize the Job Stress into two categories (low/high stress levels)
``````job_data_test$stress_binary <- job_data_test$Stress_Category %in% c(3:5)
job_data_test$stress_binary <- factor(job_data_test$stress_binary, levels=c(F, T), labels = c("low_stress", "high_stress"))
``````

## Compare the proportions

``````prop.table(table(job_data_train$stress_binary))
``````
``````##
##  low_stress high_stress
##         0.8         0.2
``````
``````prop.table(table(job_data_test$stress_binary))
``````
``````##
##  low_stress high_stress
##        0.85        0.15
``````

# Word Cloud & Visualization

## Word Cloud to Visualize Training Data Job Desc

``````# install.packages("wordcloud", repos = "http://cran.us.r-project.org")
library(wordcloud)
``````
``````## Warning: package 'wordcloud' was built under R version 3.4.4
``````
``````## Loading required package: RColorBrewer
``````
``````library(RColorBrewer)
library(wesanderson)
``````
``````## Warning: package 'wesanderson' was built under R version 3.4.4
``````
``````pal <- rainbow(10)
library(wordcloud)
wordcloud(corpus_train, max.words =100, min.freq=10, scale=c(4,.5),
random.order = FALSE, rot.per=.5, vfont=c("sans serif","plain"), colors=pal)
``````

## Visualize differences between low and high stress categories

``````low_stress<-subset(job_data_train, stress_binary=="low_stress")
wordcloud(low_stress$Description, max.words = 20, min.freq=5, scale=c(4,.5),
random.order = FALSE, rot.per=.2, vfont=c("sans serif","plain"), colors=pal)
``````
``````high_stress<-subset(job_data_train, stress_binary=="high_stress")
wordcloud(high_stress$Description, max.words = 20, min.freq=2, scale=c(4,.5),
random.order = FALSE, rot.per=.2, vfont=c("sans serif","plain"), colors=pal)
``````
``````# Words in job description by frequency
freq_dtm_train <- sort(colSums(as.matrix(job_data_dtm_train)), decreasing=TRUE)
freq_dtm_test <- sort(colSums(as.matrix(job_data_dtm_test)), decreasing=TRUE)
``````
``````ggplot(job_data_train, aes(x=Hiring_Potential)) +
geom_density(fill="white") + ggtitle("Hiring Potential Density") + facet_wrap(~stress_binary)
``````

# Word count into categorical data

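• We are going to turn word frequencies into features.
• We will create indicators for words that occur at least 3 times in the training data. Note that `findFreqTerms()` thresholds on a term's total count across all documents, not on the number of documents containing it. For reference, 303 terms occur at least twice; the threshold of 3 used for the dictionary leaves 142.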
``````summary(findFreqTerms(job_data_dtm_train, 2))
``````
``````##    Length     Class      Mode
##       303 character character
``````
``````job_data_dict <- as.character(findFreqTerms(job_data_dtm_train, 3))
job_train <- DocumentTermMatrix(corpus_train, list(dictionary=job_data_dict))
job_test <- DocumentTermMatrix(corpus_test, list(dictionary=job_data_dict))
``````
• The Naive Bayes classifier trains on data with categorical features, as it uses frequency tables for learning the data affinities. To create the combinations of class and feature values comprising the frequency table (matrix), all features must be categorical.
``````convert_counts <- function(wordFreq) {
wordFreq <- ifelse(wordFreq > 0, 1, 0)
wordFreq <- factor(wordFreq, levels = c(0, 1), labels = c("No", "Yes"))
return(wordFreq)
}
``````
``````job_train <- apply(job_train, MARGIN = 2, convert_counts)
job_test <- apply(job_test, MARGIN = 2, convert_counts)

# Check the structure of the data
dim(job_train)
``````
``````## [1] 180 142
``````
``````dim(job_test)
``````
``````## [1]  20 142
``````
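To see what the conversion does, the sketch below applies a function equivalent to `convert_counts()` to a tiny made-up count matrix (`toy_counts` is hypothetical). One detail worth knowing: `apply()` assembles the per-column results into a plain matrix, so the factor labels end up as character values.

```r
# Same logic as convert_counts() above: any positive count becomes "Yes"
convert_counts <- function(word_freq) {
  factor(ifelse(word_freq > 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
}

# Hypothetical counts: 3 documents (rows) by 2 terms (columns)
toy_counts <- matrix(c(0, 2, 1,
                       0, 0, 3),
                     nrow = 3,
                     dimnames = list(NULL, c("designs", "hardware")))

# Convert column by column, as done for job_train and job_test
toy_indicators <- apply(toy_counts, MARGIN = 2, convert_counts)
print(toy_indicators)

# Share of "Yes" cells, the complement of the sparsity
prop_yes <- mean(toy_indicators == "Yes")
```

The count of 2 and the count of 1 both map to "Yes", so the features record only presence or absence of a term.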

# Report sparsity

• High sparsity of about 97%, indicating mostly “No” values.
``````prop.table(table(job_train))
``````
``````## job_train
##         No        Yes
## 0.97296557 0.02703443
``````
``````prop.table(table(job_test))
``````
``````## job_test
##         No        Yes
## 0.97535211 0.02464789
``````

# Analytics

## Naive Bayes

• Apply the Naive Bayes classifier on the high frequency terms to predict stress level (low/high).
``````# install.packages("e1071", repos = "http://cran.us.r-project.org")
library(e1071)
``````

### Build classifier

``````job_classifier <- naiveBayes(job_train, job_data_train$stress_binary)
``````
``````job_test_pred<-predict(job_classifier, job_test)
``````

### Evaluate model performance

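• Accuracy is 80% (16 of 20 test jobs). This is below the 85% obtained by always predicting low_stress, and none of the 3 high-stress test jobs is identified.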
``````library(gmodels)
``````
``````table_NB1 <- CrossTable(job_test_pred, job_data_test$stress_binary)
``````
``````##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table:  20
##
##
##               | job_data_test$stress_binary
## job_test_pred |  low_stress | high_stress |   Row Total |
## --------------|-------------|-------------|-------------|
##    low_stress |          16 |           3 |          19 |
##               |       0.001 |       0.008 |             |
##               |       0.842 |       0.158 |       0.950 |
##               |       0.941 |       1.000 |             |
##               |       0.800 |       0.150 |             |
## --------------|-------------|-------------|-------------|
##   high_stress |           1 |           0 |           1 |
##               |       0.026 |       0.150 |             |
##               |       1.000 |       0.000 |       0.050 |
##               |       0.059 |       0.000 |             |
##               |       0.050 |       0.000 |             |
## --------------|-------------|-------------|-------------|
##  Column Total |          17 |           3 |          20 |
##               |       0.850 |       0.150 |             |
## --------------|-------------|-------------|-------------|
##
##
``````
• Accuracy of Naive Bayes
``````acc_NB1 <- (table_NB1$t[1,1]+table_NB1$t[2,2])/nrow(job_data_test)
print(acc_NB1)
``````
``````## [1] 0.8
``````
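Accuracy alone hides how the classifier treats the minority class. As a worked example, the sketch below re-enters the four cell counts from the Naive Bayes cross table above by hand and computes the usual summary metrics, treating high_stress as the positive class (that choice, and the object names, are assumptions made for illustration):

```r
# Cell counts copied from the cross table: rows = predicted, columns = actual
nb_counts <- matrix(c(16, 1, 3, 0), nrow = 2,
                    dimnames = list(predicted = c("low_stress", "high_stress"),
                                    actual = c("low_stress", "high_stress")))

# Share of correct predictions
accuracy <- sum(diag(nb_counts)) / sum(nb_counts)

# Sensitivity: share of truly high-stress jobs predicted as high stress
sensitivity <- nb_counts["high_stress", "high_stress"] / sum(nb_counts[, "high_stress"])

# Specificity: share of truly low-stress jobs predicted as low stress
specificity <- nb_counts["low_stress", "low_stress"] / sum(nb_counts[, "low_stress"])

# No-information rate: accuracy of always predicting the majority class
no_info_rate <- max(colSums(nb_counts)) / sum(nb_counts)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_info_rate = no_info_rate), 3)
```

A sensitivity of 0 next to an accuracy of 0.8 is typical for imbalanced test sets, where the majority class dominates the accuracy figure.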

## LDA Prediction

``````library(MASS)
``````
``````# stringsAsFactors = TRUE keeps the "No"/"Yes" columns as factors, so that
# as.numeric() yields codes 1/2 regardless of the R version
df_job_train <- data.frame(lapply(as.data.frame(job_train, stringsAsFactors = TRUE),as.numeric), stage = job_data_train$stress_binary)
df_job_test <- data.frame(lapply(as.data.frame(job_test, stringsAsFactors = TRUE),as.numeric), stage = job_data_test$stress_binary)

set.seed(1234)
job_lda <- lda(data=df_job_train, stage~.)
hn_pred = predict(job_lda, df_job_test)
``````
``````table_LDA1 <- CrossTable(hn_pred$class, df_job_test$stage)
``````
``````##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table:  20
##
##
##               | df_job_test$stage
## hn_pred$class |  low_stress | high_stress |   Row Total |
## --------------|-------------|-------------|-------------|
##    low_stress |          14 |           2 |          16 |
##               |       0.012 |       0.067 |             |
##               |       0.875 |       0.125 |       0.800 |
##               |       0.824 |       0.667 |             |
##               |       0.700 |       0.100 |             |
## --------------|-------------|-------------|-------------|
##   high_stress |           3 |           1 |           4 |
##               |       0.047 |       0.267 |             |
##               |       0.750 |       0.250 |       0.200 |
##               |       0.176 |       0.333 |             |
##               |       0.150 |       0.050 |             |
## --------------|-------------|-------------|-------------|
##  Column Total |          17 |           3 |          20 |
##               |       0.850 |       0.150 |             |
## --------------|-------------|-------------|-------------|
##
##
``````
• Accuracy of LDA
``````# accuracy of LDA
acc_LDA1 <- (table_LDA1$t[1,1]+table_LDA1$t[2,2])/nrow(job_data_test)
print(acc_LDA1)
``````
``````## [1] 0.75
``````
• Error of LDA
``````# error of LDA
error_LDA1 <- (table_LDA1$t[1,2]+table_LDA1$t[2,1])/nrow(job_data_test)
print(error_LDA1)
``````
``````## [1] 0.25
``````
• Specificity: TN/(TN+FP), treating high_stress as the positive class
``````# specificity of LDA: TN is cell [1,1], FP is cell [2,1]
specificity_LDA1 <- (table_LDA1$t[1,1]/(table_LDA1$t[1,1] + table_LDA1$t[2,1]))
specificity_LDA1
``````
``````## [1] 0.8235294
``````
• Sensitivity: TP/(TP+FN)
``````# sensitivity of LDA: TP is cell [2,2], FN is cell [1,2]
sensitivity_LDA1 <- (table_LDA1$t[2,2]/(table_LDA1$t[2,2] + table_LDA1$t[1,2]))
sensitivity_LDA1
``````
``````## [1] 0.3333333
``````

## Decision Tree

``````# install.packages("C50")
library(C50)
``````
``````summary(job_data_train[,-c(1,5,6)])
``````
``````##  Overall_Score   Average_Income(USD) Work_Environment  Physical_Demand
##  Min.   : 60.0   Min.   : 18053      Min.   :  89.72   Min.   : 3.970
##  1st Qu.:367.0   1st Qu.: 33200      1st Qu.: 426.62   1st Qu.: 7.237
##  Median :528.5   Median : 46276      Median : 671.77   Median : 9.970
##  Mean   :505.8   Mean   : 54048      Mean   : 767.03   Mean   :12.868
##  3rd Qu.:645.0   3rd Qu.: 62458      3rd Qu.:1011.42   3rd Qu.:15.863
##  Max.   :892.0   Max.   :365258      Max.   :3314.03   Max.   :43.230
##  Hiring_Potential  Description            stress_binary
##  Min.   :-40.760   Length:180         low_stress :144
##  1st Qu.: -3.583   Class :character   high_stress: 36
##  Median :  4.325   Mode  :character
##  Mean   :  3.581
##  3rd Qu.: 10.575
##  Max.   : 37.050
``````
``````colnames(job_data_train)
``````
``````##  [1] "Job_Title"           "Overall_Score"       "Average_Income(USD)"
##  [4] "Work_Environment"    "Stress_Level"        "Stress_Category"
##  [7] "Physical_Demand"     "Hiring_Potential"    "Description"
## [10] "stress_binary"
``````
``````set.seed(1234)
job_model <-C5.0(job_data_train[,-c(1,5,6,9)], job_data_train$stress_binary)
job_model
``````
``````##
## Call:
## C5.0.default(x = job_data_train[, -c(1, 5, 6, 9)], y
##  = job_data_train$stress_binary)
##
## Classification Tree
## Number of samples: 180
## Number of predictors: 6
##
## Tree size: 2
##
## Non-standard options: attempt to group attributes
``````
``````summary(job_model)
``````
``````##
## Call:
## C5.0.default(x = job_data_train[, -c(1, 5, 6, 9)], y
##  = job_data_train$stress_binary)
##
##
## C5.0 [Release 2.07 GPL Edition]  	Wed Jun 19 19:04:20 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 180 cases (7 attributes) from undefined.data
##
## Decision tree:
##
## stress_binary = low_stress: low_stress (144)
## stress_binary = high_stress: high_stress (36)
##
##
## Evaluation on training data (180 cases):
##
## 	    Decision Tree
## 	  ----------------
## 	  Size      Errors
##
## 	     2    0( 0.0%)   <<
##
##
## 	   (a)   (b)    <-classified as
## 	  ----  ----
## 	   144          (a): class low_stress
## 	          36    (b): class high_stress
##
##
## 	Attribute usage:
##
## 	100.00%	stress_binary
##
##
## Time: 0.0 secs
``````
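• Accuracy is 1, but this is an artifact of label leakage: `job_data_train[,-c(1,5,6,9)]` still contains column 10, `stress_binary`, so the tree splits on the label itself (see "Attribute usage: 100.00% stress_binary" above). Column 10 would also have to be excluded from the predictors to get an honest estimate.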
``````job_pred<-predict(job_model, job_data_test[,-c(1,5,6,9)])
library(caret)
``````
``````## Warning: package 'caret' was built under R version 3.4.4
``````
``````## Loading required package: lattice
``````
``````## Warning: package 'lattice' was built under R version 3.4.4
``````
``````confusionMatrix(table(job_pred, job_data_test$stress_binary))
``````
``````## Confusion Matrix and Statistics
##
##
## job_pred      low_stress high_stress
##   low_stress          17           0
##   high_stress          0           3
##
##                Accuracy : 1
##                  95% CI : (0.8316, 1)
##     No Information Rate : 0.85
##     P-Value [Acc > NIR] : 0.03876
##
##                   Kappa : 1
##  Mcnemar's Test P-Value : NA
##
##             Sensitivity : 1.00
##             Specificity : 1.00
##          Pos Pred Value : 1.00
##          Neg Pred Value : 1.00
##              Prevalence : 0.85
##          Detection Rate : 0.85
##    Detection Prevalence : 0.85
##       Balanced Accuracy : 1.00
##
##        'Positive' Class : low_stress
##
``````
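The perfect score follows from which columns reach the tree. The sketch below uses only the column names printed earlier to show that dropping columns by position leaves `stress_binary` among the predictors, and how selecting by name avoids that. It is an illustration of the column selection only, not a re-run of the model; `drop_cols` is an assumed choice of what to exclude.

```r
# Column names of job_data_train, as printed by colnames() above
cols <- c("Job_Title", "Overall_Score", "Average_Income(USD)",
          "Work_Environment", "Stress_Level", "Stress_Category",
          "Physical_Demand", "Hiring_Potential", "Description",
          "stress_binary")

# Predictors actually passed to C5.0: positions 1, 5, 6, 9 removed
leaky_predictors <- cols[-c(1, 5, 6, 9)]
"stress_binary" %in% leaky_predictors   # the label is still a predictor

# Selecting by name makes the exclusion of the label explicit
drop_cols <- c("Job_Title", "Stress_Level", "Stress_Category",
               "Description", "stress_binary")
clean_predictors <- setdiff(cols, drop_cols)
print(clean_predictors)
```

The six leaky predictors match the "Number of predictors: 6" line in the model output; the name-based selection leaves five.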

## Multivariate Linear Model

• Goal: fit a linear model to predict the overall job ranking (smaller is better), generate some informative pairs plots, use backward step-wise feature selection to simplify the model, and report the AIC.
``````colnames(job_data)
``````
``````## [1] "Job_Title"           "Overall_Score"       "Average_Income(USD)"
## [4] "Work_Environment"    "Stress_Level"        "Stress_Category"
## [7] "Physical_Demand"     "Hiring_Potential"    "Description"
``````
``````fit <- lm(Overall_Score ~., data=job_data[,-c(1,9)])
summary(fit)
``````
``````##
## Call:
## lm(formula = Overall_Score ~ ., data = job_data[, -c(1, 9)])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -217.915  -37.741   -2.335   34.054  221.189
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            2.753e+02  2.122e+01  12.971  < 2e-16 ***
## `Average_Income(USD)` -1.200e-03  1.259e-04  -9.531  < 2e-16 ***
## Work_Environment       1.805e-01  1.718e-02  10.507  < 2e-16 ***
## Stress_Level           4.931e+00  1.461e+00   3.375 0.000897 ***
## Stress_Category1      -9.948e+00  2.034e+01  -0.489 0.625305
## Stress_Category2       9.258e+00  2.975e+01   0.311 0.755974
## Stress_Category3      -7.648e+00  4.278e+01  -0.179 0.858322
## Stress_Category4      -7.113e+01  5.587e+01  -1.273 0.204560
## Stress_Category5      -3.173e+02  8.602e+01  -3.688 0.000295 ***
## Physical_Demand        5.945e+00  7.765e-01   7.656  9.6e-13 ***
## Hiring_Potential      -5.925e+00  3.608e-01 -16.422  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.02 on 189 degrees of freedom
## Multiple R-squared:  0.9173,	Adjusted R-squared:  0.9129
## F-statistic: 209.7 on 10 and 189 DF,  p-value: < 2.2e-16
``````
``````plot(fit, which = 1:2)
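``````
• AIC=1613.73. Backward selection drops no terms: in the output below, removing any single predictor raises the AIC above that of the full model (the `<none>` row).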
``````step(fit,direction = "backward")
``````
``````## Start:  AIC=1613.73
## Overall_Score ~ `Average_Income(USD)` + Work_Environment + Stress_Level +
##     Stress_Category + Physical_Demand + Hiring_Potential
##
##                         Df Sum of Sq     RSS    AIC
## <none>                                572051 1613.7
## - Stress_Level           1     34470  606521 1623.4
## - Stress_Category        5    187299  759350 1660.4
## - Physical_Demand        1    177420  749472 1665.8
## - `Average_Income(USD)`  1    274953  847004 1690.2
## - Work_Environment       1    334145  906196 1703.7
## - Hiring_Potential       1    816211 1388262 1789.0
``````
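The AIC values in this table can be reproduced by hand. For linear models, `step()` appears to use the form n·log(RSS/n) + 2·edf (via `extractAIC()`), where edf is the number of estimated coefficients. The sketch below plugs in the RSS values printed above; the helper name `aic_lm` is made up for illustration.

```r
# AIC in the form used for linear models by extractAIC():
# n * log(RSS / n) + 2 * edf
aic_lm <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n <- 200   # number of jobs

# Full model: intercept + 10 coefficients = 11 parameters
aic_full <- aic_lm(rss = 572051, n = n, edf = 11)

# Model without Stress_Level: one coefficient fewer
aic_no_stress_level <- aic_lm(rss = 606521, n = n, edf = 10)

round(c(full = aic_full, no_stress_level = aic_no_stress_level), 1)
```

Dropping Stress_Level saves one parameter (2 AIC points) but the increase in RSS costs more than that, so the full model is kept.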
``````##
## Call:
## lm(formula = Overall_Score ~ `Average_Income(USD)` + Work_Environment +
##     Stress_Level + Stress_Category + Physical_Demand + Hiring_Potential,
##     data = job_data[, -c(1, 9)])
##
## Coefficients:
##           (Intercept)  `Average_Income(USD)`       Work_Environment
##              275.2575                -0.0012                 0.1805
##          Stress_Level       Stress_Category1       Stress_Category2
##                4.9308                -9.9480                 9.2579
##      Stress_Category3       Stress_Category4       Stress_Category5
##               -7.6480               -71.1279              -317.2594
##       Physical_Demand       Hiring_Potential
##                5.9454                -5.9251
``````

##### Susannah Chandhok
###### Ph.D. Student in Social Psychology

Ph.D. student in social psychology at the University of Michigan in Ann Arbor studying technology, social connection, and well-being.