Working with categorical data in R using Tidyverse
In addition to forcats, we will also use other tidyverse packages, including ggplot2, dplyr, stringr, and tidyr, along with two sample datasets: the fivethirtyeight flying etiquette dataset and Kaggle’s State of Data Science and ML Survey.
The libraries we will be using in this blog are
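A setup chunk along these lines should cover everything used below. This is only a sketch: it assumes the packages named above plus data.table, which is an addition here because fread() is used to read the CSV files later on. Loading library(tidyverse) instead would attach most of these in one call.

```r
# Packages named in the introduction
library(forcats)   # fct_*() helpers for factors
library(ggplot2)   # plotting
library(dplyr)     # mutate(), summarise(), filter(), ...
library(stringr)   # string helpers
library(tidyr)     # reshaping (gather(), pivot_longer())

# Assumption: data.table is needed because the post reads files with fread()
library(data.table)
```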
Intro to factor variables
Let’s check out the datasets before we start our analysis
flying_etiquette <- fread("~/Desktop/R_tutorials/data/flying-etiquette.csv") %>%
  as_tibble()
str(flying_etiquette)
## tibble [1,040 × 27] (S3: tbl_df/tbl/data.frame)
## $ RespondentID :integer64 [1:1040] 3436139758 3434278696 3434275578 3434268208 3434250245 3434245875 3434235351 3434218031 ...
## $ How often do you travel by plane? : chr [1:1040] "Once a year or less" "Once a year or less" "Once a year or less" "Once a year or less" ...
## $ Do you ever recline your seat when you fly? : chr [1:1040] "" "About half the time" "Usually" "Always" ...
## $ How tall are you? : chr [1:1040] "" "6'3\"\"" "5'8\"\"" "5'11\"\"" ...
## $ Do you have any children under 18? : chr [1:1040] "" "Yes" "No" "No" ...
## $ In a row of three seats, who should get to use the two arm rests? : chr [1:1040] "" "The arm rests should be shared" "Whoever puts their arm on the arm rest first" "The arm rests should be shared" ...
## $ In a row of two seats, who should get to use the middle arm rest? : chr [1:1040] "" "The arm rests should be shared" "The arm rests should be shared" "The arm rests should be shared" ...
## $ Who should have control over the window shade? : chr [1:1040] "" "Everyone in the row should have some say" "The person in the window seat should have exclusive control" "Everyone in the row should have some say" ...
## $ Is itrude to move to an unsold seat on a plane? : chr [1:1040] "" "No, not rude at all" "No, not rude at all" "No, not rude at all" ...
## $ Generally speaking, is it rude to say more than a few words tothe stranger sitting next to you on a plane? : chr [1:1040] "" "No, not at all rude" "No, not at all rude" "No, not at all rude" ...
## $ On a 6 hour flight from NYC to LA, how many times is it acceptable to get up if you're not in an aisle seat? : chr [1:1040] "" "Twice" "Three times" "Three times" ...
## $ Under normal circumstances, does a person who reclines their seat during a flight have any obligation to the person sitting behind them?: chr [1:1040] "" "Yes, they should not recline their chair if the person behind them asks them not to" "Yes, they should not recline their chair if the person behind them asks them not to" "No, the person on the flight has no obligation to the person behind them" ...
## $ Is itrude to recline your seat on a plane? : chr [1:1040] "" "Yes, somewhat rude" "No, not rude at all" "No, not rude at all" ...
## $ Given the opportunity, would you eliminate the possibility of reclining seats on planes entirely? : chr [1:1040] "" "No" "No" "No" ...
## $ Is it rude to ask someone to switch seats with you in order to be closer to friends? : chr [1:1040] "" "No, not at all rude" "No, not at all rude" "Yes, somewhat rude" ...
## $ Is itrude to ask someone to switch seats with you in order to be closer to family? : chr [1:1040] "" "No, not at all rude" "No, not at all rude" "No, not at all rude" ...
## $ Is it rude to wake a passenger up if you are trying to go to the bathroom? : chr [1:1040] "" "No, not at all rude" "No, not at all rude" "No, not at all rude" ...
## $ Is itrude to wake a passenger up if you are trying to walk around? : chr [1:1040] "" "No, not at all rude" "Yes, somewhat rude" "Yes, somewhat rude" ...
## $ In general, is itrude to bring a baby on a plane? : chr [1:1040] "" "No, not at all rude" "Yes, somewhat rude" "Yes, somewhat rude" ...
## $ In general, is it rude to knowingly bring unruly children on a plane? : chr [1:1040] "" "No, not at all rude" "Yes, very rude" "Yes, very rude" ...
## $ Have you ever used personal electronics during take off or landing in violation of a flight attendant's direction? : chr [1:1040] "" "No" "No" "No" ...
## $ Have you ever smoked a cigarette in an airplane bathroom when it was against the rules? : chr [1:1040] "" "No" "No" "No" ...
## $ Gender : chr [1:1040] "" "Male" "Male" "Male" ...
## $ Age : chr [1:1040] "" "30-44" "30-44" "30-44" ...
## $ Household Income : chr [1:1040] "" "" "$100,000 - $149,999" "$0 - $24,999" ...
## $ Education : chr [1:1040] "" "Graduate degree" "Bachelor degree" "Bachelor degree" ...
## $ Location (Census Region) : chr [1:1040] "" "Pacific" "Pacific" "Pacific" ...
## - attr(*, ".internal.selfref")=<externalptr>
kaggle_sur <- fread("~/Desktop/R_tutorials/data/kaggle_sur.csv") %>%
  as_tibble()
str(kaggle_sur)
## tibble [16,716 × 47] (S3: tbl_df/tbl/data.frame)
## $ LearningPlatformUsefulnessArxiv : chr [1:16716] NA NA "Very useful" NA ...
## $ LearningPlatformUsefulnessBlogs : chr [1:16716] NA NA NA "Very useful" ...
## $ LearningPlatformUsefulnessCollege : chr [1:16716] NA NA "Somewhat useful" "Very useful" ...
## $ LearningPlatformUsefulnessCompany : chr [1:16716] NA NA NA NA ...
## $ LearningPlatformUsefulnessConferences : chr [1:16716] "Very useful" NA NA "Very useful" ...
## $ LearningPlatformUsefulnessFriends : chr [1:16716] NA NA NA "Very useful" ...
## $ LearningPlatformUsefulnessKaggle : chr [1:16716] NA "Somewhat useful" "Somewhat useful" NA ...
## $ LearningPlatformUsefulnessNewsletters : chr [1:16716] NA NA NA NA ...
## $ LearningPlatformUsefulnessCommunities : chr [1:16716] NA NA NA NA ...
## $ LearningPlatformUsefulnessDocumentation : chr [1:16716] NA NA NA "Very useful" ...
## $ LearningPlatformUsefulnessCourses : chr [1:16716] NA NA "Very useful" "Very useful" ...
## $ LearningPlatformUsefulnessProjects : chr [1:16716] NA NA NA "Very useful" ...
## $ LearningPlatformUsefulnessPodcasts : chr [1:16716] "Very useful" NA NA NA ...
## $ LearningPlatformUsefulnessSO : chr [1:16716] NA NA NA NA ...
## $ LearningPlatformUsefulnessTextbook : chr [1:16716] NA NA NA NA ...
## $ LearningPlatformUsefulnessTradeBook : chr [1:16716] "Somewhat useful" NA NA NA ...
## $ LearningPlatformUsefulnessTutoring : chr [1:16716] NA NA NA NA ...
## $ LearningPlatformUsefulnessYouTube : chr [1:16716] NA NA "Very useful" NA ...
## $ CurrentJobTitleSelect : chr [1:16716] "DBA/Database Engineer" NA NA "Operations Research Practitioner" ...
## $ MLMethodNextYearSelect : chr [1:16716] "Random Forests" "Random Forests" "Deep learning" "Neural Nets" ...
## $ WorkChallengeFrequencyPolitics : chr [1:16716] "Rarely" NA NA "Often" ...
## $ WorkChallengeFrequencyUnusedResults : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyUnusefulInstrumenting: chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyDeployment : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyDirtyData : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyExplaining : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyPass : chr [1:16716] NA NA NA NA ...
## $ WorkChallengeFrequencyIntegration : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyTalent : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyDataFunds : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyDomainExpertise : chr [1:16716] NA NA NA "Most of the time" ...
## $ WorkChallengeFrequencyML : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyTools : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyExpectations : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyITCoordination : chr [1:16716] NA NA NA NA ...
## $ WorkChallengeFrequencyHiringFunds : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyPrivacy : chr [1:16716] "Often" NA NA "Often" ...
## $ WorkChallengeFrequencyScaling : chr [1:16716] "Most of the time" NA NA "Often" ...
## $ WorkChallengeFrequencyEnvironments : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyClarity : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyDataAccess : chr [1:16716] NA NA NA "Often" ...
## $ WorkChallengeFrequencyOtherSelect : chr [1:16716] NA NA NA NA ...
## $ WorkInternalVsExternalTools : chr [1:16716] "Do not know" NA NA "Entirely internal" ...
## $ FormalEducation : chr [1:16716] "Bachelor's degree" "Master's degree" "Master's degree" "Master's degree" ...
## $ Age : int [1:16716] NA 30 28 56 38 46 35 22 43 33 ...
## $ DataScienceIdentitySelect : chr [1:16716] "Yes" "Yes" "Yes" "Yes" ...
## $ JobSatisfaction : chr [1:16716] "5" NA NA "10 - Highly Satisfied" ...
## - attr(*, ".internal.selfref")=<externalptr>
We have columns of various data types. The goal of this post is to work with categorical features (i.e. columns of type factor). To check whether a particular variable is a factor, we simply pass the column to is.factor(), which returns a boolean.
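As a quick illustration of that check, here is a minimal sketch on a made-up data frame. The `survey` object and its values are hypothetical stand-ins, not part of the Kaggle data.

```r
# Toy data standing in for a survey column (hypothetical values)
survey <- data.frame(
  job_title = c("Data Scientist", "Engineer", "Data Scientist"),
  stringsAsFactors = FALSE   # keep the column as character, as fread() does
)

# A character column is not a factor
is_fct_before <- is.factor(survey$job_title)

# After an explicit conversion it is
is_fct_after <- is.factor(as.factor(survey$job_title))

is_fct_before
is_fct_after
```

The same call works on a tibble column, for example via `$` or `pull()`.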
For our data we get FALSE, which means we will need to convert the column to a factor.
You are probably aware of mutate() and summarise() in dplyr. Here we look at slightly modified versions of these: mutate_if(), summarise_if(), mutate_all() and summarise_all(). mutate_if() and summarise_if() apply their second argument, a function, to all columns for which the first argument, a predicate, returns TRUE. mutate_all() and summarise_all() take one argument, a function, and apply it to all columns. Let’s convert all the character columns to factors (is.character is the predicate here, as.factor the function)
responses_as_factors <- kaggle_sur %>%
  mutate_if(is.character, as.factor)
str(responses_as_factors)
## tibble [16,716 × 47] (S3: tbl_df/tbl/data.frame)
## $ LearningPlatformUsefulnessArxiv : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA 3 NA 3 NA 2 NA NA 2 ...
## $ LearningPlatformUsefulnessBlogs : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA 3 NA NA 2 NA 3 2 ...
## $ LearningPlatformUsefulnessCollege : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA 2 3 NA NA NA 3 NA NA ...
## $ LearningPlatformUsefulnessCompany : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA NA NA NA 3 NA ...
## $ LearningPlatformUsefulnessConferences : Factor w/ 3 levels "Not Useful","Somewhat useful",..: 3 NA NA 3 2 NA NA NA 3 2 ...
## $ LearningPlatformUsefulnessFriends : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA 3 NA NA NA NA 3 NA ...
## $ LearningPlatformUsefulnessKaggle : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA 2 2 NA 2 3 2 3 3 2 ...
## $ LearningPlatformUsefulnessNewsletters : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA NA NA NA 3 NA ...
## $ LearningPlatformUsefulnessCommunities : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA NA NA NA 3 2 ...
## $ LearningPlatformUsefulnessDocumentation : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA 3 NA NA NA NA NA 3 ...
## $ LearningPlatformUsefulnessCourses : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA 3 3 NA 3 NA 3 3 3 ...
## $ LearningPlatformUsefulnessProjects : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA 3 NA NA 2 NA NA 3 ...
## $ LearningPlatformUsefulnessPodcasts : Factor w/ 3 levels "Not Useful","Somewhat useful",..: 3 NA NA NA NA NA NA NA NA 2 ...
## $ LearningPlatformUsefulnessSO : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA 3 NA 3 NA 2 ...
## $ LearningPlatformUsefulnessTextbook : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA 2 3 3 NA NA 3 ...
## $ LearningPlatformUsefulnessTradeBook : Factor w/ 3 levels "Not Useful","Somewhat useful",..: 2 NA NA NA NA NA NA NA NA NA ...
## $ LearningPlatformUsefulnessTutoring : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA NA NA NA NA NA ...
## $ LearningPlatformUsefulnessYouTube : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA 3 NA NA NA NA 3 3 2 ...
## $ CurrentJobTitleSelect : Factor w/ 16 levels "Business Analyst",..: 6 NA NA 9 2 5 2 15 1 15 ...
## $ MLMethodNextYearSelect : Factor w/ 25 levels "Anomaly Detection",..: 17 17 6 14 23 9 23 6 11 6 ...
## $ WorkChallengeFrequencyPolitics : Factor w/ 4 levels "Most of the time",..: 3 NA NA 2 2 NA NA NA NA NA ...
## $ WorkChallengeFrequencyUnusedResults : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 4 NA NA NA NA 4 ...
## $ WorkChallengeFrequencyUnusefulInstrumenting: Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
## $ WorkChallengeFrequencyDeployment : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
## $ WorkChallengeFrequencyDirtyData : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA 1 NA NA 2 2 ...
## $ WorkChallengeFrequencyExplaining : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA 2 ...
## $ WorkChallengeFrequencyPass : Factor w/ 4 levels "Most of the time",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WorkChallengeFrequencyIntegration : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
## $ WorkChallengeFrequencyTalent : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 4 NA NA NA NA 2 ...
## $ WorkChallengeFrequencyDataFunds : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 4 NA NA NA NA NA ...
## $ WorkChallengeFrequencyDomainExpertise : Factor w/ 4 levels "Most of the time",..: NA NA NA 1 4 NA NA NA NA 4 ...
## $ WorkChallengeFrequencyML : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
## $ WorkChallengeFrequencyTools : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA 3 ...
## $ WorkChallengeFrequencyExpectations : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
## $ WorkChallengeFrequencyITCoordination : Factor w/ 4 levels "Most of the time",..: NA NA NA NA 4 NA NA NA NA NA ...
## $ WorkChallengeFrequencyHiringFunds : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA 4 ...
## $ WorkChallengeFrequencyPrivacy : Factor w/ 4 levels "Most of the time",..: 2 NA NA 2 1 NA NA NA NA 3 ...
## $ WorkChallengeFrequencyScaling : Factor w/ 4 levels "Most of the time",..: 1 NA NA 2 NA NA NA NA NA 3 ...
## $ WorkChallengeFrequencyEnvironments : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 4 NA NA NA NA NA ...
## $ WorkChallengeFrequencyClarity : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
## $ WorkChallengeFrequencyDataAccess : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
## $ WorkChallengeFrequencyOtherSelect : Factor w/ 4 levels "Most of the time",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WorkInternalVsExternalTools : Factor w/ 6 levels "Approximately half internal and half external",..: 2 NA NA 4 1 6 4 NA 4 2 ...
## $ FormalEducation : Factor w/ 7 levels "Bachelor's degree",..: 1 5 5 5 2 2 5 1 1 1 ...
## $ Age : int [1:16716] NA 30 28 56 38 46 35 22 43 33 ...
## $ DataScienceIdentitySelect : Factor w/ 3 levels "No","Sort of (Explain more)",..: 3 3 3 3 1 NA 1 1 1 2 ...
## $ JobSatisfaction : Factor w/ 11 levels "1 - Highly Dissatisfied",..: 6 NA NA 2 3 9 9 NA 8 8 ...
## - attr(*, ".internal.selfref")=<externalptr>
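To see the conversion on something small enough to read at a glance, here is a sketch on a toy tibble (the `toy` data is made up). It also shows the across() spelling, which assumes dplyr 1.0.0 or later and should give the same result as mutate_if().

```r
library(dplyr)

# Hypothetical miniature of the survey: two character columns, one integer
toy <- tibble(
  usefulness = c("Very useful", "Not Useful", NA),
  frequency  = c("Often", "Rarely", "Often"),
  age        = c(30L, 28L, 56L)
)

# Scoped verb, as used in this post: only columns passing is.character() change
toy_if <- toy %>% mutate_if(is.character, as.factor)

# Equivalent spelling with across(); assumes dplyr >= 1.0.0
toy_across <- toy %>% mutate(across(where(is.character), as.factor))

# Column classes after conversion: age is left untouched
sapply(toy_if, class)
```

Either form leaves non-character columns such as `age` alone, which is why Age is still an int in the str() output above.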
Now we can find the number of levels in each column with the following code
number_of_levels <- responses_as_factors %>%
  summarise_all(nlevels) %>%
  gather(variable, num_levels) # just to format the data from wide to long
number_of_levels
## # A tibble: 47 x 2
## variable num_levels
## <chr> <int>
## 1 LearningPlatformUsefulnessArxiv 3
## 2 LearningPlatformUsefulnessBlogs 3
## 3 LearningPlatformUsefulnessCollege 3
## 4 LearningPlatformUsefulnessCompany 3
## 5 LearningPlatformUsefulnessConferences 3
## 6 LearningPlatformUsefulnessFriends 3
## 7 LearningPlatformUsefulnessKaggle 3
## 8 LearningPlatformUsefulnessNewsletters 3
## 9 LearningPlatformUsefulnessCommunities 3
## 10 LearningPlatformUsefulnessDocumentation 3
## # … with 37 more rows
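The same counting step can be sketched on toy data. The tibble below is made up, and the reshaping uses pivot_longer(), the newer tidyr counterpart of gather() (assuming tidyr 1.0.0 or later). Note that nlevels() returns 0 for a column that is not a factor.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two factor columns and one integer column
toy <- tibble(
  usefulness = factor(c("Very useful", "Not Useful", "Very useful")),
  frequency  = factor(c("Often", "Rarely", "Sometimes")),
  age        = c(30L, 28L, 56L)   # not a factor, so nlevels() gives 0
)

level_counts <- toy %>%
  summarise_all(nlevels) %>%                      # one row, one count per column
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "num_levels")          # wide to long

level_counts
```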
We can also look at the 3 rows with the highest number of levels
number_of_levels %>% top_n(3, num_levels)
## # A tibble: 3 x 2
## variable num_levels
## <chr> <int>
## 1 CurrentJobTitleSelect 16
## 2 MLMethodNextYearSelect 25
## 3 JobSatisfaction 11
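Notice that top_n() keeps the selected rows in their original order rather than sorting them, which is why 16 appears before 25 above. A small sketch on made-up counts shows the difference from slice_max(), which assumes dplyr 1.0.0 or later and returns the rows sorted from highest to lowest.

```r
library(dplyr)

# Hypothetical level counts
level_counts <- tibble(
  variable   = c("a", "b", "c", "d"),
  num_levels = c(16L, 3L, 25L, 11L)
)

# top_n() filters to the 3 largest values but preserves row order
top3 <- level_counts %>% top_n(3, num_levels)

# slice_max() picks the same rows and orders them by num_levels, descending
top3_sorted <- level_counts %>% slice_max(num_levels, n = 3)

top3$variable
top3_sorted$variable
```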
If we want to look at the number of levels of a specific variable, we can do
number_of_levels %>%
  filter(variable == "CurrentJobTitleSelect") %>%
  pull(num_levels) # extracts the value of that column/variable
## [1] 16
Or we can even see what levels there are in this column
responses_as_factors %>%
  pull(CurrentJobTitleSelect) %>%
  levels()
## [1] "Business Analyst"
## [2] "Computer Scientist"
## [3] "Data Analyst"
## [4] "Data Miner"
## [5] "Data Scientist"
## [6] "DBA/Database Engineer"
## [7] "Engineer"
## [8] "Machine Learning Engineer"
## [9] "Operations Research Practitioner"
## [10] "Other"
## [11] "Predictive Modeler"
## [12] "Programmer"
## [13] "Researcher"
## [14] "Scientist/Researcher"
## [15] "Software Developer/Software Engineer"
## [16] "Statistician"
And we can plot these levels as a bar plot since the column is categorical
ggplot(kaggle_sur, aes(CurrentJobTitleSelect)) +
  geom_bar() +
  coord_flip() +
  xlab("Current Job Title")
Note how the plot is unordered, which makes it hard to compare the levels. We can order the bars by frequency using fct_infreq() from the forcats package. Just an FYI: functions that start with fct_ come from the forcats package
ggplot(kaggle_sur, aes(fct_rev(fct_infreq(CurrentJobTitleSelect)))) +
  geom_bar() +
  coord_flip() +
  xlab("Current Job Title")
Here I’ve also used fct_rev() to reverse the order of the bars, so from top to bottom the plot goes from highest count to lowest.
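To see what these two helpers do to the levels without drawing a plot, here is a minimal sketch on a made-up vector of job titles. The data is hypothetical; only the ordering behaviour matters.

```r
library(forcats)

# Hypothetical titles: Engineer x3, Statistician x2, Data Scientist x1
titles <- factor(c("Engineer", "Statistician", "Engineer",
                   "Data Scientist", "Engineer", "Statistician"))

default_order <- levels(titles)                        # alphabetical
by_frequency  <- levels(fct_infreq(titles))            # most frequent first
reversed      <- levels(fct_rev(fct_infreq(titles)))   # least frequent first

default_order
by_frequency
reversed
```

With coord_flip() the first level is drawn at the bottom, so the reversed order is the one that puts the most frequent title at the top.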
We can also look at the mean age of the people with each of these job titles. Here we make use of fct_reorder() to order one variable based on another
kaggle_sur %>%
  filter(!is.na(Age) & !is.na(CurrentJobTitleSelect)) %>% # don't include NA's
  group_by(CurrentJobTitleSelect) %>%
  summarise(mean_age = mean(Age)) %>%
  mutate(CurrentJobTitleSelect =
           fct_reorder(CurrentJobTitleSelect, mean_age)) %>% # reorder job title based on mean age column
  ggplot(aes(x = CurrentJobTitleSelect, y = mean_age)) +
  geom_point() +
  coord_flip()
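The reordering itself can be checked on a tiny example. The job titles and mean ages below are made up; the point is only that fct_reorder() sorts the levels of its first argument by the values of its second, in ascending order by default.

```r
library(forcats)

# Hypothetical summary table: one mean age per job title
job      <- factor(c("Engineer", "Statistician", "Data Scientist"))
mean_age <- c(30.5, 35.2, 41.0)   # made-up values

before <- levels(job)                         # alphabetical
after  <- levels(fct_reorder(job, mean_age))  # ordered by mean_age, ascending

before
after
```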
Manipulating factor variables
recode(): changes a set of values in a column to something else. Remember that when recoding numeric values, you need to put the old value in backticks. For the .default option, see Working with Data in the Tidyverse, and also parse_number().
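A short sketch of both points, using dplyr's recode() on made-up vectors (the values and replacement labels are hypothetical). Newer dplyr versions mark recode() as superseded, but it still works as shown.

```r
library(dplyr)

# Numeric input: the old values go in backticks on the left-hand side
ratings <- c(1, 2, 3, 2)
rating_labels <- recode(ratings, `1` = "Low", `2` = "Medium", `3` = "High")

# Character input with .default: anything not listed falls into "other"
answers <- c("Yes", "No", "Sort of (Explain more)")
answer_labels <- recode(answers, "Yes" = "yes", "No" = "no", .default = "other")

rating_labels
answer_labels
```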
Creating factor variables
Case study
Expression | Does this |
---|---|
. | matches any character |
* | zero or more times |
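Put together, the two expressions in the table form `.*`, which matches any run of characters. Here is a minimal sketch with stringr on made-up question strings modelled on the flying etiquette column names. The particular pattern `".*rude to "` is an illustration chosen here, not necessarily the one used in the case study.

```r
library(stringr)

# Hypothetical question texts in the style of the survey columns
questions <- c("Is it rude to recline your seat?",
               "Is itrude to bring a baby?",
               "How tall are you?")

# Which questions mention "rude"?
has_rude <- str_detect(questions, "rude")

# ".*rude to " matches everything up to and including "rude to ", then removes it
shortened <- str_remove(questions, ".*rude to ")

has_rude
shortened
```

Strings that do not match the pattern are returned unchanged by str_remove().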