This blog post looks at how to work with non-numerical data, such as job titles, survey responses, or demographic information. R represents such data with a special type called a factor, and in this post we will look at how to work with factors using the tidyverse package forcats.

In addition to forcats, we will also use other tidyverse packages, including ggplot2, dplyr, stringr, and tidyr, along with two sample datasets: the fivethirtyeight flying etiquette dataset and Kaggle’s State of Data Science and ML Survey.

The libraries we will be using in this blog are
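The original list of library calls is not shown here. Judging from the functions used later in the post (fread(), as_tibble(), the pipe, and the fct_ helpers), a setup along the following lines should work. Loading data.table is an assumption based on the fread() calls below.

```r
# Assumed setup, inferred from the functions used later in the post
library(data.table)  # fread() for reading the CSV files
library(dplyr)       # mutate_if(), summarise_all(), %>%, as_tibble()
library(tidyr)       # reshaping wide results into long form
library(ggplot2)     # bar plots
library(stringr)     # string helpers
library(forcats)     # factor helpers, all prefixed with fct_
```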

Intro to factor variables

Let’s check out the datasets before we start our analysis:

flying_etiquette <- fread("~/Desktop/R_tutorials/data/flying-etiquette.csv") %>% 
    as_tibble()
str(flying_etiquette)
## tibble [1,040 × 27] (S3: tbl_df/tbl/data.frame)
##  $ RespondentID                                                                                                                            :integer64 [1:1040] 3436139758 3434278696 3434275578 3434268208 3434250245 3434245875 3434235351 3434218031 ... 
##  $ How often do you travel by plane?                                                                                                       : chr [1:1040] "Once a year or less" "Once a year or less" "Once a year or less" "Once a year or less" ...
##  $ Do you ever recline your seat when you fly?                                                                                             : chr [1:1040] "" "About half the time" "Usually" "Always" ...
##  $ How tall are you?                                                                                                                       : chr [1:1040] "" "6'3\"\"" "5'8\"\"" "5'11\"\"" ...
##  $ Do you have any children under 18?                                                                                                      : chr [1:1040] "" "Yes" "No" "No" ...
##  $ In a row of three seats, who should get to use the two arm rests?                                                                       : chr [1:1040] "" "The arm rests should be shared" "Whoever puts their arm on the arm rest first" "The arm rests should be shared" ...
##  $ In a row of two seats, who should get to use the middle arm rest?                                                                       : chr [1:1040] "" "The arm rests should be shared" "The arm rests should be shared" "The arm rests should be shared" ...
##  $ Who should have control over the window shade?                                                                                          : chr [1:1040] "" "Everyone in the row should have some say" "The person in the window seat should have exclusive control" "Everyone in the row should have some say" ...
##  $ Is itrude to move to an unsold seat on a plane?                                                                                         : chr [1:1040] "" "No, not rude at all" "No, not rude at all" "No, not rude at all" ...
##  $ Generally speaking, is it rude to say more than a few words tothe stranger sitting next to you on a plane?                              : chr [1:1040] "" "No, not at all rude" "No, not at all rude" "No, not at all rude" ...
##  $ On a 6 hour flight from NYC to LA, how many times is it acceptable to get up if you're not in an aisle seat?                            : chr [1:1040] "" "Twice" "Three times" "Three times" ...
##  $ Under normal circumstances, does a person who reclines their seat during a flight have any obligation to the person sitting behind them?: chr [1:1040] "" "Yes, they should not recline their chair if the person behind them asks them not to" "Yes, they should not recline their chair if the person behind them asks them not to" "No, the person on the flight has no obligation to the person behind them" ...
##  $ Is itrude to recline your seat on a plane?                                                                                              : chr [1:1040] "" "Yes, somewhat rude" "No, not rude at all" "No, not rude at all" ...
##  $ Given the opportunity, would you eliminate the possibility of reclining seats on planes entirely?                                       : chr [1:1040] "" "No" "No" "No" ...
##  $ Is it rude to ask someone to switch seats with you in order to be closer to friends?                                                    : chr [1:1040] "" "No, not at all rude" "No, not at all rude" "Yes, somewhat rude" ...
##  $ Is itrude to ask someone to switch seats with you in order to be closer to family?                                                      : chr [1:1040] "" "No, not at all rude" "No, not at all rude" "No, not at all rude" ...
##  $ Is it rude to wake a passenger up if you are trying to go to the bathroom?                                                              : chr [1:1040] "" "No, not at all rude" "No, not at all rude" "No, not at all rude" ...
##  $ Is itrude to wake a passenger up if you are trying to walk around?                                                                      : chr [1:1040] "" "No, not at all rude" "Yes, somewhat rude" "Yes, somewhat rude" ...
##  $ In general, is itrude to bring a baby on a plane?                                                                                       : chr [1:1040] "" "No, not at all rude" "Yes, somewhat rude" "Yes, somewhat rude" ...
##  $ In general, is it rude to knowingly bring unruly children on a plane?                                                                   : chr [1:1040] "" "No, not at all rude" "Yes, very rude" "Yes, very rude" ...
##  $ Have you ever used personal electronics during take off or landing in violation of a flight attendant's direction?                      : chr [1:1040] "" "No" "No" "No" ...
##  $ Have you ever smoked a cigarette in an airplane bathroom when it was against the rules?                                                 : chr [1:1040] "" "No" "No" "No" ...
##  $ Gender                                                                                                                                  : chr [1:1040] "" "Male" "Male" "Male" ...
##  $ Age                                                                                                                                     : chr [1:1040] "" "30-44" "30-44" "30-44" ...
##  $ Household Income                                                                                                                        : chr [1:1040] "" "" "$100,000 - $149,999" "$0 - $24,999" ...
##  $ Education                                                                                                                               : chr [1:1040] "" "Graduate degree" "Bachelor degree" "Bachelor degree" ...
##  $ Location (Census Region)                                                                                                                : chr [1:1040] "" "Pacific" "Pacific" "Pacific" ...
##  - attr(*, ".internal.selfref")=<externalptr>
kaggle_sur <- fread("~/Desktop/R_tutorials/data/kaggle_sur.csv") %>%
    as_tibble()
str(kaggle_sur)
## tibble [16,716 × 47] (S3: tbl_df/tbl/data.frame)
##  $ LearningPlatformUsefulnessArxiv            : chr [1:16716] NA NA "Very useful" NA ...
##  $ LearningPlatformUsefulnessBlogs            : chr [1:16716] NA NA NA "Very useful" ...
##  $ LearningPlatformUsefulnessCollege          : chr [1:16716] NA NA "Somewhat useful" "Very useful" ...
##  $ LearningPlatformUsefulnessCompany          : chr [1:16716] NA NA NA NA ...
##  $ LearningPlatformUsefulnessConferences      : chr [1:16716] "Very useful" NA NA "Very useful" ...
##  $ LearningPlatformUsefulnessFriends          : chr [1:16716] NA NA NA "Very useful" ...
##  $ LearningPlatformUsefulnessKaggle           : chr [1:16716] NA "Somewhat useful" "Somewhat useful" NA ...
##  $ LearningPlatformUsefulnessNewsletters      : chr [1:16716] NA NA NA NA ...
##  $ LearningPlatformUsefulnessCommunities      : chr [1:16716] NA NA NA NA ...
##  $ LearningPlatformUsefulnessDocumentation    : chr [1:16716] NA NA NA "Very useful" ...
##  $ LearningPlatformUsefulnessCourses          : chr [1:16716] NA NA "Very useful" "Very useful" ...
##  $ LearningPlatformUsefulnessProjects         : chr [1:16716] NA NA NA "Very useful" ...
##  $ LearningPlatformUsefulnessPodcasts         : chr [1:16716] "Very useful" NA NA NA ...
##  $ LearningPlatformUsefulnessSO               : chr [1:16716] NA NA NA NA ...
##  $ LearningPlatformUsefulnessTextbook         : chr [1:16716] NA NA NA NA ...
##  $ LearningPlatformUsefulnessTradeBook        : chr [1:16716] "Somewhat useful" NA NA NA ...
##  $ LearningPlatformUsefulnessTutoring         : chr [1:16716] NA NA NA NA ...
##  $ LearningPlatformUsefulnessYouTube          : chr [1:16716] NA NA "Very useful" NA ...
##  $ CurrentJobTitleSelect                      : chr [1:16716] "DBA/Database Engineer" NA NA "Operations Research Practitioner" ...
##  $ MLMethodNextYearSelect                     : chr [1:16716] "Random Forests" "Random Forests" "Deep learning" "Neural Nets" ...
##  $ WorkChallengeFrequencyPolitics             : chr [1:16716] "Rarely" NA NA "Often" ...
##  $ WorkChallengeFrequencyUnusedResults        : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyUnusefulInstrumenting: chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyDeployment           : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyDirtyData            : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyExplaining           : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyPass                 : chr [1:16716] NA NA NA NA ...
##  $ WorkChallengeFrequencyIntegration          : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyTalent               : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyDataFunds            : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyDomainExpertise      : chr [1:16716] NA NA NA "Most of the time" ...
##  $ WorkChallengeFrequencyML                   : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyTools                : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyExpectations         : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyITCoordination       : chr [1:16716] NA NA NA NA ...
##  $ WorkChallengeFrequencyHiringFunds          : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyPrivacy              : chr [1:16716] "Often" NA NA "Often" ...
##  $ WorkChallengeFrequencyScaling              : chr [1:16716] "Most of the time" NA NA "Often" ...
##  $ WorkChallengeFrequencyEnvironments         : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyClarity              : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyDataAccess           : chr [1:16716] NA NA NA "Often" ...
##  $ WorkChallengeFrequencyOtherSelect          : chr [1:16716] NA NA NA NA ...
##  $ WorkInternalVsExternalTools                : chr [1:16716] "Do not know" NA NA "Entirely internal" ...
##  $ FormalEducation                            : chr [1:16716] "Bachelor's degree" "Master's degree" "Master's degree" "Master's degree" ...
##  $ Age                                        : int [1:16716] NA 30 28 56 38 46 35 22 43 33 ...
##  $ DataScienceIdentitySelect                  : chr [1:16716] "Yes" "Yes" "Yes" "Yes" ...
##  $ JobSatisfaction                            : chr [1:16716] "5" NA NA "10 - Highly Satisfied" ...
##  - attr(*, ".internal.selfref")=<externalptr>

We have columns of various data types. The goal of this post is to work with categorical features (i.e., columns of type factor). To check whether a particular variable is a factor, we pass the column (the data and the variable, as in data$variable) to is.factor(), which returns a single logical value.
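As a minimal sketch of that check, here is is.factor() applied to a small made-up data frame. The data frame survey and its column job_title are hypothetical stand-ins, not the survey data itself.

```r
# Hypothetical stand-in for one survey column, stored as plain character
survey <- data.frame(
  job_title = c("Data Scientist", "Business Analyst", "Data Scientist"),
  stringsAsFactors = FALSE
)

# is.factor() returns a single logical value for the column
job_is_factor <- is.factor(survey$job_title)
print(job_is_factor)
```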

We get FALSE, which means we need to convert this column to a factor.

You are probably aware of mutate() and summarise() in dplyr. Here we look at slightly modified versions of these: mutate_if(), summarise_if(), mutate_all(), and summarise_all(). mutate_if() and summarise_if() apply their second argument, a function, to every column for which the first argument, a predicate function, returns TRUE. mutate_all() and summarise_all() take one argument, a function, and apply it to all columns. Let’s convert all the character columns (the ones for which is.character() returns TRUE) to factors:

responses_as_factors <- kaggle_sur %>%
    mutate_if(is.character, as.factor)
str(responses_as_factors)
## tibble [16,716 × 47] (S3: tbl_df/tbl/data.frame)
##  $ LearningPlatformUsefulnessArxiv            : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA 3 NA 3 NA 2 NA NA 2 ...
##  $ LearningPlatformUsefulnessBlogs            : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA 3 NA NA 2 NA 3 2 ...
##  $ LearningPlatformUsefulnessCollege          : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA 2 3 NA NA NA 3 NA NA ...
##  $ LearningPlatformUsefulnessCompany          : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA NA NA NA 3 NA ...
##  $ LearningPlatformUsefulnessConferences      : Factor w/ 3 levels "Not Useful","Somewhat useful",..: 3 NA NA 3 2 NA NA NA 3 2 ...
##  $ LearningPlatformUsefulnessFriends          : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA 3 NA NA NA NA 3 NA ...
##  $ LearningPlatformUsefulnessKaggle           : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA 2 2 NA 2 3 2 3 3 2 ...
##  $ LearningPlatformUsefulnessNewsletters      : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA NA NA NA 3 NA ...
##  $ LearningPlatformUsefulnessCommunities      : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA NA NA NA 3 2 ...
##  $ LearningPlatformUsefulnessDocumentation    : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA 3 NA NA NA NA NA 3 ...
##  $ LearningPlatformUsefulnessCourses          : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA 3 3 NA 3 NA 3 3 3 ...
##  $ LearningPlatformUsefulnessProjects         : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA 3 NA NA 2 NA NA 3 ...
##  $ LearningPlatformUsefulnessPodcasts         : Factor w/ 3 levels "Not Useful","Somewhat useful",..: 3 NA NA NA NA NA NA NA NA 2 ...
##  $ LearningPlatformUsefulnessSO               : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA 3 NA 3 NA 2 ...
##  $ LearningPlatformUsefulnessTextbook         : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA 2 3 3 NA NA 3 ...
##  $ LearningPlatformUsefulnessTradeBook        : Factor w/ 3 levels "Not Useful","Somewhat useful",..: 2 NA NA NA NA NA NA NA NA NA ...
##  $ LearningPlatformUsefulnessTutoring         : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LearningPlatformUsefulnessYouTube          : Factor w/ 3 levels "Not Useful","Somewhat useful",..: NA NA 3 NA NA NA NA 3 3 2 ...
##  $ CurrentJobTitleSelect                      : Factor w/ 16 levels "Business Analyst",..: 6 NA NA 9 2 5 2 15 1 15 ...
##  $ MLMethodNextYearSelect                     : Factor w/ 25 levels "Anomaly Detection",..: 17 17 6 14 23 9 23 6 11 6 ...
##  $ WorkChallengeFrequencyPolitics             : Factor w/ 4 levels "Most of the time",..: 3 NA NA 2 2 NA NA NA NA NA ...
##  $ WorkChallengeFrequencyUnusedResults        : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 4 NA NA NA NA 4 ...
##  $ WorkChallengeFrequencyUnusefulInstrumenting: Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
##  $ WorkChallengeFrequencyDeployment           : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
##  $ WorkChallengeFrequencyDirtyData            : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA 1 NA NA 2 2 ...
##  $ WorkChallengeFrequencyExplaining           : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA 2 ...
##  $ WorkChallengeFrequencyPass                 : Factor w/ 4 levels "Most of the time",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WorkChallengeFrequencyIntegration          : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
##  $ WorkChallengeFrequencyTalent               : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 4 NA NA NA NA 2 ...
##  $ WorkChallengeFrequencyDataFunds            : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 4 NA NA NA NA NA ...
##  $ WorkChallengeFrequencyDomainExpertise      : Factor w/ 4 levels "Most of the time",..: NA NA NA 1 4 NA NA NA NA 4 ...
##  $ WorkChallengeFrequencyML                   : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
##  $ WorkChallengeFrequencyTools                : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA 3 ...
##  $ WorkChallengeFrequencyExpectations         : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
##  $ WorkChallengeFrequencyITCoordination       : Factor w/ 4 levels "Most of the time",..: NA NA NA NA 4 NA NA NA NA NA ...
##  $ WorkChallengeFrequencyHiringFunds          : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA 4 ...
##  $ WorkChallengeFrequencyPrivacy              : Factor w/ 4 levels "Most of the time",..: 2 NA NA 2 1 NA NA NA NA 3 ...
##  $ WorkChallengeFrequencyScaling              : Factor w/ 4 levels "Most of the time",..: 1 NA NA 2 NA NA NA NA NA 3 ...
##  $ WorkChallengeFrequencyEnvironments         : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 4 NA NA NA NA NA ...
##  $ WorkChallengeFrequencyClarity              : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
##  $ WorkChallengeFrequencyDataAccess           : Factor w/ 4 levels "Most of the time",..: NA NA NA 2 NA NA NA NA NA NA ...
##  $ WorkChallengeFrequencyOtherSelect          : Factor w/ 4 levels "Most of the time",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WorkInternalVsExternalTools                : Factor w/ 6 levels "Approximately half internal and half external",..: 2 NA NA 4 1 6 4 NA 4 2 ...
##  $ FormalEducation                            : Factor w/ 7 levels "Bachelor's degree",..: 1 5 5 5 2 2 5 1 1 1 ...
##  $ Age                                        : int [1:16716] NA 30 28 56 38 46 35 22 43 33 ...
##  $ DataScienceIdentitySelect                  : Factor w/ 3 levels "No","Sort of (Explain more)",..: 3 3 3 3 1 NA 1 1 1 2 ...
##  $ JobSatisfaction                            : Factor w/ 11 levels "1 - Highly Dissatisfied",..: 6 NA NA 2 3 9 9 NA 8 8 ...
##  - attr(*, ".internal.selfref")=<externalptr>
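In dplyr 1.0.0 and later, the scoped verbs such as mutate_if() still work but are superseded by across(). A rough equivalent of the conversion above, shown on a hypothetical two-column tibble rather than the survey data, might look like this:

```r
library(dplyr)

# Hypothetical stand-in for the survey data
toy <- tibble(
  title = c("Analyst", "Scientist", "Analyst"),
  age = c(31L, 45L, 28L)
)

# Convert only the character columns; other columns are left untouched
toy_factors <- toy %>%
  mutate(across(where(is.character), as.factor))

str(toy_factors)
```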

Now we can find the number of levels in each column:
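One way to do this, sketched on a hypothetical three-column tibble of factors, is to apply nlevels() to every column with summarise_all() and then reshape the single wide row into one row per variable. The names toy_factors, variable, and num_levels are illustrative choices.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the survey, already converted to factors
toy_factors <- tibble(
  job = factor(c("Analyst", "Engineer", "Scientist", "Analyst")),
  usefulness = factor(c("Very useful", "Not Useful", "Very useful", "Very useful")),
  identity = factor(c("Yes", "No", "Yes", "Sort of"))
)

# Count the levels of each column, then go from wide to long
number_of_levels <- toy_factors %>%
  summarise_all(nlevels) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "num_levels")

number_of_levels
```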

We can also look at the 3 rows with the highest number of levels:
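A possible version of this step uses slice_max() (dplyr 1.0.0 or later; the older top_n(3, num_levels) behaves similarly). The table below is typed in by hand from the level counts visible in the str() output above, so it only covers five of the columns.

```r
library(dplyr)

# Level counts copied by hand from the str() output for five columns
number_of_levels <- tibble(
  variable = c("CurrentJobTitleSelect", "MLMethodNextYearSelect",
               "WorkInternalVsExternalTools", "FormalEducation",
               "JobSatisfaction"),
  num_levels = c(16, 25, 6, 7, 11)
)

# Keep the three rows with the largest num_levels
top_three <- number_of_levels %>%
  slice_max(num_levels, n = 3)

top_three
```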

If we want to look at the number of levels of a specific variable, we can filter for it:
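A sketch of that lookup, assuming a long table of level counts with the illustrative column names variable and num_levels (the two counts are taken from the str() output above):

```r
library(dplyr)

# Level counts copied by hand from the str() output for two columns
number_of_levels <- tibble(
  variable = c("CurrentJobTitleSelect", "FormalEducation"),
  num_levels = c(16, 7)
)

# filter() keeps the row of interest, pull() extracts the value as a vector
job_title_levels <- number_of_levels %>%
  filter(variable == "CurrentJobTitleSelect") %>%
  pull(num_levels)

job_title_levels
```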

Or we can see which levels this column contains:
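For illustration, pulling a factor column out of a tibble and passing it to levels() lists the distinct categories. The tibble toy and its four values are made up; by default the levels come back in alphabetical order and NA is not a level.

```r
library(dplyr)

# Hypothetical stand-in for the job title column
toy <- tibble(
  CurrentJobTitleSelect = factor(
    c("Data Scientist", "Business Analyst", "Data Scientist", NA)
  )
)

# pull() extracts the column as a vector, levels() lists its categories
job_titles <- toy %>%
  pull(CurrentJobTitleSelect) %>%
  levels()

job_titles
```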

Since the column is categorical, we can plot the number of responses in each level as a bar plot:
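A minimal sketch of such a plot on made-up data: geom_bar() counts the rows in each level, and coord_flip() turns the bars sideways so that long labels stay readable. The data frame toy is hypothetical.

```r
library(ggplot2)

# Hypothetical job titles standing in for the survey column
toy <- data.frame(
  job = factor(c("Analyst", "Scientist", "Scientist",
                 "Engineer", "Scientist", "Engineer"))
)

# One bar per level, with bar length equal to the number of rows
job_plot <- ggplot(toy, aes(x = job)) +
  geom_bar() +
  coord_flip()

job_plot
```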

Note how the plot is unordered, which makes the levels hard to compare. We can order the bars by frequency using fct_infreq() from the forcats package. As an aside, anything that starts with fct_ comes from the forcats package.
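To make the ordering concrete, here is a sketch on a made-up factor. fct_infreq() sorts the levels from most to least frequent, and fct_rev() reverses that order; with coord_flip() the first level is drawn at the bottom, so the reversed order puts the most frequent level at the top.

```r
library(forcats)
library(ggplot2)

# Hypothetical job titles: Scientist x3, Engineer x2, Analyst x1
toy <- data.frame(
  job = factor(c("Analyst", "Scientist", "Scientist",
                 "Engineer", "Scientist", "Engineer"))
)

by_freq <- levels(fct_infreq(toy$job))            # most frequent first
reversed <- levels(fct_rev(fct_infreq(toy$job)))  # least frequent first

# Ordered bar plot: most frequent level ends up at the top after the flip
ordered_plot <- ggplot(toy, aes(x = fct_rev(fct_infreq(job)))) +
  geom_bar() +
  coord_flip() +
  labs(x = "job")
```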

Here I’ve also used fct_rev() to reverse the order of the bars, so that from top to bottom the plot goes from the highest count to the lowest.

We can also look at the mean age of the people with each of these job titles. Here we make use of fct_reorder() to order one variable based on another.
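A sketch of this idea on made-up data: compute the mean age per job title, then use fct_reorder() so the job levels follow the mean ages from lowest to highest. The tibble toy and the summary column mean_age are hypothetical names.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical respondents
toy <- tibble(
  job = c("Analyst", "Analyst", "Scientist", "Scientist", "Engineer"),
  age = c(30, 34, 41, 45, 26)
)

# Mean age per job, with the job levels ordered by that mean
age_by_job <- toy %>%
  filter(!is.na(job), !is.na(age)) %>%
  group_by(job) %>%
  summarise(mean_age = mean(age)) %>%
  mutate(job = fct_reorder(job, mean_age))

# Points appear in order of mean age instead of alphabetically
age_plot <- ggplot(age_by_job, aes(x = job, y = mean_age)) +
  geom_point() +
  coord_flip()
```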

Manipulating factor variables

recode() changes a set of values in a column to something else. Remember that when recoding numeric values, you need to put the old value in backticks. The .default argument sets the replacement for every value that is not matched explicitly. See also Working with Data in the Tidyverse and parse_number().
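As a small illustration of the backtick rule and .default, here is dplyr's recode() on a made-up numeric vector. The labels "low", "medium", and "high" are arbitrary. Recent dplyr releases mark recode() as superseded, but it still works as shown.

```r
library(dplyr)

# Hypothetical satisfaction scores
scores <- c(1, 5, 10, 5, 7)

# Numeric old values go in backticks; anything unmatched gets .default
labels <- recode(scores, `1` = "low", `10` = "high", .default = "medium")

labels
```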

Creating factor variables

Case study

Expression    Does this
.             matches any character
*             matches the preceding expression zero or more times
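Combined, .* matches any run of characters. As a sketch of how this might be used on the flying etiquette questions, the pattern below removes everything up to and including "rude to " from one of the column names shown earlier. The choice of str_remove() and of this particular pattern is an assumption for illustration.

```r
library(stringr)

# One of the question columns from the flying etiquette data
question <- "Is it rude to wake a passenger up if you are trying to go to the bathroom?"

# ".*" greedily matches any characters, so ".*rude to " covers the prefix
short_question <- str_remove(question, ".*rude to ")

short_question
```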