31 Factors
We often need a character variable to represent categorical data, that is, data whose values come from a fixed, known and finite set of categories. Sometimes these categories must also be sorted in a specific order, which may not be alphabetical. Categorical data of this kind plays an important role in data analytics.
To deal with such categorical data, R has a special class, factor (readers may remember that we met this data type briefly in section 1.5.1). In data analysis tasks we often need to create and use factors, so let us discuss them in some detail in this chapter.
31.1 Factors in base R
Factors in R are objects built on top of the atomic data type integer. We have two primary functions to create (and coerce) factors from character vectors.
- factor()
- as.factor()
Of these, factor() offers full customisation, as it is the basic function for creating factor objects in base R. Let us discuss both.
31.1.1 Creating factors from scratch
Let us create a factor from a character vector whose allowed values come from shirt_sizes.
shirt_sizes <- c("S", "M", "L", "XL", "XXL")
Let us say, we have 10 shirts from these sizes randomly.
# 10 shirts with following sizes
# Notice one shirt with a lowercase s and one with size XXXL
shirts <- c("S", "s", "L", "XL", "XXXL", "S", "M", "M", "L", "L")
# Let us create a factor of shirt_sizes
shirt_f <- factor(shirts, levels = shirt_sizes)
shirt_f
## [1] S <NA> L XL <NA> S M M L L
## Levels: S M L XL XXL
In the output above, values that are not among the allowed values (the levels) have silently been replaced by NA.
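Since factor() gives no warning about values outside the levels, it can help to look for them before converting. The following is a minimal sketch using base R only, with the same two vectors as above; the names unexpected and n_lost are introduced here purely for illustration.

```r
shirt_sizes <- c("S", "M", "L", "XL", "XXL")
shirts <- c("S", "s", "L", "XL", "XXXL", "S", "M", "M", "L", "L")

# Distinct values present in the data but absent from the allowed levels
unexpected <- setdiff(shirts, shirt_sizes)
unexpected

# Number of entries that factor() would turn into NA
n_lost <- sum(!shirts %in% shirt_sizes)
n_lost
```

If unexpected is not empty, the offending values can be corrected (or the levels extended) before calling factor().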
Now check its type and class.
typeof(shirt_f)
## [1] "integer"
class(shirt_f)
## [1] "factor"
We can see that the factor class is built on the integer type underneath; the printed output shows the labels, which by default are the same as the levels.
unclass(shirt_f)
## [1] 1 NA 3 4 NA 1 2 2 3 3
## attr(,"levels")
## [1] "S" "M" "L" "XL" "XXL"
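The integer codes have a practical consequence that is worth a small sketch: converting a factor of number-like strings with as.integer() returns the codes, not the numbers. The vector f below is made up for illustration.

```r
# A factor whose levels happen to look like numbers
f <- factor(c("10", "20", "10", "30"))

# as.integer() returns the position of each value in levels(f)
codes <- as.integer(f)

# To recover the numbers, go through the character representation
values <- as.numeric(as.character(f))

# The codes can always be mapped back to labels through levels()
labels_back <- levels(f)[codes]
```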
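We can, however, display labels that differ from the input values by passing a vector to the labels argument. The levels argument describes the values expected in the input, while labels supplies the names that the resulting factor shows (and stores) for them. Let us now create another factor with labels different from the levels.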
set.seed(123)
# 100 Shirts
shirts2 <- sample(shirt_sizes, 100, replace = TRUE)
shirts2_f <- factor(shirts2, levels = shirt_sizes, labels = c("Small", "Medium", "Large", "Extra Large", "Extra Extra large"))
# Let's view some shirts
head(shirts2_f)
## [1] Large Large Medium Medium
## [5] Large Extra Extra large
## Levels: Small Medium Large Extra Large Extra Extra large
# Check its levels
levels(shirts2_f)
## [1] "Small" "Medium" "Large"
## [4] "Extra Large" "Extra Extra large"
The function summary() gives us the count for each level of the given factor.
summary(shirts2_f)
## Small Medium Large Extra Large
## 21 20 23 17
## Extra Extra large
## 19
31.1.2 Coercing objects to factor
Up to now, we created a factor from a set of allowed values (the levels) and displayed them using meaningful labels. The function is.factor() checks whether a given vector is a factor. The function as.factor(), on the other hand, coerces an existing character vector to a factor, using all the distinct values present in it as levels, sorted alphabetically.
is.factor(iris$Species)
## [1] TRUE
shirts2_coerced <- as.factor(shirts2)
head(shirts2_coerced)
## [1] L L M M L XXL
## Levels: L M S XL XXL
levels(shirts2_coerced)
## [1] "L" "M" "S" "XL" "XXL"
# Check summary too
summary(shirts2_coerced)
## L M S XL XXL
## 23 20 21 17 19
The alphabetical ordering seen above (L, M, S, XL, XXL) did not arise in the factor we created from scratch, because there the levels were taken from a vector that we had sorted meaningfully ourselves, which gives us more control.
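When no hand-written vector of levels is at hand, one common workaround is to pass unique() of the data as levels, which keeps the order of first appearance instead of the alphabetical one. A minimal sketch on a made-up vector sizes:

```r
sizes <- c("M", "S", "L", "M", "XL")

# as.factor() sorts the distinct values alphabetically
alphabetical <- levels(as.factor(sizes))
alphabetical

# factor() with explicit levels keeps the order we supply,
# here the order in which the values first appear
by_appearance <- levels(factor(sizes, levels = unique(sizes)))
by_appearance
```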
31.1.3 Order in factors
To sort a factor's levels in a meaningful way, we can create an ordered factor. An ordered factor can be created either by
- using the ordered = TRUE argument in the function factor(); or by
- coercing a given factor to an ordered factor using the function ordered().
Clearly, the latter method orders the factor according to the levels already present in it.
shirts2_ordered <- factor(
shirts2,
levels = shirt_sizes,
labels = c("Small", "Medium", "Large",
"Extra Large", "Extra Extra large"),
ordered = TRUE
)
Ordering a factor in R has another benefit: we can make comparisons on it. Suppose we want to find how many shirts we have of size “Large” or greater.
# How many shirts are there of sizes L or greater?
sum(shirts2_ordered >= "Large")
## [1] 59
# An unordered factor returns NA (with a warning) instead.
sum(shirts2_f >= "Large")
## [1] NA
We can check whether a given factor is ordered using the function is.ordered(). E.g.
is.ordered(shirts2_ordered)
## [1] TRUE
is.ordered(shirts2_f)
## [1] FALSE
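The second route described in this section, coercing with ordered(), can be sketched as follows. The small vectors here are made up for illustration; note that the existing level order is reused when the input is already a factor, while a plain character vector falls back to alphabetical levels.

```r
shirt_sizes <- c("S", "M", "L", "XL", "XXL")
f <- factor(c("M", "XL", "S", "L", "XXL"), levels = shirt_sizes)

# Coerce an existing factor: the level order S < M < L < XL < XXL is kept
o <- ordered(f)
is.ordered(o)
levels(o)

# Comparisons and max() now work
largest <- as.character(max(o))
n_above_m <- sum(o > "M")

# Coercing a character vector directly sorts the levels alphabetically
o_chr <- ordered(c("M", "XL", "S", "L", "XXL"))
levels(o_chr)
```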
31.1.4 Functions returning factors as output
Readers may remember that in section 4.12.9 we learnt about the function cut(), which returns a factor as output. In the next sections we will learn some functions that are useful when working with factors, as input, as output, or both.
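As a brief reminder of how cut() produces a factor, here is a minimal sketch; the ages, break points and labels are invented for illustration.

```r
ages <- c(4, 17, 35, 62, 80)

# Bin a numeric vector into labelled intervals (right-closed by default)
age_group <- cut(
  ages,
  breaks = c(0, 12, 19, 64, Inf),
  labels = c("child", "teen", "adult", "senior")
)

is.factor(age_group)
levels(age_group)
as.character(age_group)
```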
31.2 Factors in forcats
The package forcats, which is part of the core tidyverse, provides more robust and useful ways to create and work with factors. In forcats, the function fct() creates factors. Unlike factor(), it throws an error if any value is not present in the given levels, which helps catch bugs and typographical errors in the code. E.g.
library(forcats)
months_31 <- c("Jan", "Mar", "May", "Jul", "Aug", "Oxt", "Dec")
fct(months_31, levels = month.abb)
## Error in `fct()`:
## ! All values of `x` must appear in `levels` or `na`
## ℹ Missing level: "Oxt"
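Two ways of dealing with this error can be sketched as below, assuming forcats 1.0.0 or later (where fct() and its na argument are available). Trapping the error with tryCatch() is a general base R technique, and the names msg and months_f are introduced here only for illustration.

```r
library(forcats)

months_31 <- c("Jan", "Mar", "May", "Jul", "Aug", "Oxt", "Dec")

# Option 1: trap the error and inspect its message
msg <- tryCatch(
  {
    fct(months_31, levels = month.abb)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Option 2: explicitly declare the bad value as missing via `na`
months_f <- fct(months_31, levels = month.abb, na = "Oxt")
months_f
```

In practice the better fix is usually to correct the typo in the data; the na argument is meant for values that genuinely stand for "missing".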
31.3 Inspecting Factors
31.3.1 Summarising factors
The summary() method works well to give the counts for each level.
shirts2_fct <- fct(shirts2)
summary(shirts2_fct)
## L M XXL XL S
## 23 20 19 17 21
Like count() in dplyr, forcats has fct_count() to give us level-wise counts and/or proportions. The difference from summary() lies in the output type: fct_count() returns a tibble. E.g.
fct_count(shirts2_fct)
## # A tibble: 5 × 2
## f n
## <fct> <int>
## 1 L 23
## 2 M 20
## 3 XXL 19
## 4 XL 17
## 5 S 21
# Sort in decreasing counts
fct_count(shirts2_fct, sort = TRUE)
## # A tibble: 5 × 2
## f n
## <fct> <int>
## 1 L 23
## 2 S 21
## 3 M 20
## 4 XXL 19
## 5 XL 17
# Include proportions also
fct_count(shirts2_fct, sort = TRUE, prop = TRUE)
## # A tibble: 5 × 3
## f n p
## <fct> <int> <dbl>
## 1 L 23 0.23
## 2 S 21 0.21
## 3 M 20 0.2
## 4 XXL 19 0.19
## 5 XL 17 0.17
31.3.2 Unique levels only
The function fct_unique() from the package returns a factor of the unique values, with duplicates removed. E.g.
fct_unique(shirts2_fct)
## [1] L M XXL XL S
## Levels: L M XXL XL S
31.4 Order in Factors
31.4.1 Default ordering in factors
The levels of a factor created with fct() are ordered as given in the levels argument. If the argument is not supplied, the levels are ordered by first appearance (as against alphabetically in factor()), as we observed in the output of the example above.
To learn more functions from forcats, we will use the gss_cat data frame, which is part of the forcats package itself (GSS stands for General Social Survey). It contains many factor variables. For some other use cases, we will also use the economics_long data, which is part of the ggplot2 package.
31.4.2 Reordering factors
If we analyse the mean number of hours per day spent watching TV across different religions in gss_cat, we get the chart in Figure 31.1.
library(tidyverse)
gss_cat |>
summarise(tv = mean(tvhours, na.rm = TRUE), .by = relig) |>
ggplot(aes(relig, tv)) +
geom_col() +
coord_flip()

Figure 31.1: Factors with default order
In the above case (Figure 31.1), the relig factor would be more meaningful if its levels were sorted by the summarised values of another numerical variable in the data. The function fct_reorder() is helpful in such scenarios.
fct_reorder(
.f,
.x,
.fun = median,
...,
.na_rm = NULL,
.default = Inf,
.desc = FALSE
)
Where -
- .f is the factor variable to be sorted.
- .x is the numerical variable on the basis of which .f is to be sorted.
- .fun is the optional function (default is median) used to summarise .x when there are multiple values of .x for any level of .f.
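The effect of these arguments is easiest to see on a tiny example. The sketch below uses made-up vectors grp and val; the medians per group are 11 for a, 2 for b and 6 for c.

```r
library(forcats)

grp <- factor(c("a", "a", "b", "b", "c", "c"))
val <- c(10, 12, 1, 3, 5, 7)

# Default: levels sorted by increasing median of val
by_median <- levels(fct_reorder(grp, val))
by_median

# Same summary, but in decreasing order
by_median_desc <- levels(fct_reorder(grp, val, .desc = TRUE))
by_median_desc

# A different summary function can be supplied through .fun
by_max <- levels(fct_reorder(grp, val, .fun = max))
by_max
```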
So, in the above example, we can reorder the levels using this function. See Figure 31.2 -
gss_cat |>
summarise(tv = mean(tvhours, na.rm = TRUE), .by = relig) |>
ggplot(aes(fct_reorder(relig, tv), tv)) +
geom_col() +
coord_flip() +
labs(x = "Religion")

Figure 31.2: Factors with reordered levels
This function is also useful for sorting box plots. As an example, compare Figure 31.3 with Figure 31.4.
economics_long |>
ggplot(aes(x = value01, y = variable)) +
geom_boxplot() +
ggtitle("Unsorted Boxes")

Figure 31.3: Unsorted boxes with unordered levels
economics_long |>
mutate(variable = fct_reorder(variable, value01)) |>
ggplot(aes(x = value01, y = variable)) +
geom_boxplot() +
ggtitle("Boxes sorted on Median")

Figure 31.4: Boxes sorted on Median reordering factor levels
31.4.3 Reordering factors with two other variables
Sometimes, a factor needs to be sorted on the basis of the first (or last) values of two other variables. In such cases fct_reorder2() is useful. Compared to fct_reorder(), it takes an extra argument .y and has the following syntax:
fct_reorder2(
.f,
.x,
.y,
.fun = last2,
...,
.na_rm = NULL,
.default = -Inf,
.desc = TRUE
)
Here the default function is last2(), which means that the levels of .f are sorted on the basis of the last values of .y when plotted against .x, as in grouped line charts. E.g. see Figure 31.5.
library(patchwork)
library(conflicted)
conflicts_prefer(dplyr::filter)
default <- economics_long |>
filter(date < dmy("31122003"), date >= dmy("01011995")) |>
ggplot(aes(date, value01)) +
geom_line(aes(group = variable, color = variable)) +
ggtitle("Default legend")
aligned <- economics_long |>
filter(date < dmy("31122003"), date >= dmy("01011995")) |>
mutate(variable = fct_reorder2(variable, date, value01)) |>
ggplot(aes(date, value01)) +
geom_line(aes(group = variable, color = variable)) +
ggtitle("Legend aligned with \nlast values of each line")
default + aligned

Figure 31.5: Factors reordered on two criteria
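What fct_reorder2() does to the levels can also be checked without a plot. In the following sketch the three short series are made up; at the last time point (time = 2) the values are 5 for a, 9 for b and 1 for c, so the default settings should put b first.

```r
library(forcats)

series <- factor(c("a", "a", "b", "b", "c", "c"))
time   <- c(1, 2, 1, 2, 1, 2)
value  <- c(1, 5, 2, 9, 3, 1)

# Levels sorted by the value at the largest time, in decreasing order
reordered <- fct_reorder2(series, time, value)
levels(reordered)
```

This is why the legend in a grouped line chart lines up with the right-hand ends of the lines after reordering.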
31.4.4 Changing the order of a few factor levels only
Sometimes, we may have a factor whose levels are already meaningfully sorted, e.g. the income levels (rincome) in gss_cat. Check the plot in Figure 31.6.
gss_cat |>
ggplot(aes(rincome)) +
geom_bar() +
coord_flip()

Figure 31.6: Default Income levels
The levels of income are already sorted in a meaningful way. However, sometimes we may want to change the position of only a few levels. E.g. Not applicable in Figure 31.6 may be more meaningful if placed next to the other special levels such as No answer. In such cases, we may use fct_relevel(). The function takes a factor and then any number of levels that we want to move; by default they are moved to the front (which is the bottom of a flipped bar chart), and the after argument can place them elsewhere. After rearranging the bars (levels), the plot looks as in Figure 31.7.
gss_cat |>
ggplot(aes(fct_relevel(rincome, "Not applicable"))) +
geom_bar() +
coord_flip() +
labs(x= "Income levels")

Figure 31.7: Modifying levels manually
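The behaviour of fct_relevel() and its after argument can be sketched on a small made-up factor income:

```r
library(forcats)

income <- factor(
  c("low", "high", "unknown", "mid"),
  levels = c("low", "mid", "high", "unknown")
)

# Default: the named level moves to the front
to_front <- levels(fct_relevel(income, "unknown"))
to_front

# after = Inf moves the named level to the end
to_end <- levels(fct_relevel(income, "low", after = Inf))
to_end

# Several levels can be moved at once, in the order given
several <- levels(fct_relevel(income, "high", "mid"))
several
```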
31.4.5 Ordering bar charts in order of frequency
The function fct_infreq() sorts a factor's levels in decreasing order of frequency and can thus be used to sort bar charts (refer to the charts in Figure 31.8).
default <- mpg |>
ggplot(aes(trans)) +
geom_bar()
by_freq <- mpg |>
ggplot(aes(fct_infreq(trans))) +
geom_bar()
default + by_freq

Figure 31.8: Default order and decreasing order by frequency
31.4.6 Reversing the factor levels
Using fct_rev()
we can reverse the order of levels in any factor. E.g. Figure 31.9.

Figure 31.9: Reversing levels
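A minimal sketch of fct_rev() on a made-up factor trans is given below. Combining it with fct_infreq() is one way to obtain levels in increasing order of frequency.

```r
library(forcats)

trans <- factor(c("auto", "manual", "auto", "cvt", "auto", "manual"))

# Default (alphabetical) levels and their reverse
original <- levels(trans)
reversed <- levels(fct_rev(trans))

# Decreasing frequency, then reversed: increasing frequency
increasing_freq <- levels(fct_rev(fct_infreq(trans)))
increasing_freq
```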
31.4.7 Other reordering
There are two more functions which can be used to reorder factor levels -
- fct_inorder(): by the order in which they first appear.
- fct_inseq(): by the numeric value of the level.
Readers may explore these functions by themselves.
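As a starting point for that exploration, here is a small sketch on made-up factors f and g (fct_inseq() is assumed to be available, which is the case in recent versions of forcats).

```r
library(forcats)

# fct_inorder(): levels follow the order of first appearance
f <- factor(c("b", "a", "c", "a"))
in_order <- levels(fct_inorder(f))
in_order

# fct_inseq(): levels that look like numbers are sorted numerically
g <- factor(c("10", "2", "1", "2"))
as_text <- levels(g)               # sorted as strings
in_seq  <- levels(fct_inseq(g))    # sorted as numbers
in_seq
```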
31.4.8 More on ordering factors
In section 31.1.3 we saw how an unordered factor can be turned into an ordered factor. This ordering has a side effect when plotting with ggplot2: ordered factors are mapped to the viridis colour scale (scale_color_viridis_d()) by default, whereas unordered factors are not. See the following example (notice how the colour scale changes from Figure 31.10 to Figure 31.11).
mtcars %>%
mutate(cyl = ordered(factor(cyl))) %>%
ggplot(aes(wt, mpg)) +
geom_point(size = 5, aes(color = cyl)) +
ggtitle("Ordered Factor")

Figure 31.10: Use of color scale in ordered and unordered factors
mtcars %>%
mutate(cyl = factor(cyl)) %>%
ggplot(aes(wt, mpg)) +
geom_point(size = 5, aes(color = cyl)) +
ggtitle("Unordered Factor")

Figure 31.11: Use of color scale in ordered and unordered factors
31.5 Levels in Factors
31.5.1 Modifying factor levels by applying a function
The function fct_relabel() in forcats applies a function .fun to each level of the factor .f supplied to it. E.g. change the case of each variable name in economics_long (Figure 31.12).
economics_long |>
mutate(variable = fct_relabel(variable, str_to_upper)) |>
ggplot(aes(x = value01, y = variable)) +
geom_boxplot()

Figure 31.12: Applying a function to all labels
31.5.2 Modifying factor levels manually
Using the function fct_recode() we can change the levels of a given factor manually. We provide the new levels as a sequence of named character vectors, where the name gives the new level and the value gives the old level. Levels not mentioned are left as is. Levels can be removed by naming them NULL. See the example:
x <- factor(c("apple", "bear", "banana", "dear"))
fct_recode(x, fruit = "apple", fruit = "banana")
## [1] fruit bear fruit dear
## Levels: fruit bear dear
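Removing a level by naming it NULL, as mentioned above, can be sketched on the same toy factor; the values that belonged to the removed level become NA.

```r
library(forcats)

x <- factor(c("apple", "bear", "banana", "dear"))

# Merge two levels into "fruit" and drop the level "bear"
y <- fct_recode(x, fruit = "apple", fruit = "banana", NULL = "bear")
y
levels(y)
```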
To collapse multiple levels into one (lumping), we can use its cousin fct_collapse(). Example:
x <- factor(c("apple", "bear", "banana", "dear"))
fct_collapse(x, fruit = c("apple", "banana"))
## [1] fruit bear fruit dear
## Levels: fruit bear dear
31.5.3 Lump uncommon factor levels into other
The package forcats provides a family of four functions for lumping together the levels that meet some given criterion. These are
- fct_lump_min(): lumps levels that appear fewer than min times.
- fct_lump_prop(): lumps levels that appear fewer than (or equal to) prop * n times.
- fct_lump_n(): lumps all levels except for the n most frequent (or least frequent if n < 0).
- fct_lump_lowfreq(): lumps together the least frequent levels, ensuring that "other" is still the smallest level.
All of these functions, apart from taking a factor f as argument, also take one or more of the following arguments, as applicable -
- n: Positive n preserves the most common n values. Negative n preserves the least common -n values.
- prop: Positive prop lumps values which do not appear at least prop of the time. Negative prop lumps values that do not appear at most -prop of the time.
- min: Preserve levels that appear at least min number of times.
- w: An optional numeric vector giving weights for frequency of each value (not level) in f.
- other_level: Value of level used for "other" values (default "Other"). Always placed at end of levels.
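The arguments above can be tried out on a small made-up factor before moving to real data. In the sketch below f has counts a = 5, b = 3, c = 1 and d = 1; the names f, g and counts are illustrative only.

```r
library(forcats)

f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep only the single most frequent level
top1 <- levels(fct_lump_n(f, n = 1))

# Keep levels that appear at least 3 times
min3 <- levels(fct_lump_min(f, min = 3))

# Keep levels that make up more than 20% of the values
prop20 <- levels(fct_lump_prop(f, prop = 0.2))

# A custom name for the lumped level
custom <- levels(fct_lump_n(f, n = 1, other_level = "rest"))

# Pre-counted data: one row per level, counts supplied through w
g <- factor(c("a", "b", "c", "d"))
counts <- c(5, 3, 1, 1)
weighted <- levels(fct_lump_n(g, n = 2, w = counts))
```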
Some examples -
gss_cat %>%
count(relig)
## # A tibble: 15 × 2
## relig n
## <fct> <int>
## 1 No answer 93
## 2 Don't know 15
## 3 Inter-nondenominational 109
## 4 Native american 23
## 5 Christian 689
## 6 Orthodox-christian 95
## 7 Moslem/islam 104
## 8 Other eastern 32
## 9 Hinduism 71
## 10 Buddhism 147
## 11 Other 224
## 12 None 3523
## 13 Jewish 388
## 14 Catholic 5124
## 15 Protestant 10846
# Let's restrict these to 5 religions only
gss_cat %>%
ggplot(aes(fct_lump_n(relig, n = 5))) +
geom_bar()

Figure 31.13: Lumping Factors
In Figure 31.13 we can see five religions plus an "other" category placed last. We may also notice that the bars are not sorted yet.
We may also make use of the w argument if we already have the factor levels and their counts in another vector. See Figure 31.14 -
gss_cat %>%
count(relig) %>%
# Making use of `w` argument
mutate(relig = fct_lump_n(relig, n = 5, w = n)) %>%
ggplot(aes(relig, n)) +
geom_bar(stat = "identity")

Figure 31.14: Lumping Factors by making use of w argument
To sort the levels in decreasing order of frequency we can use the fct_infreq() function, which we learnt in section 31.4.5. It also takes an optional w argument for the case where the factor levels have already been counted. So, to sort the levels in Figure 31.14, we may use this function one step before lumping. See Figure 31.15.
gss_cat %>%
count(relig) %>%
# Sorting making use of `w` argument
mutate(relig = fct_infreq(relig, w = n),
relig = fct_lump_n(relig, n = 5, w = n)) %>%
ggplot(aes(relig, n)) +
geom_bar(stat = "identity")

Figure 31.15: Sorting and Lumping Factors