Dangers of implicit type conversion in R

November 8, 2015

As you might be aware, R usually converts your input variable implicitly to the expected type whenever necessary. For example, paste expects characters and therefore

paste("example", 1)

works by implicit type conversion of numeric to character, and you do not need to use

paste("example", as.character(1))

instead. Usually, this is very convenient. But I have observed at least two ways in which this implicit type conversion can cause major bugs.

Implicit conversion of factors to integers

The first way has to do with factors and is pretty well known. If you use a factor as an index, the factor is converted to integer, that is, to its internal level codes. This is an example of implicit type conversion: you do not have to tell R to do it, and you are not even warned that R converted the type. In some cases, your factor levels correspond to the names of what you are indexing, and you would expect R to index by matching factor level to name. It does not; it indexes by position.

factor.var <- as.factor(c("A", "B", "C")) # define factor
num.var <- 1:3 # numeric variable
names(num.var) <- c("C", "B", "A") # names match levels of factor
as.integer(factor.var) # explicit conversion to integer
[1] 1 2 3
# implicit type conversion of factor.var to integer
num.var[factor.var] 
C B A
1 2 3
# factor.var is explicitly converted to character
num.var[as.character(factor.var)]
A B C
3 2 1

I think most people working with R have stumbled over this at least once. I know I did. There is also a chance that the factor levels happen to be in the right order for the code to work, so you might get away with it at first.
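One way to guard against this is to route name-based indexing through a small wrapper that converts factors to character first. The following is only a sketch: index_by_label is a hypothetical helper name, not a base R function, and it assumes the object being indexed has names that match the factor levels.

```r
# Hypothetical helper: index a named vector by label, never by factor code.
index_by_label <- function(x, idx) {
  if (is.factor(idx)) {
    idx <- as.character(idx)  # use the level labels, not the integer codes
  }
  x[idx]
}

factor.var <- as.factor(c("A", "B", "C"))
num.var <- 1:3
names(num.var) <- c("C", "B", "A")

result <- index_by_label(num.var, factor.var)
print(result)
# A B C
# 3 2 1
```

The wrapper only makes the explicit as.character call hard to forget; indexing with a character vector works exactly as before.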

Number character comparisons

Somewhat less well known is what happens if you compare a number with a string. Look at

0.01 < "0.05"
[1] TRUE

Looks fine, right? But now consider

0.0000001 < "0.05"
[1] FALSE

What went wrong? R cannot always convert a character to a numeric, so in this case it does the “safe” operation of converting the number to character instead, and then compares the two strings.

as.character(0.01)
[1] "0.01"
as.character(0.0000001)
[1] "1e-07"

The second number is small enough to be converted to scientific notation. And based on the documented rules of comparing strings

"1e-07" < "0.05"
[1] FALSE

is the correct and expected result. This one is especially nasty because it depends on a global R option, scipen, which influences when numbers are written in fixed rather than scientific notation.

options(scipen=10)
0.0000001 < "0.05"
[1] TRUE
as.character(0.0000001)
[1] "0.0000001"

This means that if you write code like this and put it in a package, the result will depend upon the settings of the user. Bugs like these tend to be very hard to track down.
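A defensive alternative is to convert the string to a number yourself before comparing, so that the comparison is always numeric and never falls back to string comparison. This is a minimal sketch: less_than is a hypothetical helper name, and it assumes that a threshold which cannot be parsed as a number should be treated as an error.

```r
# Hypothetical helper: numeric comparison against a threshold that may be text.
less_than <- function(x, threshold) {
  # Explicit conversion; text that cannot be parsed becomes NA
  threshold_num <- suppressWarnings(as.numeric(threshold))
  if (anyNA(threshold_num)) {
    stop("threshold cannot be converted to numeric")
  }
  x < threshold_num
}

small <- less_than(0.0000001, "0.05")
print(small)
# [1] TRUE
```

Because both sides are numeric when they are compared, the result does not depend on how R would format the number as text.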

What do we learn from this? Always be careful when mixing types in R. It is very convenient, but can also be dangerous. Use explicit casting whenever you do non-standard things with your variables to avoid nasty surprises. Also try to keep in mind what class your variables actually have! For example, people new to R often do not expect that text columns from data tables (e.g. csv or tsv) are converted to factors by default when reading them into R (this was the default behaviour before R 4.0.0).
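To avoid depending on the default, you can state the choice explicitly when reading a table and then check the column classes. The sketch below uses a small made-up table passed through the text argument of read.csv; the column names gene and pvalue are purely illustrative.

```r
# Illustrative data; in practice this would be a file path
csv_text <- "gene,pvalue\nA,0.01\nB,0.2"

# Say explicitly that text columns should stay character
tab <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Check what class each column actually has
col_classes <- sapply(tab, class)
print(col_classes)
#        gene      pvalue
# "character"   "numeric"
```

Setting stringsAsFactors explicitly makes the code behave the same regardless of which R version runs it.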

Further reading

The R-Inferno has more on this and other common pitfalls when working with R. Don’t miss reading at least the free .pdf version. If I run into other examples, I will write about them in the future as well.
