Author Archives: Erik

Partial AUC support added to fbroc 0.4.0

Support for the partial AUC was added to fbroc in version 0.4.0, which has just been accepted on CRAN.

Official changelog

fbroc 0.4.0

New features

  • Partial AUCs over both TPR and FPR ranges can be calculated
  • You can now adjust text size for plots
  • In the ROC plot the (partial) AUC can now optionally be shown instead of confidence regions

Other Changes

  • The location of the text showing the performance in the ROC plot has been shifted downwards and
    to the left

Changes in detail

There are only two changes worth mentioning in fbroc 0.4.0. The first is an option to adjust the text size of the performance details printed on the ROC plot. This change was motivated by the text sometimes being too wide for the graph – I observed this effect on my mobile phone.

A more important addition is support for the partial AUC, which integrates the ROC curve over a specific FPR or TPR interval. The usual McClish correction for the partial AUC is applied by default. I will cover it and the partial AUC in more detail in a later post.
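To make the concept concrete, here is a minimal R sketch (not fbroc's internal code) that computes a partial AUC over an FPR interval with the trapezoid rule and then applies the McClish correction. The function name, interface and example curve are made up for illustration.

# A minimal sketch (not fbroc's internal code): partial AUC over an FPR
# interval via the trapezoid rule, followed by the McClish correction.
partial.auc.mcclish <- function(fpr, tpr, fpr.range = c(0, 0.2)) {
  # order the curve by FPR and clip it to the requested interval
  ord <- order(fpr)
  fpr <- fpr[ord]
  tpr <- tpr[ord]
  grid <- sort(unique(c(fpr, fpr.range)))
  grid <- grid[grid >= fpr.range[1] & grid <= fpr.range[2]]
  tpr.grid <- approx(fpr, tpr, xout = grid, ties = max)$y
  # trapezoid rule on the clipped curve
  pauc <- sum(diff(grid) * (head(tpr.grid, -1) + tail(tpr.grid, -1)) / 2)
  # McClish correction: rescale between the chance line and the maximum area
  pauc.min <- diff(fpr.range^2) / 2  # area under the diagonal on this interval
  pauc.max <- diff(fpr.range)        # area of the rectangle over this interval
  0.5 * (1 + (pauc - pauc.min) / (pauc.max - pauc.min))
}

# example with a crude empirical ROC curve
partial.auc.mcclish(fpr = c(0, 0.1, 0.3, 1), tpr = c(0, 0.6, 0.8, 1))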

Plotting the partial AUC area was also a bit of a challenge, since it overlaps with the confidence region around the ROC curve. Therefore, when fbroc shows the confidence region, the partial AUC region is only marked by a pair of dotted lines. By setting

show.conf = FALSE 

in the plotting call when the metric being shown is the partial AUC, the relevant area is shown instead.
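As a rough sketch of what such a call looks like: show.conf is the only argument taken from the text above; the input objects are placeholders, and the arguments that select the partial AUC as the displayed metric are omitted here because their names may differ from the actual interface.

# hypothetical sketch -- only show.conf is taken from the text above;
# predictions (numeric scores) and true.class (logical labels) are placeholders,
# and the arguments selecting the partial AUC as plotted metric are omitted
library(fbroc)
result <- boot.roc(predictions, true.class)  # bootstrap the ROC curve
plot(result, show.conf = FALSE)              # hide the confidence region so the
                                             # (partial) AUC area is shaded instead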

ROC curves showing uncertainty or partial AUC area

Left: show.conf = TRUE. Right: show.conf = FALSE


As a minor bonus this now also works for the normal AUC.

Implementation

The C++ code for calculating the partial AUC was somewhat tricky, as it needed to work for integration over both FPR and TPR intervals. As an example, take a look at the function used to integrate over a TPR interval by calculating the area contributed by the part of the ROC curve between the (i-1)-th and the i-th cutoff.

double pauc_tpr_area(NumericVector &tpr, NumericVector &fpr, 
                     NumericVector &param, int index)
{
  // necessary check to avoid division by zero later
  if (tpr[index - 1] == tpr[index]) return 0; 
  // cases where the relevant TPR interval is not included
  if (tpr[index - 1] < param[0]) return 0;
  if (tpr[index] > param[1]) return 0;
  
  // clip the segment [tpr[index], tpr[index - 1]] to the requested TPR interval
  double left = std::max(tpr[index], param[0]);
  double right = std::min(tpr[index - 1], param[1]);
  
  // (1 - FPR) as a linear function of TPR on this segment
  double base_val = 1 - fpr[index];
  double slope = (fpr[index] - fpr[index - 1]) / 
                 (tpr[index - 1] - tpr[index]);

  // values of (1 - FPR) at both ends of the clipped segment
  double value_left = base_val + (left - tpr[index]) * slope;
  double value_right = base_val + (right - tpr[index]) * slope;
  
  // width of the slice times the sum of the values at its ends
  return (right - left) * (value_left + value_right);
}

The first check excludes a case in which the contribution to the partial AUC is zero anyway, because we are looking at a line instead of an area. It is still needed, though. The following code segment

double left = std::max(tpr[index], param[0]);
double right = std::min(tpr[index - 1], param[1]);

clips the contributing area to a slice of the full trapezoid between the TPRs of the (i-1)-th and i-th cutoff, for the case that the TPR interval used for the partial AUC does not fully encompass it. In the degenerate case excluded above, left and right coincide, so their difference would cancel out the product used for the trapezoid rule in this line

return (right - left) * (value_left + value_right);

but the calculation of the slope for the trapezoid rule, shown below, would divide by zero, and the resulting non-finite value would propagate as NaN.

double slope = (fpr[index] - fpr[index - 1]) / 
                 (tpr[index - 1] - tpr[index]);
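The failure mode is plain IEEE 754 floating-point arithmetic, which R reproduces as well: a zero-width slice multiplied by an infinite slope is not a number.

1 / (0.7 - 0.7)       # infinite slope when the two TPR values are equal
[1] Inf
0 * (1 / (0.7 - 0.7)) # zero-width slice times infinite slope
[1] NaN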

What’s next

After releasing fbroc 0.4.0, I will update the shiny interface before working more on the package itself. In particular, I will create a shiny interface for the analysis of paired ROC curves – something I had originally planned to do before releasing fbroc 0.4.0. As it turned out, I decided to update the package first, since I find that more enjoyable.

Technical troubles resolved

It seems the site is stable again. I never found out the cause, but upgrading the entire Linux server to a new version of Ubuntu and setting up the blog again seems to have done the trick.

Technical troubles ongoing

So far I have not been able to resolve the issues. Until I do, I will disable most plugins and revert to the default theme. This means that some content will definitely be broken for now.

Technical troubles

During the last half year I have had trouble keeping this page up, as the site stops responding after a while. The rest of the server, including the shiny apps, works fine. Rebooting fixed the issue only for a short time.

I am trying to get this page back online reliably, but I have not tracked down the issue yet. So far I have tried disabling some old plugins. I hope this will fix the issue, but we will have to see.

On hiatus

As you might have noticed there haven’t been any updates lately, and this situation will probably not change for another month. I was busy both at work and at home, due to one of the kids having been fairly seriously ill.

Things are looking better now, but I want to work on the fbroc shiny interface before taking time to post things here.

Shiny interface for fbroc updated

I am happy to announce an updated shiny interface for my R package fbroc. Before extending the interface to include the new features of fbroc, I wanted to modernize it first. Fortunately, there is an excellent package for creating dashboard interfaces with shiny: shinydashboard. I had to rewrite parts of the shiny interface, but it was very much worth it.

The most difficult part was getting the graphs and boxes to scale correctly. First I had trouble keeping the correct aspect ratio, and then the boxes surrounding the graphs did not scale properly with graph size, so for some window sizes part of the graph ended up outside its box. Since this took some effort to fix, I will describe the problem and the solution in a separate blog post later. Maybe it will help someone else.

Comparison of old and new shiny interface

The TPR at an FPR of 0.03 depends on how many of the outliers are included, making the confidence interval very wide.

Old interface with fbroc 0.2.1

Updated shiny interface for fbroc, using package shinydashboard

If you are interested in the code, you can find it on the GitHub page, as always. To test the interface, go here instead.

Outlook

As mentioned, next up is another update of the interface to support the main new feature of fbroc: comparing two classifiers that try to solve the same prediction task on the same data. Even with very different classification algorithms, there is usually still a significant correlation between the predictions of the two models; often the two subsets of samples misclassified by the classifiers overlap heavily. To compare the models correctly with bootstrap methods in this case, it is critical to keep this correlation intact, as the sketch below illustrates. After the first release of fbroc, this feature was my highest priority.
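Here is a minimal sketch of a paired bootstrap in plain R, to show what keeping the correlation intact means. This is not fbroc's implementation; the function names and the rank-based AUC shortcut are just for illustration. The key point is that a single vector of resampled indices is applied to both classifiers.

# minimal sketch of a paired bootstrap (not fbroc's actual implementation)
# pred1, pred2: predictions of the two classifiers on the *same* samples
# true.class: logical vector of true class labels
paired.boot.auc.diff <- function(pred1, pred2, true.class, n.boot = 1000) {
  # rank-based (Mann-Whitney) AUC of a single classifier
  simple.auc <- function(pred, labels) {
    r <- rank(pred)
    n.pos <- sum(labels)
    n.neg <- sum(!labels)
    (sum(r[labels]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
  }
  replicate(n.boot, {
    # one index vector applied to both classifiers keeps the pairing intact
    # (a real implementation would resample positives and negatives separately)
    idx <- sample(seq_along(true.class), replace = TRUE)
    simple.auc(pred1[idx], true.class[idx]) - simple.auc(pred2[idx], true.class[idx])
  })
}

Resampling the indices independently for each classifier would destroy exactly this correlation and distort the bootstrap distribution of the difference in AUC.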

Dangers of implicit type conversion in R

As you might be aware, R usually converts your input variables implicitly into the expected type whenever necessary. For example, paste expects characters, and therefore

paste("example", 1)

works by implicit type conversion of numeric to character, and you do not need to use

paste("example", as.character(1))

instead. Usually this is very convenient, but there are at least two situations I have observed in which this implicit type conversion can cause major bugs.

Implicit conversion of factors to integers

The first has to do with factors and is pretty well known. If you use a factor as an index, the factor is converted to integer. This is an example of implicit type conversion, since you do not have to tell R to do it and you are not even warned that R converted your type. In some cases your factor levels correspond to the names of what you are indexing, and you might expect R to index by matching factor levels to names.

factor.var <- as.factor(c("A", "B", "C")) # define factor
num.var <- 1:3 # numeric variable
names(num.var) <- c("C", "B", "A") # names match levels of factor
as.integer(factor.var) # explicit conversion to integer
[1] 1 2 3
# implicit type conversion of factor.var to integer
num.var[factor.var] 
C B A
1 2 3
# factor.var is explicitly converted to character
num.var[as.character(factor.var)]
A B C
3 2 1

I think most people working with R have stumbled over this at least once. I know I did. There is also a chance that the factor levels happen to be in the right order for the code to work, so you might get away with it at first.

Number character comparisons

Somewhat less well known is what happens if you compare a number with a string. Look at

0.01 < "0.05"
[1] TRUE

Looks fine, right? But now consider

0.0000001 < "0.05"
[1] FALSE

What went wrong? R cannot always convert a character to a numeric, so in this case it does the “safe” operation of converting the number to a character instead.

as.character(0.01)
[1] "0.01"
as.character(0.0000001)
[1] "1e-07"

The second number is small enough to be converted to scientific notation. Based on the documented rules for comparing strings,

"1e-07" < "0.05"
[1] FALSE

is the correct and expected result. This one is especially nasty because it depends on a global R option: how many digits to accept before switching to scientific notation.

options(scipen=10)
0.0000001 < "0.05"
[1] TRUE
as.character(0.0000001)
"0.0000001"

This means that if you write code like this and put it in a package, the result will depend upon the settings of the user. Bugs like these tend to be very hard to track down.

What do we learn from this? Always be careful when mixing types in R. Implicit conversion is very convenient, but it can also be dangerous. Use explicit casting whenever you do non-standard things with your variables to avoid nasty surprises, and try to keep in mind what class your variables actually have! For example, people new to R often do not expect that text columns from data tables (e.g. csv or tsv) are converted to factors by default when reading them into R, as shown below.
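A quick guard against that last pitfall (the file name is just a placeholder; at the time of writing, read.csv converted character columns to factors by default):

d <- read.csv("my_data.csv")                            # text columns become factors
str(d)                                                  # check which classes you actually got
d <- read.csv("my_data.csv", stringsAsFactors = FALSE)  # keep text columns as character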

Further reading

The R Inferno has more on this and other common pitfalls when working with R. Don't miss reading at least the free .pdf version. If I run into other examples, I will write about them in the future as well.

Back from autumn vacation

We are back from our autumn vacation in the Lüneburger Heide. Fortunately, we had nice autumn weather, so the kids could have fun. Aside from a rather short hiking trip to the Grundloses Moor, we spent some time in the nice garden and visited several other interesting locations in the area.

The kids & me at the Grundloses Moor in the Lüneburger Heide

Among them was the Weltvogelpark Walsrode, which was such a great success with the children that we went there twice.

Kids feeding lories in the Weltvogelpark Walsrode

On the day we drove back, we first spent some time in the Heidepark, which was also appreciated by all. Only my poor son was a little unhappy at times that many of the rides were not open to kids younger than six. I had originally planned to do some work on this site and on the shiny interface for fbroc when I came back, but I decided to wait a week, since the rest of the family had a bit more spare time due to the school holidays. Now that those are over, I should have some time next week.

One week break

Due to vacation and other work-related things, I will not be active during the next week, either on this site or on developing fbroc. When I come back, the next thing to do is to upgrade the shiny app; in addition to supporting the new features in fbroc 0.3, I want to check out the shinydashboard package.