Monthly Archives: May 2015

ROC curves and ties

While working on the next version of fbroc, I found a bug that caused fbroc to produce incorrect results when the numerical predictor variable contains ties, that is, when not every value is unique. I have already fixed this bug in the development version, so if you have ties in your predictor, I recommend trying the development version instead of fbroc 0.1.0. Instructions for the installation can be found here.

More interestingly, the bug led me to consider some conceptual difficulties with the construction of ROC curves when ties are present.

Constructing the ROC curve

1. Calculating TPR and FPR for all cutoffs present in the data

Let us first focus on the combinations of TPR and FPR that are actually achieved for the given data. Remember that if we have n positive samples, the TPR at any cutoff must be a multiple of 1/n. Note that the following code snippet does not use the same logic or code as fbroc, and that the R code is much slower.

# Function to calculate all achievable TPR/FPR pairs
get.roc.curve <- function(pred, true.class) {
  # every cutoff gives the same result as one of these
  thresholds <- c(min(pred) - 1, sort(unique(pred)))
  tpr <- rep(1, length(thresholds)) # allocate space for true
  fpr <- rep(1, length(thresholds)) # and false positive rate
  n.pos <- sum(true.class)
  n.neg <- sum(!true.class)
  for (i in seq_along(thresholds)) {
    tpr[i] <- sum(true.class & (pred > thresholds[i])) / n.pos
    fpr[i] <- sum(!true.class & (pred > thresholds[i])) / n.neg
  }
  return(data.frame(TPR = tpr, FPR = fpr))
}
# Example
require(ggplot2)
set.seed(123)
true.class <- rep(c(TRUE, FALSE), each = 10) 
pred <- 1 * true.class + rnorm(20) #follows binormal model
result <- get.roc.curve(pred, true.class)
graph <- ggplot(data = result, aes(x = FPR, y = TPR)) + 
         geom_point(size = 1.5) + xlim(0, 1) +
         theme_bw()
print(graph)
Shows combinations of TPR/FPR that are achieved for different cutoffs given the example data.


2. Connecting the dots

To get from this set of points to the actual ROC curve, we need to connect the dots. This raises a question: what should our TPR be at an FPR that is never actually attained in the data? For example, if we have 80 negative samples, there is never a threshold at which the FPR is exactly 1%. There are two strategies:

  1. Connect the dots next to each other by straight lines.
  2. Since a high TPR and a low FPR are always preferable, we define the TPR at a fixed FPR of 1% to be the highest TPR we observe at an FPR lower than or equal to 1%.

Without ties both strategies turn out to be equivalent.
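Strategy 2 amounts to a simple lookup over the achieved points. As a minimal C++ sketch (the function name and signature are my own for illustration, not fbroc's internal code), given the achieved (FPR, TPR) pairs:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Strategy 2: the TPR at a fixed FPR is the highest TPR observed at
// any achievable FPR less than or equal to the target. The two input
// vectors hold the achieved (FPR, TPR) pairs.
double tpr_at_fpr(const std::vector<double>& fpr,
                  const std::vector<double>& tpr,
                  double target) {
  double best = 0.0;
  for (std::size_t i = 0; i < fpr.size(); ++i) {
    if (fpr[i] <= target) best = std::max(best, tpr[i]);
  }
  return best;
}
```

Without ties, neighboring points differ in only one coordinate, so this lookup traces the same staircase you get by connecting the dots with straight lines; with ties, the two can differ.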

ROC curve consisting of connected points

The points have been connected to form the full ROC curve. Both strategies yield the same curve.

In the presence of ties, however, this is no longer the case. The following example demonstrates the difference.

An example where it matters

Let us generate more example data first:

true.class <- rep(c(TRUE, FALSE), each = 1000)
pred <- ifelse(true.class, sample(7:14, 1000, replace = TRUE),
                           sample(1:8, 1000, replace = TRUE))

There are many ties in this data, but the ties that matter are at the values 7 and 8, because these values occur in both positive and negative samples. If we look at the TPR/FPR combinations achieved, we obtain the following picture. Note the diagonal displacements in the upper left corner.

TPR/FPR points are not next to each other on a vertical or horizontal line, due to the presence of ties.

TPR/FPR combinations that are achieved for a predictor with ties. This causes the diagonal displacement between neighboring points.

Depending on how we construct the ROC curve with the new development version of fbroc, we get two different shapes:

require(fbroc)
require(gridExtra)
pred <- as.numeric(pred)
graph1 <- plot(boot.roc(pred, true.class, n.boot = 1e4, 
                        tie.strategy = 1), show.conf = FALSE)
graph2 <- plot(boot.roc(pred, true.class, n.boot = 1e4, 
                        tie.strategy = 2), show.conf = FALSE)
grid.arrange(graph1, graph2, ncol = 2)
Two roc curves, one with diagonal connections, one with discontinuities

ROC curves with ties drawn with fbroc. The left-hand graph is drawn using strategy 1, the right-hand graph is drawn using strategy 2 (fbroc default).

From the point of view of the end user (e.g. a doctor making a diagnosis), I strongly prefer the second strategy. If the user asks for the TPR at a fixed FPR of 0.1 and gets a result of 0.8, they assume that this is a realistic combination of TPR and FPR.

With the second strategy, the user is assured that a cutoff exists for which we obtain a TPR of 0.8 and an FPR smaller than or equal to 0.1. That is, there is a cutoff at which the performance is at least as good as stated. With the first strategy, on the other hand, the user will likely have to settle for either a smaller TPR or a higher FPR than reported.

Most other packages (e.g. pROC and ROCR) seem to use the first strategy. One possible reason is that the Area Under the Curve (AUC) is equivalent to the Mann-Whitney U statistic only when we follow the first approach.
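The Mann-Whitney view of the AUC can be spelled out directly: every positive/negative pair contributes 1 if the positive sample scores higher, and 1/2 on a tie. A naive O(n²) C++ sketch for illustration only (this is not fbroc's implementation):

```cpp
#include <vector>

// AUC as the normalized Mann-Whitney U statistic. Ties contribute 0.5,
// which corresponds to the diagonal segments produced by strategy 1.
double auc_mann_whitney(const std::vector<double>& pos,
                        const std::vector<double>& neg) {
  double u = 0.0;
  for (double p : pos) {
    for (double n : neg) {
      if (p > n)       u += 1.0;
      else if (p == n) u += 0.5;
    }
  }
  return u / (pos.size() * neg.size());
}
```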

How does fbroc handle this situation?

As written above, I personally favor the second way of connecting the dots, since I consider it to be more faithful to what we want the ROC curve to represent. After quite a bit of thinking, this is what I decided:

  1. By default, the ROC curve and its confidence region will be visualized using the second strategy when there are ties. If you want, you can force fbroc to use the first strategy instead.
  2. When estimating the TPR at a fixed FPR or vice versa, the second strategy will always be used.
  3. When calculating the AUC, fbroc will use the first strategy instead. This way the AUC remains equivalent to the Mann-Whitney U statistic and to the values given by other R packages (ROCR, pROC).

The drawback is that, in the presence of ties, the AUC will no longer be equal to the area under the ROC curve as plotted. Also, if you force fbroc to use strategy 1 to draw the ROC curve, the confidence region is currently still based on strategy 2. This can lead to the ROC curve lying outside its own confidence region.

Two roc curves, one with diagonal connections, one with discontinuities

ROC curves with ties drawn with fbroc. The left-hand graph is drawn using strategy 1, the right-hand graph using strategy 2 (default). The confidence regions shown are always based on the second strategy.

I will consider allowing the user to configure this behavior more to his or her liking, but for now this seems like a reasonable compromise. This choice seems to be unique to fbroc, since both pROC and ROCR follow the first strategy. In the meantime, remember that none of this makes any difference if there are no ties in your predictor.

Getting highlighting with SyntaxHighlighter Evolved to work

I had some problems getting syntax highlighting to work correctly on this site. First, I tried using a combination of Jetpack Markdown and SyntaxHighlighter Evolved to write my posts. But when I previewed or published a post, several special characters were turned into HTML entities. For example, the less-than symbol < was rendered as &lt; instead. At one point this even corrupted my post, so that the HTML code was permanently saved into my Markdown text. Fortunately, WordPress allows you to go back to earlier revisions of your posts.

A Google search showed that the problem is not unknown, but I did not find a solution quickly. So I tried WP Code Prettify instead. This worked, but I didn't like the look of it too much, so I experimented some more and finally got SyntaxHighlighter Evolved to work correctly.

Instead of using the triple backticks ```, I had to start the code block with the language-specific shortcode, e.g. cpp in square brackets for C++. Interestingly enough, just one code block in this form also caused all the other code blocks to render correctly, even though I had not changed them yet. I suppose it triggered some sort of different preprocessing of my post. Hopefully this will be fixed in the future, so that you can use more standard Markdown instead.
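For illustration, a post source using the shortcode form looks roughly like this (the exact shortcode names and attributes supported by SyntaxHighlighter Evolved may differ from this sketch):

```
[cpp]
int add(int a, int b) {
    return a + b;
}
[/cpp]
```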

Maybe this will help someone with the same problem in the future. It also showed me that I have to edit my code blocks so that you don't need a scroll bar to read them. Personally, I like to work with up to 100 characters per line instead of the usual 80, but this seems to be too many for a line to display in WordPress without scrolling.

An example

A C++ code block rendered with SyntaxHighlighter Evolved


Using classes to write fbroc

Currently I am working on the next version of fbroc. One important new feature will be the analysis of paired ROC curves, which is important when you compare two classifiers, for example a new diagnostic method against the state of the art.

Before doing this I wanted to improve my C++ code. In the first version I didn't use any C++ features beyond the vector classes in Rcpp. Without classes you usually need many more arguments per function, and it is harder to reuse code efficiently. Implementing paired ROC curves is much more natural if you treat a paired ROC curve as a class containing two single ROC curve objects; this makes the implementation more straightforward and the code more maintainable.

After realizing this, I refactored the fbroc C++ code so that everything related to ROC curves is encapsulated in a class named ROC.
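The design idea can be sketched like this (all names and signatures here are illustrative, not fbroc's actual C++ interface): a paired analysis simply holds two single-curve objects built against the same class labels.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative single-curve class: it owns the predictions and labels
// that the bootstrap would repeatedly resample.
class ROC {
 public:
  ROC(std::vector<double> pred, std::vector<bool> true_class)
      : pred_(std::move(pred)), true_class_(std::move(true_class)) {}
  std::size_t n() const { return pred_.size(); }
 private:
  std::vector<double> pred_;
  std::vector<bool> true_class_;
};

// A paired analysis is then two ROC objects sharing one label vector;
// a paired bootstrap would draw one set of resampling indices and
// apply it to both members.
class PairedROC {
 public:
  PairedROC(std::vector<double> pred1, std::vector<double> pred2,
            std::vector<bool> true_class)
      : roc1_(std::move(pred1), true_class),
        roc2_(std::move(pred2), std::move(true_class)) {}
  const ROC& first() const { return roc1_; }
  const ROC& second() const { return roc2_; }
 private:
  ROC roc1_;
  ROC roc2_;
};
```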

Performance

When I tested my first implementation of the ROC class on example data consisting of 1000 observations, I was a bit disappointed with the performance: the new code turned out to be about 30-40% slower than the old one.

However, when I took a careful look at what went on, I found the reason: unnecessary memory allocations. In many cases I allocated a new NumericVector or IntegerVector for each bootstrap replicate, even though the size of the vector remains constant while bootstrapping.

As an example, compare the following code snippets. In both cases index_pos and index_neg are members of ROC.

With memory allocation:

void ROC::strat_shuffle(IntegerVector &shuffle_pos, IntegerVector &shuffle_neg) {
  index_pos = NumericVector(n_pos);
  index_neg = NumericVector(n_neg);
  for (int i = 0; i < n_pos; i++) {
    index_pos[i] = original_index_pos[shuffle_pos[i]];
  }
  for (int i = 0; i < n_neg; i++) {
    index_neg[i] = original_index_neg[shuffle_neg[i]];
  }
  // recalculate ROC after bootstrap
  reset_delta();
  get_positives_delta();
  get_positives();
  get_rate();
}

Without memory allocation:

void ROC::strat_shuffle(IntegerVector &shuffle_pos, IntegerVector &shuffle_neg) {
  for (int i = 0; i < n_pos; i++) {
    index_pos[i] = original_index_pos[shuffle_pos[i]];
  }
  for (int i = 0; i < n_neg; i++) {
    index_neg[i] = original_index_neg[shuffle_neg[i]];
  }
  // recalculate ROC after bootstrap
  reset_delta();
  get_positives_delta();
  get_positives();
  get_rate();
}

Benchmark

The graph below shows the performance of the new code versus the old code used in fbroc 0.1.0.

The new version takes less time until we have more than 10000 samples per group; afterwards both versions perform the same.

Time used by fbroc 0.1.0 vs time used by the unreleased class-based version of fbroc

To generate it, I used a slightly modified version of the script used here.

Since the time for a memory allocation usually does not depend on the size of the memory being allocated, the overhead stops mattering when the number of observations gets very large. But if you have more than 10000 observations per group, you probably don't need to bootstrap anyway.
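The effect can be reduced to a minimal pattern (a generic C++ illustration, not fbroc code): filling a freshly allocated buffer on every iteration versus reusing one allocated up front. Both variants compute the same result; only the allocation behavior differs, and a real comparison would time the two, e.g. with <chrono>.

```cpp
#include <numeric>
#include <vector>

// Allocates a new buffer on every pass, mirroring the first snippet.
long long sum_with_alloc(int iters, int n) {
  long long total = 0;
  for (int it = 0; it < iters; ++it) {
    std::vector<int> buf(n);               // fresh allocation each pass
    std::iota(buf.begin(), buf.end(), 0);  // fill with 0..n-1
    total += std::accumulate(buf.begin(), buf.end(), 0LL);
  }
  return total;
}

// Allocates once and reuses the buffer, like the second snippet.
long long sum_reuse(int iters, int n) {
  long long total = 0;
  std::vector<int> buf(n);                 // allocated once up front
  for (int it = 0; it < iters; ++it) {
    std::iota(buf.begin(), buf.end(), 0);
    total += std::accumulate(buf.begin(), buf.end(), 0LL);
  }
  return total;
}
```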

fbroc 0.1.0 release

Last week I released my first R package, fbroc. With fbroc you can use bootstrap analysis to quickly calculate confidence regions for the ROC curve itself as well as for derived performance metrics like the AUC.

Getting the package published was less work than I expected, because Hadley Wickham's new book R Packages was a big help in getting my very first submission ever accepted. I strongly recommend it to everyone just getting started with R package authorship.

The confidence interval for the AUC is [0.75, 0.8].

fbroc visualization of the ROC curve and the AUC including confidence intervals.

When you should use fbroc

Considering the number of R packages on CRAN, it is no surprise that there are several other packages for ROC curve analysis. Options include using boot (with ROCR) or the excellent pROC package.

My main priority in writing fbroc was to outperform the alternatives in terms of speed. If you just want to bootstrap a single ROC curve, then pROC is probably the better choice at this point in time, mainly due to its rich feature set.

However, when conducting simulation studies with a large number of bootstrap replicates, or when you want a fast response time, I hope that you will be happier with fbroc. In fact, one main motivation was embedding fbroc in a Shiny app that doesn't force the user to wait a few minutes after uploading data. My own implementation is already hosted here on my Shiny server.

For information on how to start using fbroc in your R scripts, please go here.

Benchmark

A short R script will serve nicely to demonstrate the performance of fbroc relative to pROC, ROCR and boot.

require(fbroc)
require(pROC)
require(boot)
require(ROCR)
require(ggplot2)

roc.stat <- function(x, index, true.class) {
  performance(prediction(x[index], true.class[index]), "auc")@y.values[[1]]
}

boot.with.ROCR <- function(x, y) {
  boot.ci(boot(data = x, statistic = roc.stat, R = 1000,
               strata = y, true.class = y), type = "perc")
}

n.seq <- c(5:25, seq(30, 50, 5), 65, 75, 100, 130, 150, 175, 250, 375, 500, 750,
           1000, 1250, 1500, 2000, 2500, 3000, 4000, 5000)
length.seq <- length(n.seq)

time.fbroc <- time.pROC1 <- time.pROC2 <- time.pROC3 <- 
              time.boot.ROCR <- rep(0, length.seq)

for (i in 1:length.seq) {
  n <- n.seq[i]  # samples per group
  y <- rep(c(TRUE, FALSE), each = n)
  x <- rnorm(2*n) + 1.5 * y
  time.fbroc[i] <- system.time(perf.roc(boot.roc(x, y, n.boot = 1000), "auc"))[3]
  time.pROC1[i] <- system.time(ci.auc(roc(y, x, algorithm = 1), method = "bootstrap", 
                                      boot.n = 1000, progress = "none"))[3]
  time.pROC2[i] <- system.time(ci.auc(roc(y, x, algorithm = 2), method = "bootstrap", 
                                      boot.n = 1000, progress = "none"))[3]
  time.pROC3[i] <- system.time(ci.auc(roc(y, x, algorithm = 3), method = "bootstrap", 
                                      boot.n = 1000, progress = "none"))[3]
  time.boot.ROCR[i] <- system.time(boot.with.ROCR(x, y))[3]
}
fbroc outperforms all other options by at least one order of magnitude

Log-log plot showing a benchmark of the calculation time for pROC, ROCR and fbroc given different sample sizes.

Note that at low n, overhead is the dominant factor for fbroc, ROCR and pROC algorithm 3, while fbroc, ROCR and pROC algorithm 2 show good asymptotic behavior.

One nice feature of fbroc is that it is consistently faster by at least one order of magnitude, while the relative performance of the other methods depends on sample size.

I will write more about the algorithm used by fbroc in the future, but it basically comes down to using an algorithm that scales linearly with the number of observations and implementing it efficiently in C++ via the package Rcpp.
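To sketch the idea (this is the standard sweep over sorted predictions, not necessarily fbroc's exact implementation): after one O(n log n) sort, every achievable TPR/FPR point falls out of a single linear pass, with tied prediction values collapsed into one point.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns all achievable (FPR, TPR) points for predictions paired with
// class labels (true = positive). Ties are handled by emitting a point
// only after every sample with the same prediction has been consumed.
std::vector<std::pair<double, double>>
roc_points(std::vector<std::pair<double, bool>> data) {
  std::sort(data.begin(), data.end(),
            [](const std::pair<double, bool>& a,
               const std::pair<double, bool>& b) { return a.first > b.first; });
  int n_pos = 0, n_neg = 0;
  for (const auto& d : data) (d.second ? n_pos : n_neg)++;
  std::vector<std::pair<double, double>> pts{{0.0, 0.0}};
  int tp = 0, fp = 0;  // running counts while sweeping the thresholds down
  for (std::size_t i = 0; i < data.size(); ++i) {
    (data[i].second ? tp : fp)++;
    if (i + 1 == data.size() || data[i + 1].first != data[i].first)
      pts.emplace_back(double(fp) / n_neg, double(tp) / n_pos);
  }
  return pts;
}
```

Up to ordering, this yields the same point set as the much slower R function from the first post above.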

Current features and future development

At the moment fbroc offers only a rather limited number of features:

  • Very fast bootstrapping of ROC curves.
  • Visualization of confidence regions for the ROC curve.
  • Analysis of the AUC including confidence intervals.

I will expand the scope of fbroc in the future and am already working on the next version, which should hopefully be out early next month. However, I wanted to get the first version, including the fast C++ algorithm, onto CRAN as quickly as possible, as I consider it the core feature of the package.

My long-term vision for fbroc includes:

  • Paired ROC curve analysis
  • Power calculations
  • Cutoff optimization
  • Additional performance metrics (e.g. partial AUC)
  • Low-memory mode

Paired ROC curve analysis is slated to be included in the next release.

The Shiny application built on top of fbroc will also be updated to support and implement the new features offered in future releases.

In the next weeks, I will also write a series of posts that will go into more detail about ROC curves and the algorithm implemented in fbroc.

Closing note

Since this is also my first post on this page, I will be especially grateful for any comments and suggestions on how to improve my writing style.