While working on the next version of fbroc I found a bug that caused fbroc to produce incorrect results when the numerical predictor variable includes ties, that is, if not every value is unique. I have already fixed this bug in the development version, so if you have ties in your predictor I recommend trying the development version instead of fbroc 0.1.0. Instructions for the installation can be found here.

More interestingly, the bug led me to consider some conceptual difficulties with the construction of ROC curves when ties are present.

# Constructing the ROC curve

### 1. Calculating TPR and FPR for all cutoffs present in the data

Let us first focus on the combinations of TPR and FPR that are actually achieved for the given data. Remember that if we have n positive samples, the TPR at any cutoff must be a multiple of 1/n. Note that the following code snippet does not use the same logic or code as fbroc and that this R code is **much slower**.

    # Function to calculate TPR/FPR pairs
    get.roc.curve <- function(pred, true.class) {
      # every cutoff gives the same result as one of these
      thresholds <- c(min(pred) - 1, sort(unique(pred)))
      tpr <- rep(1, length(thresholds)) # allocate space for true
      fpr <- rep(1, length(thresholds)) # and false positive rate
      n.pos <- sum(true.class)
      n.neg <- sum(!true.class)
      for (i in seq_along(thresholds)) {
        tpr[i] <- sum(true.class & (pred > thresholds[i])) / n.pos
        fpr[i] <- sum(!true.class & (pred > thresholds[i])) / n.neg
      }
      return(data.frame(TPR = tpr, FPR = fpr))
    }

    # Example
    require(ggplot2)
    set.seed(123)
    true.class <- rep(c(TRUE, FALSE), each = 10)
    pred <- 1 * true.class + rnorm(20) # follows binormal model
    result <- get.roc.curve(pred, true.class)
    graph <- ggplot(data = result, aes(x = FPR, y = TPR)) +
      geom_point(size = 1.5) +
      xlim(0, 1) +
      theme_bw()

### 2. Connecting the dots

To get from this set of points to the actual ROC curve, we need to connect the dots. This raises a question: what should our TPR be at an FPR that is never actually attained in the data? For example, if we have 80 negative samples, the FPR can only be a multiple of 1/80 = 1.25%, so there is never a threshold at which we have an FPR of exactly 1%. There are two strategies:

- Connect the dots next to each other by straight lines.
- Since a high TPR and a low FPR are always preferable, define the TPR at a fixed FPR of 1% to be the highest TPR we observe at an FPR lower than or equal to 1%.

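Without ties both strategies turn out to be equivalent: moving the cutoff past a single sample changes either the TPR or the FPR, never both, so neighboring points are always joined by a purely horizontal or purely vertical line.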

Given the existence of ties, this is no longer the case. The following example will demonstrate the differences.
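Before the larger example, a minimal sketch of the two strategies in base R may help. It is not how fbroc computes anything; the toy data and the helper names `tpr.at.fpr.linear` and `tpr.at.fpr.step` are made up for illustration. The score 2 is shared by one positive and one negative sample, so the two strategies disagree at an FPR of 0.25.

```r
# Hypothetical toy data: the score 2 occurs in both classes (a tie that matters).
pred       <- c(3, 2, 2, 1)
true.class <- c(TRUE, TRUE, FALSE, FALSE)

# Achieved (FPR, TPR) points, one per distinct cutoff; a sample counts as
# positive when pred > threshold.
thresholds <- c(min(pred) - 1, sort(unique(pred)))
tpr <- sapply(thresholds, function(t) mean(pred[true.class] > t))
fpr <- sapply(thresholds, function(t) mean(pred[!true.class] > t))

# Strategy 1 (illustrative helper): linear interpolation between the points.
# ties = max keeps the highest TPR where several points share one FPR.
tpr.at.fpr.linear <- function(fpr, tpr, target) {
  approx(x = fpr, y = tpr, xout = target, ties = max)$y
}

# Strategy 2 (illustrative helper): best TPR actually achieved at FPR <= target.
tpr.at.fpr.step <- function(fpr, tpr, target) {
  max(tpr[fpr <= target])
}

tpr.at.fpr.linear(fpr, tpr, 0.25) # 0.75
tpr.at.fpr.step(fpr, tpr, 0.25)   # 0.5
```

No cutoff in this toy data reaches a TPR of 0.75 while keeping the FPR at or below 0.25; the best achievable TPR under that constraint is 0.5.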

# An example where it matters

Let us generate more example data first:

    true.class <- rep(c(TRUE, FALSE), each = 1000)
    pred <- ifelse(true.class,
                   sample(7:14, 1000, replace = TRUE),
                   sample(1:8, 1000, replace = TRUE))

There are many ties in that data, but the ties that matter are 7 and 8, because they include both positive and negative samples. If we look at the TPR/FPR combinations achieved, we obtain the following picture. Note the diagonal displacements in the upper left corner.
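One way to check which ties matter is to look for steps between neighboring cutoffs that change the TPR and the FPR at the same time. The sketch below regenerates the data in base R; the seed is an assumption added here for reproducibility, since the post does not fix one for this data.

```r
set.seed(123) # assumption: seed chosen here, not taken from the post
true.class <- rep(c(TRUE, FALSE), each = 1000)
pred <- ifelse(true.class,
               sample(7:14, 1000, replace = TRUE),
               sample(1:8, 1000, replace = TRUE))

# Predictor values that occur in both classes.
counts <- table(pred, true.class)
both <- counts[, "TRUE"] > 0 & counts[, "FALSE"] > 0
tied.values <- as.numeric(rownames(counts)[both])

# Achieved TPR/FPR pairs for every distinct cutoff (pred > threshold).
thresholds <- c(min(pred) - 1, sort(unique(pred)))
tpr <- sapply(thresholds, function(t) mean(pred[true.class] > t))
fpr <- sapply(thresholds, function(t) mean(pred[!true.class] > t))

# A step is diagonal when raising the cutoff to the next value changes both
# rates, which can only happen if that value is shared by both classes.
diagonal <- diff(tpr) != 0 & diff(fpr) != 0
diagonal.values <- thresholds[-1][diagonal]

tied.values     # 7 8
diagonal.values # 7 8
```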

Depending on how we construct the ROC curve with the new development version of fbroc, we get two different shapes:

    require(fbroc)
    require(gridExtra)
    pred <- as.numeric(pred)
    graph1 <- plot(boot.roc(pred, true.class, n.boot = 1e4, tie.strategy = 1),
                   show.conf = FALSE)
    graph2 <- plot(boot.roc(pred, true.class, n.boot = 1e4, tie.strategy = 2),
                   show.conf = FALSE)
    grid.arrange(graph1, graph2, ncol = 2)

From the point of view of the end user (e.g. a doctor making a diagnosis), I strongly prefer the second strategy. If the user asks for the TPR at a fixed FPR of 0.1 and gets a result of 0.8, they will assume that this is a realistic combination of TPR and FPR.

With the second strategy the user is assured that a cutoff exists for which we obtain a TPR of 0.8 and an FPR smaller than or equal to 0.1. That is, there is a cutoff at which the performance is *at least as good* as stated. On the other hand, with the first strategy it is likely that the user will have to settle for either a smaller TPR or a higher FPR than reported.

Most other packages (e.g. pROC and ROCR) seem to use the first strategy. One possible reason is that the Area Under the Curve (AUC) is equivalent to the Mann-Whitney U statistic only when we follow the first approach.
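The equivalence can be checked numerically. The following sketch uses hypothetical tied data, integrates the straight-line curve with the trapezoidal rule, and compares the result to the Mann-Whitney U statistic scaled by the number of positive/negative pairs (ties count one half). It is an illustration in base R, not the code used by fbroc, pROC or ROCR.

```r
set.seed(42) # hypothetical data with deliberate ties
pos <- sample(3:8, 50, replace = TRUE) # scores of the positive samples
neg <- sample(1:6, 50, replace = TRUE) # scores of the negative samples

# Cutoffs in decreasing order, so the curve runs from (0, 0) to (1, 1).
thresholds <- sort(unique(c(pos, neg, min(pos, neg) - 1)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(pos > t))
fpr <- sapply(thresholds, function(t) mean(neg > t))
n <- length(thresholds)

# Strategy 1: area under the straight-line curve (trapezoidal rule).
auc.linear <- sum(diff(fpr) * (tpr[-1] + tpr[-n]) / 2)

# Mann-Whitney: share of (positive, negative) pairs ranked correctly,
# with tied pairs counted as one half.
mann.whitney <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Cross-check against the W statistic reported by wilcox.test.
u.stat <- unname(wilcox.test(pos, neg, exact = FALSE)$statistic) /
  (length(pos) * length(neg))

all.equal(auc.linear, mann.whitney) # TRUE
all.equal(auc.linear, u.stat)       # TRUE
```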

# How does fbroc handle this situation?

As written above, I personally favor the second way of connecting the dots, since I consider it to be more faithful to what we want the ROC curve to represent. After quite a bit of thinking, this is what I decided:

- By default, the ROC curve and its confidence region will be visualized using the second strategy when there are ties. If you want, you can force fbroc to use the first strategy instead.
- When estimating the TPR at a fixed FPR or vice versa, the second strategy will always be used.
- When calculating the AUC, fbroc will use the first option instead. This way the AUC remains equivalent to the Mann-Whitney U statistic and to that given by other R packages (ROCR, pROC).

The drawback is that in the presence of ties, the Area Under the Curve will no longer be equal to the area under the ROC curve as plotted. Also, if you force fbroc to use strategy 1 to draw the ROC curve, the confidence region is currently still based on strategy 2. This can lead to the ROC curve lying outside the confidence region.
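To get a feeling for the size of this mismatch, the sketch below computes both areas for a hypothetical four-sample data set in which the score 2 occurs in both classes. It only illustrates the geometry and does not call fbroc.

```r
# Hypothetical toy data: the score 2 is shared by both classes.
pos <- c(3, 2)
neg <- c(2, 1)

# Cutoffs in decreasing order, so the points run from (0, 0) to (1, 1).
thresholds <- sort(unique(c(pos, neg, min(pos, neg) - 1)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(pos > t)) # 0.0 0.5 1.0 1.0
fpr <- sapply(thresholds, function(t) mean(neg > t)) # 0.0 0.0 0.5 1.0
n <- length(thresholds)

# Strategy 1: area under the straight-line curve (the reported AUC).
area.linear <- sum(diff(fpr) * (tpr[-1] + tpr[-n]) / 2)

# Strategy 2: area under the step curve, which holds the TPR of the left
# point until the next achieved FPR is reached.
area.step <- sum(diff(fpr) * tpr[-n])

area.linear # 0.875
area.step   # 0.75
```

In this toy example the gap of 0.125 is half the share of tied positive/negative pairs (one pair out of four), which is exactly the half credit the Mann-Whitney statistic gives to ties.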

I will consider making this behavior more configurable, but for now this seems like a reasonable compromise. This choice seems to be unique to fbroc, since both pROC and ROCR follow the first strategy instead. In the meantime, remember that this makes absolutely no difference if you do not have any ties in your predictor.