I am currently working on the next version of fbroc. One important new feature will be the analysis of paired ROC curves, which you need when comparing two classifiers. One example would be comparing a new diagnostic method to the state of the art.
Before doing this I wanted to improve my C++ code. In the first version I didn’t use any C++ features besides the Vector classes in Rcpp. Without classes you usually need many more arguments per function, and it is harder to reuse code efficiently. Implementing paired ROC curves is much more natural if you treat a paired ROC curve as a class containing two single ROC curve objects. This makes the implementation more straightforward and the code more maintainable.
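To illustrate the idea of composition, here is a minimal sketch. It is not the fbroc code: the names SingleROC, PairedROC and auc_difference are hypothetical, std::vector stands in for the Rcpp vector classes so that the example compiles on its own, and the AUC is computed by a simple pairwise comparison purely for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a single ROC curve object.
// std::vector replaces the Rcpp vector classes (an assumption of this sketch).
class SingleROC {
public:
    SingleROC(std::vector<double> pos, std::vector<double> neg)
        : pred_pos_(std::move(pos)), pred_neg_(std::move(neg)) {}

    // AUC as the fraction of (positive, negative) pairs that are ranked
    // correctly; ties count as one half.
    double auc() const {
        assert(!pred_pos_.empty() && !pred_neg_.empty());
        double wins = 0.0;
        for (double p : pred_pos_) {
            for (double n : pred_neg_) {
                wins += (p > n) ? 1.0 : (p == n ? 0.5 : 0.0);
            }
        }
        return wins / (static_cast<double>(pred_pos_.size()) * pred_neg_.size());
    }

private:
    std::vector<double> pred_pos_;  // predictions for the positive class
    std::vector<double> pred_neg_;  // predictions for the negative class
};

// A paired ROC curve simply owns two single ROC curve objects and
// delegates to them, so no single-curve logic has to be duplicated.
class PairedROC {
public:
    PairedROC(SingleROC first, SingleROC second)
        : first_(std::move(first)), second_(std::move(second)) {}

    double auc_difference() const { return first_.auc() - second_.auc(); }

private:
    SingleROC first_;
    SingleROC second_;
};
```

Everything specific to one curve lives in the single-curve class, and the paired class only adds what is specific to the comparison.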
After realizing this I refactored the fbroc C++ code, so that everything related to the ROC curves was encapsulated in a class named ROC.
Performance
When I tested my first implementation of the ROC class on example data consisting of 1000 observations, I was a bit disappointed with the performance. The new code turned out to be about 30-40% slower than the old.
However, when I took a careful look at what went on, I found the reason: unnecessary memory allocations. In many cases I allocated a new NumericVector or IntegerVector for each bootstrap replicate, even though the size of the vector remains constant while bootstrapping.
As an example, compare the following code snippets. In both cases ‘index_pos’ and ‘index_neg’ are members of ‘ROC’.
With memory allocation:
    void ROC::strat_shuffle(IntegerVector &shuffle_pos, IntegerVector &shuffle_neg) {
      index_pos = NumericVector(n_pos);
      index_neg = NumericVector(n_neg);
      for (int i = 0; i < n_pos; i++) {
        index_pos[i] = original_index_pos[shuffle_pos[i]];
      }
      for (int i = 0; i < n_neg; i++) {
        index_neg[i] = original_index_neg[shuffle_neg[i]];
      }
      // recalculate ROC after bootstrap
      reset_delta();
      get_positives_delta();
      get_positives();
      get_rate();
    }
Without memory allocation:
    void ROC::strat_shuffle(IntegerVector &shuffle_pos, IntegerVector &shuffle_neg) {
      for (int i = 0; i < n_pos; i++) {
        index_pos[i] = original_index_pos[shuffle_pos[i]];
      }
      for (int i = 0; i < n_neg; i++) {
        index_neg[i] = original_index_neg[shuffle_neg[i]];
      }
      // recalculate ROC after bootstrap
      reset_delta();
      get_positives_delta();
      get_positives();
      get_rate();
    }
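The second variant only works if the member vectors already have the right size before the first shuffle. A minimal sketch of that pattern follows, assuming the buffer is allocated once in the constructor. The class name IndexShuffler is hypothetical and std::vector<int> stands in for IntegerVector so that the example compiles without Rcpp.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, reduced version of the index handling in a ROC-like class.
// std::vector<int> replaces Rcpp's IntegerVector (an assumption of this sketch).
class IndexShuffler {
public:
    explicit IndexShuffler(std::vector<int> original)
        : original_index_(std::move(original)),
          index_(original_index_.size()) {}  // the only allocation of index_

    // Overwrite index_ in place; meant to be called once per bootstrap replicate.
    void shuffle(const std::vector<int>& shuffle_idx) {
        assert(shuffle_idx.size() == index_.size());
        for (std::size_t i = 0; i < index_.size(); ++i) {
            index_[i] = original_index_[shuffle_idx[i]];
        }
    }

    const std::vector<int>& index() const { return index_; }

private:
    std::vector<int> original_index_;  // fixed for the lifetime of the object
    std::vector<int> index_;           // reused buffer, same size every replicate
};
```

Because the buffer is sized once and then only overwritten, repeated calls to shuffle do not touch the allocator at all.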
Benchmark
The graph below shows the performance of the new code vs the old used in fbroc 0.1.0.
To generate it, I used a slightly modified version of the script used here.
Since the time for a memory allocation usually depends little on the size of the memory being allocated, the overhead stops mattering once the number of observations gets very large. But if you have more than 10000 observations per group, you probably don’t need to bootstrap anyway.
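One rough way to picture this, assuming each bootstrap replicate pays a roughly fixed allocation cost a and a computation cost f(n) that grows with the number of observations n (both symbols are illustrative, not measured quantities):

```latex
\text{relative overhead}(n) = \frac{a}{a + f(n)} \longrightarrow 0
\quad \text{as } n \to \infty
```

For small n the fixed cost a makes up a noticeable share of each replicate. For large n it is dominated by f(n).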