Using classes to write fbroc

By | May 17, 2015

Currently I am working on the next version of fbroc. One important new feature will be the analysis of paired ROC curves, which you need when you compare two classifiers. One example would be comparing a new diagnostic method to the state of the art.

Before doing this I wanted to improve my C++ code. In the first version I didn’t use any C++ features besides the Vector classes in Rcpp. Without classes you usually need many more arguments per function, and it is harder to reuse code efficiently. Implementing paired ROC curves is much more natural if you treat a paired ROC curve as a class containing two single ROC curve objects. This makes the implementation more straightforward and the code more maintainable.
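To illustrate the composition idea, here is a minimal sketch in plain C++ (no Rcpp). The class names `SingleRoc` and `PairedRoc`, the `auc()` method, and the `auc_difference()` method are hypothetical and are not taken from the fbroc source; the AUC is computed with a simple pairwise comparison only to keep the example short.

```cpp
#include <utility>
#include <vector>

// Hypothetical single ROC curve: owns the scores of both classes.
class SingleRoc {
public:
    SingleRoc(std::vector<double> pos, std::vector<double> neg)
        : pos_(std::move(pos)), neg_(std::move(neg)) {}

    // AUC as the fraction of (positive, negative) pairs that are ranked
    // correctly; ties count as half. Quadratic, but fine for a sketch.
    double auc() const {
        if (pos_.empty() || neg_.empty()) return 0.0;
        double wins = 0.0;
        for (double p : pos_) {
            for (double n : neg_) {
                if (p > n) wins += 1.0;
                else if (p == n) wins += 0.5;
            }
        }
        return wins / (static_cast<double>(pos_.size()) *
                       static_cast<double>(neg_.size()));
    }

private:
    std::vector<double> pos_;
    std::vector<double> neg_;
};

// Hypothetical paired ROC curve: simply contains two single curves,
// so every per-curve computation is reused instead of duplicated.
class PairedRoc {
public:
    PairedRoc(SingleRoc first, SingleRoc second)
        : first_(std::move(first)), second_(std::move(second)) {}

    double auc_difference() const { return first_.auc() - second_.auc(); }

private:
    SingleRoc first_;
    SingleRoc second_;
};
```

The point of the design is that `PairedRoc` needs no extra function arguments: each contained object already carries its own state.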

After realizing this I refactored the fbroc C++ code, so that everything related to the ROC curves was encapsulated in a class named ROC.

Performance

When I tested my first implementation of the ROC class on example data consisting of 1000 observations, I was a bit disappointed with the performance. The new code turned out to be about 30-40% slower than the old.

However, when I took a careful look at what went on, I found the reason: unnecessary memory allocations. In many cases I allocated a new NumericVector or IntegerVector for each bootstrap replicate, even though the size of the vector remains constant while bootstrapping.

As an example, compare the following code snippets. In both cases, ‘index_pos’ and ‘index_neg’ are members of ‘ROC’.

With memory allocation:

void ROC::strat_shuffle(IntegerVector &shuffle_pos, IntegerVector &shuffle_neg) {
    // new vectors are allocated on every call, i.e. for every bootstrap replicate
    index_pos = NumericVector(n_pos);
    index_neg = NumericVector(n_neg);
    for (int i = 0; i < n_pos; i++) {
        index_pos[i] = original_index_pos[shuffle_pos[i]];
    }
    for (int i = 0; i < n_neg; i++) {
        index_neg[i] = original_index_neg[shuffle_neg[i]];
    }
    // recalculate ROC after bootstrap
    reset_delta();
    get_positives_delta();
    get_positives();
    get_rate();
}

Without memory allocation:

void ROC::strat_shuffle(IntegerVector &shuffle_pos, IntegerVector &shuffle_neg) {
    // the existing member vectors are overwritten; nothing is allocated here
    for (int i = 0; i < n_pos; i++) {
        index_pos[i] = original_index_pos[shuffle_pos[i]];
    }
    for (int i = 0; i < n_neg; i++) {
        index_neg[i] = original_index_neg[shuffle_neg[i]];
    }
    // recalculate ROC after bootstrap
    reset_delta();
    get_positives_delta();
    get_positives();
    get_rate();
}
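The same pattern can be shown outside of Rcpp. The following is only a sketch using `std::vector`; the class `Resampler` and its members are hypothetical names, not part of fbroc. The buffer is sized once in the constructor, and each call to `shuffle` only overwrites it.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the resampling step of a bootstrap.
class Resampler {
public:
    // The index buffer is allocated exactly once, here.
    explicit Resampler(std::vector<int> original)
        : original_(std::move(original)), index_(original_.size()) {}

    // Overwrites the preallocated buffer in place.
    // Assumes draw has at least as many entries as the buffer and that
    // every entry is a valid position in the original data.
    const std::vector<int>& shuffle(const std::vector<int>& draw) {
        for (std::size_t i = 0; i < index_.size(); ++i) {
            index_[i] = original_[draw[i]];
        }
        return index_;
    }

    // Exposed only so that a caller can check the storage is reused.
    const int* buffer_address() const { return index_.data(); }

private:
    std::vector<int> original_;  // declared first: index_ depends on its size
    std::vector<int> index_;
};
```

Calling `shuffle` repeatedly leaves `buffer_address()` unchanged, which is a simple way to confirm that no reallocation takes place between replicates.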

Benchmark

The graph below shows the performance of the new code vs the old used in fbroc 0.1.0.

The new version takes less time until there are more than 10000 samples per group. Beyond that, both versions perform the same.

Time used by fbroc 0.1.0 vs time used by the unreleased class-based version of fbroc

To generate it, I used a slightly modified version of the script used here.

Since the time for a memory allocation usually does not depend on the size of the memory being allocated, the overhead stops mattering when the number of observations gets very large. And if you have more than 10000 observations per group, you probably don’t need to bootstrap anyway.
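If you want to check this effect on your own machine, a rough timing sketch in plain C++ could look like the following. This is not the benchmark script used for the graph; the function names `run_replicates` and `time_replicates` are made up for this example, and the per-replicate work is only a placeholder for the real ROC computation.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Fills a buffer once per replicate and returns a checksum so that the
// compiler cannot discard the work. With reuse == false a fresh vector is
// allocated for every replicate. Assumes n > 0.
long long run_replicates(std::size_t n, int replicates, bool reuse) {
    std::vector<int> buffer(n);
    long long checksum = 0;
    for (int r = 0; r < replicates; ++r) {
        if (!reuse) {
            buffer = std::vector<int>(n);  // new allocation every replicate
        }
        for (std::size_t i = 0; i < n; ++i) {
            buffer[i] = static_cast<int>(i) + r;
        }
        checksum += buffer[n / 2];
    }
    return checksum;
}

// Wall-clock time in seconds for one configuration.
double time_replicates(std::size_t n, int replicates, bool reuse) {
    auto start = std::chrono::steady_clock::now();
    volatile long long sink = run_replicates(n, replicates, reuse);
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Comparing `time_replicates(n, replicates, true)` with `time_replicates(n, replicates, false)` for increasing `n` should show whether the relative gap shrinks as the per-replicate work grows. The exact numbers depend on the compiler, the allocator, and the optimization level.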
