Monthly Archives: September 2015

Thoughts about Harrell’s Regression Modeling Strategies

This month I picked up the new second edition of the classic Regression Modeling Strategies by Frank E. Harrell, Jr. I have had a copy of the first edition at my workplace for the last few years, but at work I never had the time to study it from cover to cover. The release of the updated second edition was a great opportunity to pick up an improved version of the book for myself.

Book Cover of the Second Edition of Regression Modeling Strategies by Frank E. Harrell, Jr.

Since then I have had time for a first cursory read, so for now I am just writing down some first impressions. Don’t consider this a full and informed book review at this point.

The book covers these three main areas:

  • General modelling concerns such as validation, feature selection and the handling of missing data
  • An introduction to maximum likelihood estimation and the various types of regression models (e.g. logistic and survival models)
  • Case studies, in which the regression modeling strategies are applied to real problems

For me the case studies are the most valuable part of the book. They offer great insight into how an experienced analyst can avoid many not-so-obvious pitfalls and get the most out of the collected data. The further reading sections at the end of each chapter offer relevant excerpts and whet your appetite for more.

If I have the time for a full review, I will go into more detail later. For now, it suffices to say that there is a good reason Regression Modeling Strategies is considered a classic on its subject.

Another thing worth mentioning is the tight integration with R. While the presented strategies can be implemented in other languages, all the example code is in R. There is also an R package, rms, accompanying the book. This new package replaces the old Design package, which has by now been removed from CRAN. There are many improvements, which deserve a post dedicated to rms alone.

Since R is my favorite statistics language this is good for me, but unfortunate for people using other options like SAS, which is still the language of choice for most statisticians in the pharma business.

I do not completely share the enthusiasm of the author regarding the usage of splines. There are many situations where they are an excellent choice, but their interpretation is not straightforward. This especially holds when interactions are concerned.
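To illustrate what I mean, here is a minimal Python sketch of a restricted cubic spline basis in its unnormalised truncated-power form, as I understand the construction. The knot positions are made up, the function name `rcs_basis` is my own, and the rms package may scale the terms differently, so treat this as an illustration and not as the book's code.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns [x, s_1, ..., s_(k-2)] for k knots, using the unnormalised
    truncated-power form. The spline is linear outside the outer knots.
    """
    def cube_plus(u):
        # (u)_+^3: cube of the positive part
        return u ** 3 if u > 0 else 0.0

    if len(knots) < 3:
        raise ValueError("need at least three knots")

    t_last, t_prev = knots[-1], knots[-2]
    width = t_last - t_prev

    basis = [float(x)]
    for t_j in knots[:-2]:
        term = (cube_plus(x - t_j)
                - cube_plus(x - t_prev) * (t_last - t_j) / width
                + cube_plus(x - t_last) * (t_prev - t_j) / width)
        basis.append(term)
    return basis


# Hypothetical knot positions, chosen only for illustration
KNOTS = [1.0, 2.0, 3.0, 4.0]

for value in (0.5, 2.5, 5.0):
    print(value, rcs_basis(value, KNOTS))
```

A single predictor turns into k-1 columns, and the fitted effect is the sum over all of them. No single coefficient can be read as a slope, so in practice one has to look at plots of the predicted curve, and with interactions every one of these columns gets multiplied with the other variable.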

This might be a matter of taste, but I do disagree with the recommendation to use significance tests for non-linearity to justify more complexity (said splines) to customers. I see the possible advantage, but this could well lead the customer to complain later about the use of a parametric regression model whenever the Shapiro-Wilk test shows a slight deviation from normality.

While I already mentioned how valuable the further reading sections are, the one in the second chapter includes some comments on the AUC and ROC curves by Briggs and Zaretzki which read like a bit of a strawman attack to me. To be precise, I have not yet fully read the cited papers themselves, so I may well change my opinion later.

Even in that case, my comment would still apply to the way they are cited. In addition, as the author of the fbroc package I cannot deny a potential bias on my side.

They write:

The receiver of a diagnostic measurement… wants to make a decision based on some [cutoff] and is not especially interested in how well he would have done had he [used] some different cutoff.

This is certainly true, but that is also not the purpose for which a ROC curve should be used. For example, the ROC curve helps to decide whether a diagnostic test is better suited for confirming a suspicion or for screening a large population. The two settings have different prevalences of the disease. This in turn requires us to consider different trade-offs between sensitivity and specificity.
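To make the prevalence argument concrete, here is a minimal Python sketch that applies Bayes’ theorem to one operating point on a ROC curve. The sensitivity, specificity and prevalence values are invented for illustration and do not come from the book or from any real test.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) of a test at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical operating point on the ROC curve
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# Hypothetical prevalences: rare disease in a screened population,
# common disease among patients who are already under suspicion
screening = predictive_values(SENSITIVITY, SPECIFICITY, prevalence=0.01)
confirmation = predictive_values(SENSITIVITY, SPECIFICITY, prevalence=0.50)

print("screening:    PPV=%.3f NPV=%.3f" % screening)
print("confirmation: PPV=%.3f NPV=%.3f" % confirmation)
```

With these made-up numbers the same operating point gives a positive predictive value of roughly 8% at a prevalence of 1%, but 90% at a prevalence of 50%. A screening test would therefore need a point on the curve with much higher specificity.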

The ROC curve offers a good starting point for these considerations. It is not intended for the end-user, especially the patient receiving the result of a test. The same argument applies to the comment of David Hand:

The relative importance of misclassifying a case as a noncase, compared to the reverse, cannot come from the data itself. It must come externally, from considerations of the severity one attaches to the different kinds of misclassification.

Again, this is true. But this severity is often patient-specific and depends on the application and target population you are considering for the diagnostic test. The ROC curve should obviously not be the primary endpoint of the development of a routine diagnostic test.
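As a minimal sketch of how such external severities interact with the ROC curve, the following Python snippet picks the operating point with the lowest expected misclassification cost. The three ROC points, the prevalences and the cost ratios are all hypothetical, and the function names are my own.

```python
def expected_cost(sensitivity, specificity, prevalence, cost_fn, cost_fp):
    """Expected misclassification cost per tested person."""
    false_neg = (1.0 - sensitivity) * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return cost_fn * false_neg + cost_fp * false_pos


def best_operating_point(roc_points, prevalence, cost_fn, cost_fp):
    """Return the (label, sensitivity, specificity) tuple with minimal cost."""
    return min(
        roc_points,
        key=lambda p: expected_cost(p[1], p[2], prevalence, cost_fn, cost_fp),
    )


# Hypothetical points on one ROC curve: (label, sensitivity, specificity)
ROC_POINTS = [
    ("low cutoff", 0.99, 0.60),
    ("medium cutoff", 0.90, 0.90),
    ("high cutoff", 0.60, 0.99),
]

# Scenario A: missing a case is judged 50 times worse than a false alarm
choice_a = best_operating_point(ROC_POINTS, prevalence=0.20, cost_fn=50, cost_fp=1)

# Scenario B: rare disease, a missed case is judged only 5 times worse
choice_b = best_operating_point(ROC_POINTS, prevalence=0.01, cost_fn=5, cost_fp=1)

print("scenario A:", choice_a[0])
print("scenario B:", choice_b[0])
```

The same curve leads to the low cutoff in scenario A and to the high cutoff in scenario B. The costs come from outside the data, exactly as Hand says, but the ROC curve is what lays out the available options.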

Back from the Harz

Due to a thankfully mild eye infection I caught at the end of my vacation and the kids having their birthdays, it has taken quite a bit longer than I announced before, but I am now able to continue developing fbroc and writing. Fortunately, our vacation in the Harz was great, and being well rested always helps my motivation.

The Harz is the highest mountain region (still not that high) in the northern part of Germany. In case you are going there (or are considering it), here are some suggestions about what to do.

  1. Visit some mines. The Harz has a history of iron ore mining and every little town seems to have its own mine. It is very interesting to see how people worked there, and there are often beautiful mineral formations as well. When you consider the working conditions and the resulting life expectancy, it makes you feel grateful to live here and now. Since you often have to stoop, I wouldn’t recommend it if you have trouble with your back and are of above-average height. Most of the mines also have museums that give the historical background.

  2. Less obviously for a mountain region, there are plenty of opportunities to go swimming. There are many lakes and small rivers, a lot of them built to supply the mines with water to drive the wheels. Some of them are now open for bathing (often free of charge). This was very nice on some of the hotter (>30 degrees Celsius) days.

  3. Above all, do some hiking. As an additional motivation, there is a stamp system, the Harzer Wandernadel, where you are rewarded with various badges if you collect enough stamps by hiking to various locations. As usual, this gamification also helps to motivate children. The stamps are usually well placed, so that you are rewarded with a great view or a visit to a historically significant site.

There are also many other options, like visiting Goslar or the highest mountain in the region, the Brocken (1141 m). In addition, there are also many gondola lifts, summer toboggan runs and some nice castles. All in all, it’s well worth one or even several visits.

Castle in Wernigerode