A guide to surviving the analysis of survival data¶

Table of contents¶

The survival data problem – to survive or not to survive
Simulating survival data
Survival data – first look
The path to self destruction – simple stats on survival data
Drawing conclusions – so which one is better: Genisys or Cyberdyne?
Proper way to analyze survival data
Common misconceptions and traps to avoid
Final conclusion

The survival data problem – to survive or not to survive¶

Time after time, place after place, I see the same errors and misconceptions repeated in analysis of Survival data. I see so many people get this wrong – people who are smarter than me… Eventually I started to question my own understanding. Perhaps it is me, after all, who got it wrong? Maybe it is something simple that everybody knows and I don’t?

There is only one way to solve this dilemma: by analyzing actual data both ways: the popular way, and the way I think is right.

But first, we need a set of data to analyze: survival data with full information about every subject. Fortunately, it is not too difficult to simulate such a data set using R and the library “coxed”.

Let us go ahead and do that.

Simulating survival data¶

First step: I need to decide what I am going to actually simulate. Here’s the story that I made up for this purpose:

A mid-sized corporation is setting up their own data center. They’ve got 25 servers, and each of the servers can accommodate 12 hard drives. That makes a total of 300 hard drives they need to purchase. Obviously, they’re looking for good quality drives that will last a long time, on which their data would be safe.

There are two vendors of hard drives with good reputation: Genisys and Cyberdyne. The company decides to purchase a batch from each vendor: 180 from Genisys and 120 from Cyberdyne, and evaluate the quality along the way – in terms of time-to-failure. Of course the HDs have identical capacity and all other parameters.

The 25 servers are assembled, placed in a specially prepared room and mounted in 5 racks – 5 servers in each rack. Everything is connected, switched on, and the evaluation begins. Each hard disk failure is carefully noted. In practice, each failed hard drive would be replaced, but I’m not interested in that. My focus is entirely on the first batch of 300 hard drives that were initially purchased.

OK, let’s do it! First I need to launch R (of course with all required libraries installed – topic not covered here), and set up the environment:

library("coxed")

The simulation involves generation of random numbers. In case anyone wanted to repeat my simulation to verify if I did not cheat or something, I am going to anchor the random number generator to start with the same sequence every time it is run – this even works across PC and Mac computers.

set.seed(102)

I’m now ready to simulate the data: for 300 hard drives, with failure times between 0 and 80 months (I thought that would be realistic).

simdata <- sim.survdata(N=300, T=80, num.data.frames=1)
survsim.plot(simdata)

TableGrob (2 x 1) "arrange": 2 grobs
  z     cells    name           grob
1 1 (1-1,1-1) arrange gtable[layout]
2 2 (2-2,1-1) arrange gtable[layout]

Looking at the “simulated durations” (last histogram) it seems to me that there are too many failures in the first 18 months – this makes the data a bit unrealistic in my scenario, and I want this to be as much realistic as possible. To fix this, I could do two things:

Play with the set.seed() numbers and test until I see a distribution that suits me
Adjust the existing distribution

I’ve tried both approaches, and ended up doing #2, as it is quicker. I shifted the failure times from 0-20 months to ~70 months. To make this perfect, separately I also moved those from 20-50 months again to ~70 months. In the end, I also took care of some “stray” failure times that ended up above 120 months.

a = simdata$data
a$month = a$y
a$month[a$month<20] = a$month[a$month<20]+round(rnorm(nrow(a[a$month<20,]), 70,35),0)
a$month[a$month<50] = a$month[a$month<50]+round(rnorm(nrow(a[a$month<50,]), 20,10),0)
a$month[a$month>120] = a$month[a$month>120]-100
hist(a$month, breaks=30, xlim=c(0,120), xlab="Failure month", main="")

So, here is how the simulated failure times look like. We have a couple of failures under 24 months (2 years), which is realistic, and the majority of failures happening between 50-60 months (4-5 years). We also have a tail of longer-lasting drives, with the last ones failing at 10 years. That distribution is good enough for me.

We now need some accompanying information: hard drive brand, and server and rack in which it is located. Let’s add these:

a$brand = "Genisys"
a$brand[181:300] = "Cyberdyne"
a$brand = factor(a$brand)

a$server = rep(LETTERS[1:25], each=12)
a$server = factor(a$server)

a$rack = rep(paste("R",1:5, sep=""), each=60)
a$rack = factor(a$rack)

At this point we have a complete set of survival data with full information about every subject – a hard drive in this case. For each hard drive we know the brand, server location – including rack, and the month of failure.

head(a[,6:ncol(a)])

Such a data set would actually be quite easy to analyze without employing any sophisticated techniques. So what’s the hype all about?

Well, the unique taste of survival data comes from “censoring” – that means: loss of follow up of certain subjects. I don’t have that in my simulated data yet, however in practice it is very common that a subject drops out of the study before it “fails”, and further fate of that subject is simply unknown.

Let’s spruce up our simulated data with some censoring. Let’s imagine an event that would make our mid-sized corporation replace certain hard drives before they failed, or an event that would cause them to fail – but not on their own, like something not covered by warranty.

How about a fire or flooding?

OK, let’s say that the roof over the server room happened to leak after a huge storm at 41 months into the evaluation, and unfortunately rack “R3” was affected. A couple of servers in that rack got wet, there was a short circuit and a power surge. In the end, the company decided to replace all servers and hard drives in that rack. Let’s not worry about a data loss (it must have happened), and focus just on the hardware failure data. Since the drives did not fail on their own, due to manufacturing, they essentially survived 41 months and are “censored” at this point:

a$censored = a$month
a$censored[a$rack=="R3" & a$month>41]= 41
head(a[a$rack=="R3",6:ncol(a)])
nrow(a[a$rack=="R3" & a$month>=41,])

Therefore, in the rack “R3” the maximum observed life time of any hard drive would be 41, as shown by the variable censored. In total, there are 48 hard drives “censored” – that is removed from evaluation before they failed.

OK.

Let’s say that at present time we are at month 72 of the evaluation, so all hard drives that have not failed to date are “survivors”. The maximum observed life time of any hard drive is naturally 72 months. Let’s put that in the data:

a$censored[a$censored>72] = 72

a$simulated = a$month
a$simulated[a$simulated>72] = 72

Just a note: above I am creating two “failure time” variables. Both end at 72 months, but one contains censored data (rack “R3” replaced at 41 months) and the other contains full data (as if nothing happened to rack “R3”) so that I can compare them later.

To complete my preparation work on the simulated data set, it is time now to calculate the status of each hard drive from the original simulated data:

# Study status at the moment of evaluation
a$status = "failed"
a$status[a$month>=72] = "survived"
a$un_status = a$status
a$un_status = factor(a$un_status)

# Add censoring
a$status[a$month>=41 & a$rack=="R3"] = "survived"
a$status = factor(a$status)

And this completes my work on preparing the data for analysis. Let’s have a look at it.

Survival data – first look¶

Let’s see how this looks like with basic stats, and also compare the simulated and observed failure times:

summary(a[,6:ncol(a)])
plot(a$censored~a$simulated, xlab="Simulated failure time [month]", ylab="Observed failure time [month]")

     month              brand         server    rack       censored    
 Min.   :  5.00   Cyberdyne:120   A      : 12   R1:60   Min.   : 5.00  
 1st Qu.: 47.00   Genisys  :180   B      : 12   R2:60   1st Qu.:41.00  
 Median : 56.00                   C      : 12   R3:60   Median :53.00  
 Mean   : 58.17                   D      : 12   R4:60   Mean   :52.47  
 3rd Qu.: 65.00                   E      : 12   R5:60   3rd Qu.:63.25  
 Max.   :120.00                   F      : 12           Max.   :72.00  
                                  (Other):228                          
   simulated          status       un_status  
 Min.   : 5.00   failed  :211   failed  :251  
 1st Qu.:47.00   survived: 89   survived: 49  
 Median :56.00                                
 Mean   :55.03                                
 3rd Qu.:65.00                                
 Max.   :72.00

It is clear from the above plot that what I actually observe (observed failure time) is quite different from the underlying reality (simulated failure time), and the reason for that difference is censoring. And keep in mind this is the simplest possible case: with just one censoring event. In real-life subjects are “censored” at random times and for various reasons, mostly unknown.

To deal with such situations is why the survival analysis methods were invented. But before I get there, let’s take a bit more look at some simple stats and compare censored (observed) vs un-censored (simulated) data.

The path to self destruction – simple stats on survival data¶

Yes, first of all, let’s keep things simple. Simple things are easy to do, easy to explain and easy to grasp by broad audience.

In practice many researchers like to present the survival information by survival status (failed/survived) at the end of their study. For example, they may want to show the final survival numbers for each of the evaluated groups – just to summarize the data. Here is how this would look like in my case:

table(a$brand, a$status)

           
            failed survived
  Cyberdyne     98       22
  Genisys      113       67

Now that is really something! The conclusion just pops into my eyes after the first glance on the numbers. Isn’t it obvious that only about 1/5 of Cyberdyne drives “survived”, compared to over 1/2 of Genisys drives?

We should be buying Genisys only from now on!

Someone might say: wait a second! That might be too easy, let’s actually look at survival times. Of course under the evaluation conditions I only have the “censored” times available to me. Here is how a simple table would look like:

mesdcip(a$censored, list(a$brand), all=F)

Uh oh! Now it looks like Cyberdyne drives worked on average 6 months longer (10 months if we go by median) than Genisys.

It turns out Genisys is a disaster – we should be buying Cyberdyne drives only from now on! Right?

Well, I don’t know about you, but I am getting a bit confused by now. Are we really looking at the same data? Two valid statistical methods and two totally different conclusions? Something smells here…

Perhaps I need to dig a bit deeper. Let’s see how this looks by survival status:

mesdcip(a$censored, list(a$brand, a$status), all=F)

Well, clearly even the Cyberdyne failures had longer survival time than Genisys survivors! In fact, Genisys failures had better survival time than Genisys survivors!

How can this be!?

To illustrate what it is that smells with this approach, let’s take a peek at the original un-censored data. Keep in mind that this is usually not available to researchers – this is what would have happened if there were no such things like floods and fires.

mesdcip(a$simulated, list(a$brand, a$un_status), all=F)

Comparing the two tables I now see a big difference in number of survivors and survival time for Genisys “survivors”. Note that both Genisys and Cyberdyne survival times for “failures” are also less different now.

The question that comes in naturally: why?

Let’s think about number of survivors and failure time separately.

Number of survivors. In the censored data for Genisys it looks like I have 153-113=40 more failed hard drives, and consequently 67-27=40 less survivors. This many hard drives: 40, were simulated to fail between 41 and 72 months – located in the rack “R3”, and all of this information is lost in the censored data set.
Survival/failure time. In the censored table I have a large group of 48 hard drives with survival time = 41 – the time of censoring due to water leak, and this brings down the average observed survival time.

All of this just because of censoring!

Drawing conclusions – so which one is better: Genisys or Cyberdyne?¶

The quick answer is: neither. They’re essentially the same – I created the data this way! Remember that I used the same hazard function and a single distribution of failure times for both manufacturers. In fact, the failure data was simulated first and manufacturer info was added later. There is no way the failure times would be substantially different between the brands.

Let’s use a simple statistic again, but again take a peek at the un-censored, simulated data:

mesdcip(a$simulated, list(a$brand), all=F)

If you go back to the same table made using censored data [13] you will notice a significant difference between the manufacturers, with p-value of 3.26E-5. Very significant.

However using the un-censored data I got a p-value of 0.25 – not significant at all!

Then where would the significant difference in the censored data come from? Since it does not come from the underlying reality, it must be due to censoring!

Didn’t I already say that?

Proper way to analyze survival data¶

Yes, you guessed it – it is the “survival analysis”. This is exactly why it was invented: because the normal stats failed in the face of censoring.

There are many methods in the “survival analysis” domain. I am going to use Cox Proportional Hazards model, which is a very robust method (I encourage everyone to read more on it), and is easily accessible to R programmers with the coxph function that works on a survival object. Let’s compare the two brands using this technique:

fit = coxph(Surv(censored, status=="failed")~strata(brand), data=a, cluster=rack)
plot(survfit(fit), xlab="Time [month]", ylab="Survival", col=1:2)

In the above plot the two lines represent failures of Genisys and Cyberdyne brand hard drives. It doesn’t actually matter which is which, because the lines mostly overlap. As you see, the Cox proportional hazard method nicely took care of the censored records. No matter how you look at this, you can’t really say that one brand is better than the other – and this is exactly how I simulated it.

**Let this serve as an illustration of why you should not do regular stats on survival data.**

Common misconceptions and traps to avoid¶

Let me start by taking a look at a figure illustrating the status of all 300 hard drives throughout the duration of the evaluation (72 months). The graph shows three colored areas, but please try to imagine monthly “columns” in it.

Bear with me here, as I need to go through several steps of coding to create this visualization:

# Create a fitted survival data-set
sfit = survfit(Surv(a$censored, a$status=="failed")~1, a)
z = data.frame(time=c(0,sfit$time), n=c(sfit$n.risk,41))
z$dropout = 0
z$dropout[z$time>=41] = nrow(a[a$rack=="R3" & a$month>=41,])
z$failed = 300 - z$n - z$dropout
z$survived = 300 - z$failed - z$dropout
z$sum = z$failed + z$survived + z$dropout

# Prepare area coordinates
xs  = rep(z$time, each=2)[2:(nrow(z)*2)]
ysf = rep(z$failed, each=2)[1:(nrow(z)*2-1)]
ysd = rep(300-z$dropout, each=2)[1:(nrow(z)*2-1)]

# Make the plot
plot(0,0, xlim=c(0,72), ylim=c(0,300), type="n", xlab="Time [months]", ylab="Subjects (hard drives)")
polygon(x=c(xs, rev(xs)), y=c(ysf,rep(0, length(ysf))), col=2)
polygon(x=c(xs, rev(xs)), y=c(ysf,rev(ysd)), col=3)
polygon(x=c(xs, rev(xs)), y=c(ysd,rep(300, length(ysd))), col=4)
legend(5,280, c("Survivors","Failures","Drop-outs"), fill=c(3,2,4), bg="white")

In the beginning of our evaluation (at time=0) all hard drives work – none has failed yet. We have all survivors, and the line you would trace from top to bottom of the plot (a monthly column) is all green.

This continues until month 5, when we have the first simulated failure. From that point in time on, until 41 months, we have 2 groups: survivors and failures. Meaning that each monthly column in that range is part green and a small part red.

At 41 months I lost many hard drives to follow-up, their further fate is unknown. A third group is therefore introduced: drop-outs (blue). This continues until month 72 – the point of observation from which we’re making our analysis. We still have 3 groups at this point, that is the monthly column has 3 colors: some blue, some green and quite a bit of red.

If we were to continue our evaluation until the last hard drive failed, we would end up at month 120, and we would have 2 groups: drop-outs and failures. No survivors any more.

Now, if you think in the context of the table (from [12]):

table(a$brand, a$status)

           
            failed survived
  Cyberdyne     98       22
  Genisys      113       67

and compare it with the above graph – why was this table misleading?

For one primary reason: censoring was ignored.

The summary in that table is taken by the censored status at 72 months. However, as is clear from the above plot, at 72 months we have three groups, not two! Here is how the correct table would look like:

a$censored_status = as.character(a$status)
a$censored_status[a$status=="survived" & a$rack=="R3"] = "drop-out"
a$censored_status = ordered(a$censored_status, levels=c("survived","failed","drop-out"))

table(a$brand, a$censored_status)

           
            survived failed drop-out
  Cyberdyne       22     98        0
  Genisys         19    113       48

Now we see that the numbers of survived and failed drives are roughly similar across brands, although still hinting at some difference.

What about the researchers that want to compare the “survivors” to “failures” and look for further clues there?

I now know that I must not ignore the drop-out group, so let’s add that group and see if this fixes the problem with failure time stats:

mesdcip(a$censored, list(a$brand, a$censored_status), all=F)

Well, in a way this looks better. You may be tempted to say: we don’t need ‘survival analysis’ methods after all, just need to remember that there are 3 groups, not 2.

Really? Let’s try to draw conclusions from such a table. Let’s focus for a moment on Genisys drives alone:

mesdcip(a$censored[a$brand=="Genisys"], list(a$censored_status[a$brand=="Genisys"]), all=F)

How about a conclusion: the failed drives are different from both: survived and dropped out, and the differences are statistically significant? Sounds valid based on the table, but does it make any sense?

If we analyzed other covariates (rack, server) in the data set, we could probably speculate about their influence on the risk of a drive dropping-out of evaluation…

Furthermore, all of the above stats still point to significant differences between groups, while we know that by design of this data set, there aren’t such things in it.

Let’s go back to survival analysis then and do this formally:

fit = coxph(Surv(censored, status=="failed")~brand, data=a, cluster=server)
summary(fit)

Call:
coxph(formula = Surv(censored, status == "failed") ~ brand, data = a, 
    cluster = server)

  n= 300, number of events= 211 

               coef exp(coef) se(coef) robust se     z Pr(>|z|)
brandGenisys 0.1145    1.1213   0.1386    0.1226 0.934     0.35

             exp(coef) exp(-coef) lower .95 upper .95
brandGenisys     1.121     0.8918    0.8819     1.426

Concordance= 0.516  (se = 0.016 )
Likelihood ratio test= 0.68  on 1 df,   p=0.4
Wald test            = 0.87  on 1 df,   p=0.4
Score (logrank) test = 0.68  on 1 df,   p=0.4,   Robust = 0.85  p=0.4

  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).

In the above output you only see the “Genisys” brand – this is because the “Cyberdyne” brand is used as the point of reference, for which the Hazard Ratio (“exp(coef)”) is 1. In any case, the probability that the brands are different: Pr(>|z|) is 0.35, which is not significant. And that is correct.

And in another way:

fit = survfit(Surv(censored, status=="failed")~brand, data=a, cluster=server)
fit

Call: survfit(formula = Surv(censored, status == "failed") ~ brand, 
    data = a, cluster = server)

                  n events median 0.95LCL 0.95UCL
brand=Cyberdyne 120     98     57      55      61
brand=Genisys   180    113     56      54      58

You can see that median failure times for Cyberdyne and Genisys drives are almost identical and confidence intervals almost completely overlap. This leads to the same conclusion as above: you can go ahead and buy either brand of hard drives – makes no difference at all.

Final conclusion¶

This little guide does not go into too many details on how to do proper analysis of survival data. Perhaps another time… My goal here was only to show you how not using survival methods can lead you to false, often contradicting conclusions, and conclusions that simply do not make much sense.

In the end, there is no escape from survival analysis methods. Never forget that! Don’t go the popular way, go the proper way!

If you need help, I am available for consulting.

group	N	Mean	SD	CV	Median	LCI	UCI	vs.Cyberdyne	vs.Genisys	Kruskal.p
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
Cyberdyne	120	55.98	13.14	23.48	57	28.98	72	1	0.000177071928296748	3.26224102532256e-05
Genisys	180	50.13	12.87	25.68	47	27	72	0.000177071928296748	1	3.26224102532256e-05

group	group	N	Mean	SD	CV	Median	LCI	UCI	Kruskal.p	Kruskal.p
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>
Cyberdyne	failed	98	52.38	11.85	22.63	54.5	28.43	69.58	2.1908311072076e-13	3.262241e-05
Cyberdyne	survived	22	72	0	0	72	72	72	2.1908311072076e-13	3.262241e-05
Genisys	failed	113	50.33	12.16	24.17	52	26	68.2	0.214357234187896	3.262241e-05
Genisys	survived	67	49.79	14.08	28.27	41	41	72	0.214357234187896	3.262241e-05

group	group	N	Mean	SD	CV	Median	LCI	UCI	Kruskal.p	Kruskal.p
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>
Cyberdyne	failed	98	52.38	11.85	22.63	54.5	28.43	69.58	2.1908311072076e-13	0.2489534
Cyberdyne	survived	22	72	0	0	72	72	72	2.1908311072076e-13	0.2489534
Genisys	failed	153	51.3	11.16	21.76	53	26.8	68.2	1.11758939982617e-16	0.2489534
Genisys	survived	27	72	0	0	72	72	72	1.11758939982617e-16	0.2489534

group	N	Mean	SD	CV	Median	LCI	UCI	vs.Cyberdyne	vs.Genisys	Kruskal.p
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
Cyberdyne	120	55.98	13.14	23.48	57	28.98	72	1	0.305073242301337	0.248953411035855
Genisys	180	54.41	12.68	23.31	55	27	72	0.305073242301337	1	0.248953411035855

group	group	N	Mean	SD	CV	Median	LCI	UCI	Kruskal.p	Kruskal.p
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>
Cyberdyne	survived	22	72	0	0	72	72	72	2.1908311072076e-13	3.262241e-05
Cyberdyne	failed	98	52.38	11.85	22.63	54.5	28.43	69.58	2.1908311072076e-13	3.262241e-05
Genisys	survived	19	72	0	0	72	72	72	2.14418429936659e-17	3.262241e-05
Genisys	failed	113	50.33	12.16	24.17	52	26	68.2	2.14418429936659e-17	3.262241e-05
Genisys	drop-out	48	41	0	0	41	41	41	2.14418429936659e-17	3.262241e-05

Surviving survival analysis – guide