In my first article on Survival Analysis, I introduced the concept of a survival function and how we can use it to compare different demographics in the case of churn. There, the mode of transport was used to compare churn rates for employees at a fictitious company. It was evident that cyclists were much more likely to “survive” (that is, to be at lower risk of turnover) than their peers who drove by car. An important point that was omitted from the previous article, however, and is the subject of this one, is the notion of censoring in Survival Analysis.
The notion of censoring is fundamental to survival analysis and is used when computing our survival functions (more on that in the next part of the series). But what do I mean by censoring? Strictly speaking, censoring is the condition in which only part of an observation or measurement is known. Accounting for it lets us include incomplete data, where the time to event is not observed, rather than discarding it.
Examples include a president dying in office, or someone leaving a medical study before the study formally concludes. The latter shows why censoring matters so much for the analysis of medical trials, but in both cases the underlying principle is the same: we made observations up to a given time, but we could not measure the event of interest. If a president dies after one year in office, how can we possibly know whether they would have served two terms?
There are different types of censoring. Two commonly discussed ones are left and right censoring (two others that come to mind are interval censoring and random censoring, but they are not discussed here).
You can think of this as events that happened to the left of time (in the past) are left censored, and events that may happen to the right of time (in the future) are right censored.
In the case of turnover, we are only considering right censoring, where a person may leave at some point in the future, but we don’t know when they will (if at all). Hopefully, the diagram below will help demonstrate this.
In the above example, we have 10 subjects in a medical study that begins at time t=0 and ends at time t=20 (don’t worry about units in this example, you can imagine weeks, months, years if it helps). Each subject is recorded until either the event happens (circle) or the end of the study is reached (the black vertical line at t=20).
As you can see, we observe the event during the study for the red subjects, while the blue lines represent participants for whom no event occurred during the study period. Notice that some of the blue lines do end before the current time, but only after the end of the study period. This is the critical thing: they are right-censored! If we did not account for this in our analysis, we would be underestimating the true average time to event for our subjects.
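To make the underestimation concrete, here is a minimal sketch in Python. The event times are invented for illustration (they are not the values behind the diagram), and `observe` is a hypothetical helper that records each subject the way a study ending at t=20 would.

```python
STUDY_END = 20  # the study stops observing at t=20

# Hypothetical true event times for 10 subjects (made up for illustration).
true_event_times = [4, 7, 9, 12, 15, 18, 23, 27, 31, 40]


def observe(true_time, study_end=STUDY_END):
    """Return (duration, event_observed) as the study would record them."""
    if true_time <= study_end:
        return true_time, True   # event seen during the study
    return study_end, False      # right-censored at the end of the study


records = [observe(t) for t in true_event_times]
durations = [d for d, _ in records]
observed = [e for _, e in records]

# The average we could only know with unlimited follow-up.
true_mean = sum(true_event_times) / len(true_event_times)

# Naive approach: drop the censored subjects and average the rest.
event_durations = [d for d, e in records if e]
naive_mean = sum(event_durations) / len(event_durations)

print(f"true mean:  {true_mean:.2f}")
print(f"naive mean: {naive_mean:.2f}")
```

Dropping the censored subjects throws away exactly the people who lasted longest, which is why the naive mean comes out lower than the true one. Keeping the pair (duration, event_observed) for every subject is what lets survival methods use that information.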
Another example, which is much more fitting in today’s climate, is one that concerns virus testing. Let us imagine that some proportion of the population has been exposed to a virus and individuals are tested at a given point in time to see whether they have the virus or not. We will assume that these tests are
Now we can say that people with a positive test have been exposed to the virus at some point leading up to the test, but we don’t know exactly when they contracted the virus. Therefore, they are left-censored, since the event is when the individual contracted the virus, not the positive test. Similarly, anyone who tests negative is right-censored. In this rather unique case, our dataset is filled with only left- and right-censored cases: we never observe the event directly, and for each individual we only have an upper bound (positive test) or a lower bound (negative test) on the time of contracting the virus.
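The bounds can be written down explicitly. The sketch below is only an illustration: the function names and test records are hypothetical, the tests are treated as perfectly accurate, and t=0 is assumed to be the earliest possible time of infection.

```python
import math


def censoring_type(positive):
    """Positive test: infection already happened (left-censored).
    Negative test: infection has not happened yet (right-censored)."""
    return "left" if positive else "right"


def infection_time_bounds(test_time, positive):
    """Return (lower, upper) bounds on the unobserved infection time.

    Assumes a perfectly accurate test and t=0 as the earliest possible
    infection time; both are simplifying assumptions for this sketch.
    """
    if positive:
        return (0.0, test_time)    # infected some time before the test
    return (test_time, math.inf)   # infected after the test, if ever


# Hypothetical test records: (person, time of test, result).
tests = [("A", 5.0, True), ("B", 8.0, False), ("C", 12.0, True)]

for person, test_time, positive in tests:
    bounds = infection_time_bounds(test_time, positive)
    print(person, censoring_type(positive), bounds)
```

No record ever contains the infection time itself, only the side of the test on which it falls.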
In reality, the situation becomes even more complex given the testing accuracy, and we would need to consider interval censoring as well.
Conversely, if we define the event as the positive test, then we have no left censoring and are back to the case described previously: we observe the event directly, and all negative tests are right-censored, as shown in the diagram below.
As much as virus outbreaks and clinical trials are relevant today, it is not immediately clear how this translates to turnover. In actuality it is basically the same process except for a few things.
Let me illustrate this with an example equivalent to the medical study, but this time for turnover.
This diagram represents absolute time (from the date the company was founded until the present day). Employees join and leave the company at different times (which is how it works in reality).
We can see that the company has just reached its 20 year anniversary but unfortunately one of the founders left a year before this landmark (top red line). We also notice that other employees come and go during the 20 years with some new starters still present at the company. The company, therefore, started with 3 employees (founders), currently has 5 employees still present, and has had a total of 9 over its entire history (not the biggest company in the world). The maximum possible tenure is held by two of the founders at 20 years.
In this example, anyone still present at the company at its 20-year anniversary party is right-censored, while people who left before (and missed a great party) are not censored: for them we observe the “death” event.
Now, we can translate the previous diagram, which is measured in absolute time, into a slightly different representation, instead taking our time measure as relative – which we understand as tenure.
Using tenure instead of company age as our x-axis then allows us to apply survival analysis to turnover whilst taking into account right censoring!
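The change from absolute time to tenure can be sketched in a few lines of Python. The employee names, dates, and the `to_tenure` helper are all hypothetical and are not taken from the diagrams. The only real assumption is that we know each person's start date and, if they left, their leave date.

```python
from datetime import date

OBSERVATION_DATE = date(2020, 1, 1)  # hypothetical "present day"

# Hypothetical records: (name, start date, leave date or None if still here).
employees = [
    ("founder_a", date(2000, 1, 1), None),
    ("founder_b", date(2000, 1, 1), date(2019, 1, 1)),
    ("hire_c", date(2012, 6, 1), date(2015, 6, 1)),
    ("hire_d", date(2017, 1, 1), None),
]


def to_tenure(start, end, observed_until=OBSERVATION_DATE):
    """Convert absolute dates to (tenure in years, event_observed).

    event_observed is False for people still employed at the observation
    date, meaning their tenure is right-censored.
    """
    left = end is not None and end <= observed_until
    stop = end if left else observed_until
    tenure_years = (stop - start).days / 365.25
    return tenure_years, left


for name, start, end in employees:
    tenure, left = to_tenure(start, end)
    status = "left" if left else "right-censored"
    print(f"{name}: {tenure:.1f} years ({status})")
```

Every employee now starts at tenure zero, and the pair (tenure, event_observed) is the input format survival analysis tools typically expect.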
That concludes part 2 of the series on survival analysis. I hope by now I am starting to convince you that this old school method is not too bad 😉
With a PhD in Particle Physics and a proven track record of delivering physics and maths-based applications, Thomas is a strong engineering professional with a mission to bring data into the world of HR.