Analyzing and Predicting Life Expectancy

1. Introduction

Life expectancy (LE) is a statistical measure of how long a person or organism is expected to live, and it is arguably the most important summary measure of health. It is readily comparable across countries and addresses the most fundamental question about health: how long can the typical person expect to live? Worldwide, the average life expectancy at birth was 71.0 years (68.5 years for males and 73.5 years for females) over the period 2010–2013, according to the United Nations World Population Prospects 2012 Revision. At the country level, many factors affect life expectancy and can be considered as predictors. With environmental issues and healthcare attracting increasing attention, this report analyzes those underlying factors and tries to predict the life expectancy range from them.

2. Description of Data Used

The data used in this report come from the World Bank Group’s World Bank Open Data (http://data.worldbank.org/). The original data set includes 214 countries and covers 50 variables from 2000 to 2013. Treating each year as a separate entry, there are 214 × 14 = 2,996 entries and 2,996 × 50 = 149,800 cells in total. After cleaning the data and dropping variables that are too incomplete, the final data set is narrowed down to 20 variables.

3. Basic Multiple Regression – 7 variables chosen

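Using forward and backward stepwise selection, and comparing candidate models by Mallows's Cp, BIC, and adjusted R², we look for the best model for predicting life expectancy in a country.

The model selection results lead to a model with the following 7 variables. The number after each variable name is its estimated regression coefficient.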
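The selection idea can be illustrated with a minimal sketch. The original analysis was done in R; the Python below is an assumed, simplified stand-in that performs forward selection only and scores models by BIC only, on synthetic toy data. The function names (`solve`, `ols_rss`, `bic`, `forward_select`) and the toy features are hypothetical and are not taken from the report.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def bic(rss, n, k):
    """Gaussian BIC up to an additive constant; k counts fitted coefficients."""
    return n * math.log(rss / n) + k * math.log(n)

def forward_select(features, y):
    """Greedily add the feature that lowers BIC most; stop when none helps."""
    n = len(y)
    chosen = []
    best = bic(ols_rss([], y), n, 1)  # intercept-only model
    while True:
        candidates = []
        for name in features:
            if name in chosen:
                continue
            cols = [features[c] for c in chosen + [name]]
            candidates.append((bic(ols_rss(cols, y), n, len(cols) + 1), name))
        if not candidates:
            break
        score, name = min(candidates)
        if score >= best:
            break
        best = score
        chosen.append(name)
    return chosen

# Synthetic toy data (hypothetical): y depends on x1 and x2 but not on "noise".
rng = random.Random(0)
n = 200
features = {
    "x1": [rng.uniform(0, 100) for _ in range(n)],
    "x2": [rng.uniform(0, 10) for _ in range(n)],
    "noise": [rng.uniform(0, 1) for _ in range(n)],
}
y = [60 + 0.2 * a + 0.5 * b + rng.gauss(0, 1)
     for a, b in zip(features["x1"], features["x2"])]
selected = forward_select(features, y)
print(selected)
```

Backward selection works the same way in reverse: start from the full model and drop the variable whose removal improves the criterion most. Cp and adjusted R² can be swapped in for `bic` as the scoring function.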

3.1. Urban Population Percentage: 0.09266439

The percentage of the total population living in urban areas is a good measure of a population's degree of urbanization. Urbanization is relevant to a range of disciplines, including geography, sociology, economics, urban planning, and public health. Generally speaking, people living in urban areas have better access to healthcare services and better infrastructure, which leads to higher LE. Although this is in line with our intuition, we might expect the relationship to weaken in the future as more rural areas reach similar levels of healthcare services and infrastructure.

3.2. Unemployment: -0.22491359

A higher level of unemployment means more people without a steady income, which leads to worse living conditions both physically and psychologically. Unemployment is also a broad indicator of a country’s economic conditions, which affect many aspects of its citizens’ LE.

3.3. Health Expenditure Percentage of GDP: 0.61587790

Health expenditure as a percentage of GDP is another indicator directly related to healthcare services. We believe that it shows both the importance of healthcare to a nation and the nation’s financial capability to provide such services. Although a higher expenditure percentage does not always translate into higher LE, it does represent a country’s focus on healthcare relative to defense, education, and other areas.

3.4. Immunization DPT Percentage of Children: 0.12299744

DPT refers to a class of combination vaccines against three infectious diseases in humans: diphtheria, pertussis (whooping cough), and tetanus. A child is considered adequately immunized against these diseases after receiving three doses of vaccine. A higher immunization rate therefore leads to a lower infection rate, which in turn leads to higher LE. Thus, the positive relationship makes sense.

3.5. Access to Improved Sanitation Facilities Percentage: 0.17457739

Improved sanitation facilities are those likely to ensure hygienic separation of human excreta from human contact. They include flush/pour flush (to piped sewer system, septic tank, pit latrine), ventilated improved pit (VIP) latrine, pit latrine with slab, and composting toilet. Access to such facilities is therefore essential to avoid the low LE caused by unsanitary living conditions.

3.6. Out-of-pocket Health Expenditure Percentage of Total Health Expenditure: 0.10956803

Total health expenditure is the sum of public and private health expenditure. Out-of-pocket spending is a major part of private health expenditure and shows how much money comes directly from individuals’ pockets. Intuitively, the lower the out-of-pocket share, the stronger the country’s social welfare system. However, a higher out-of-pocket percentage often indicates that people are devoting considerable attention and resources to healthcare, which reflects both their willingness to live a healthier life and better financial conditions. Also, where the private healthcare sector is strong, the overall industry is usually more vibrant.

3.7. Internet Users per 100 People: 0.05198543

The number of Internet users per 100 people is considered an indicator of modernization. We believe it is a composite indicator of a country’s infrastructure, technology, and education. Therefore, a high number of Internet users per 100 people is usually positively correlated with LE. However, we could also expect that, beyond some point, heavier Internet use might be negatively related to LE.
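The seven coefficients reported in Sections 3.1 to 3.7 can be read as marginal effects: holding the other variables fixed, a one-unit change in a predictor shifts the predicted LE by its coefficient. The sketch below illustrates this reading. It assumes that LE is measured in years and that each predictor is in the units its name suggests. Because the report does not list the intercept, the sketch computes only a change in predicted LE, not an absolute prediction. The dictionary keys, the function name `predicted_le_change`, and the example scenario are hypothetical.

```python
# Coefficients as reported in Sections 3.1-3.7.
# Assumption: years of LE per one-unit change in the predictor.
COEFFICIENTS = {
    "urban_population_pct": 0.09266439,
    "unemployment_pct": -0.22491359,
    "health_expenditure_pct_gdp": 0.61587790,
    "immunization_dpt_pct": 0.12299744,
    "improved_sanitation_pct": 0.17457739,
    "out_of_pocket_pct": 0.10956803,
    "internet_users_per_100": 0.05198543,
}

def predicted_le_change(deltas):
    """Change in predicted LE for the given changes in predictors,
    holding every predictor not listed in `deltas` fixed."""
    unknown = set(deltas) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown predictors: {sorted(unknown)}")
    return sum(COEFFICIENTS[name] * change for name, change in deltas.items())

# Hypothetical scenario: sanitation access rises by 10 percentage points
# and unemployment falls by 2 percentage points.
scenario = {"improved_sanitation_pct": 10.0, "unemployment_pct": -2.0}
change = predicted_le_change(scenario)
print(f"{change:.2f} years")
```

This is an associational reading of a regression fit, not a causal claim: the coefficients describe how predicted LE moves with the predictors in this data set.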

4. Predicting LE Compared with World Average (Classification)

Because LE trends upward over the years, comparing each prediction with the world average LE for the same year gives a more meaningful result.

Specifically, we construct a new data set using the following steps:

  1. Find the variables that have a high correlation with Year (plots attached in Appendix D)
  2. Find the world average each year for each of these variables. These values are compiled in “Data_Life Expectancy3_calculated average.xlsx”
  3. Express each value as a ratio to that year’s world average (e.g. in 2012, LE in the United States is 1.12 times the 2012 world average)
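The normalization in steps 2 and 3 can be sketched as follows. This is a minimal illustration, assuming the world average is the unweighted mean across the countries present in each year; the report does not say whether a simple or population-weighted average was used. The row layout, the function name `relative_to_world_average`, and the toy values are hypothetical.

```python
from collections import defaultdict

def relative_to_world_average(rows, value_key):
    """Return copies of rows with an extra '<value_key>_rel' field: the value
    divided by the mean of that value over all rows of the same year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row[value_key])
    averages = {year: sum(vals) / len(vals) for year, vals in by_year.items()}
    return [
        {**row, value_key + "_rel": row[value_key] / averages[row["year"]]}
        for row in rows
    ]

# Toy data (hypothetical countries and values).
rows = [
    {"country": "A", "year": 2012, "le": 80.0},
    {"country": "B", "year": 2012, "le": 60.0},
    {"country": "C", "year": 2012, "le": 70.0},
    {"country": "A", "year": 2013, "le": 81.0},
    {"country": "B", "year": 2013, "le": 63.0},
]
normalized = relative_to_world_average(rows, "le")
for row in normalized:
    print(row["country"], row["year"], round(row["le_rel"], 3))
```

The same function can be applied to each variable that correlates strongly with Year, one `value_key` at a time.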

We divide the LE into four categories:

  1. “lower than 90% of the world average”,
  2. “between 90% and 100% of the world average”,
  3. “between 100% and 110% of the world average”,
  4. “more than 110% of the world average”

[A difference of 10% seems small. But in this context, with an average LE of 70 years for example, there is a difference of 14 years between 90% and 110% of the average (63 vs. 77), which is a lot for a human lifetime.]
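The four categories can be expressed as a small mapping from the LE ratio to a category number. This is a sketch only: the report does not say which category the exact boundary values 90%, 100%, and 110% belong to, so the choices below are assumptions, and the function names `le_category` and `band_for_average` are hypothetical.

```python
def le_category(ratio):
    """Map LE as a ratio to the world average onto categories 1-4.
    Boundary handling is an assumption: 0.90 goes to category 2,
    1.00 and 1.10 go to category 3."""
    if ratio < 0.90:
        return 1  # lower than 90% of the world average
    if ratio < 1.00:
        return 2  # between 90% and 100%
    if ratio <= 1.10:
        return 3  # between 100% and 110%
    return 4      # more than 110%

def band_for_average(world_average):
    """LE values in years at 90% and 110% of a given world average."""
    return 0.9 * world_average, 1.1 * world_average

low, high = band_for_average(70.0)
print(f"90%: {low:.1f} years, 110%: {high:.1f} years, gap: {high - low:.1f} years")
print([le_category(r) for r in (0.85, 0.95, 1.05, 1.12)])
```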

We randomly choose 1,000 entries as training data.
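A random split of this kind can be sketched as below. Two assumptions are made here and are not stated in the report: the entries that are not sampled form the test set, and the total of 2,996 entries is the raw count from Section 2 (the number of entries remaining after cleaning is not given). The function name `split_train_test` and the seed are hypothetical.

```python
import random

def split_train_test(n_entries, n_train, seed=0):
    """Return (train_indices, test_indices) from a seeded random shuffle."""
    if not 0 <= n_train <= n_entries:
        raise ValueError("n_train must be between 0 and n_entries")
    rng = random.Random(seed)
    indices = list(range(n_entries))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# Assumption: 2,996 is the raw entry count; the cleaned count may differ.
train_idx, test_idx = split_train_test(2996, 1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so model comparisons are run on the same training and test entries.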

Among the models compared (LDA, QDA, KNN, etc.), the QDA model with all variables has the best classification performance, with an AUC of 0.9715 on the test data.

Note: the x-axis of the ROC curve shows specificity on a reversed scale, so it is effectively 1 − specificity, that is, the false positive rate (FPR).

The models with the 7 variables chosen in Part 3 have AUCs of 0.8761 and 0.8025 for QDA and LDA respectively. The KNN model has an AUC of 0.9165.
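For readers unfamiliar with AUC, the sketch below computes it for a binary problem as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The report does not state how the four-category problem was reduced to a single AUC, so the binary setting here is an assumption made for illustration, and the function name `auc` and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative),
    computed as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy scores (hypothetical): one positive case is ranked below one negative.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(scores, labels)
print(round(toy_auc, 4))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported values between 0.80 and 0.97 in context.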

5. Conclusion and Attachments

By and large, life expectancy is higher in countries where the level of modernization is higher, public health services (including infrastructure and immunization) are better, citizens’ financial capability is higher, and overall education is better.

As an overarching measure, LE and the factors behind it reflect a country’s economic, political, and environmental conditions. They also show what a country could do to improve its people’s LE. The model presented here also exemplifies an approach that could be used regionally, by local governments or NGOs, to predict and compare the outcomes of their work.

  • The R code could be downloaded here (run as rmd)
  • Dataset could be downloaded here (csv)
  • An original draft compiled in 2015 could be downloaded here