---
title: "Analyzing and Predicting Life Expectancy"
author: "Shiqi She"
date: "December 18, 2015"
output: word_document
---

## Introduction
Life expectancy is a statistical measure of how long a person or organism may live and is probably the most important measure of health. It is readily comparable across countries and asks the most fundamental question concerning health: how long can the typical person expect to live? Worldwide, the average life expectancy at birth was 71.0 years (68.5 years for males and 73.5 years for females) over the period 2010–2013 according to United Nations World Population Prospects 2012 Revision. On the country level, there are a lot of factors that would affect the life expentancy and could be considered as predictors. With environmental issues and healthcare catching an increasing amount of attention, this report is to analyze those underlying factors and try to predict the life expectancy range given those parameters.

## Description of Data Used
The data used in this report comes from World Bank Group (http://www.worldbank.org/)'s World Bank Open Data (http://data.worldbank.org/). The ordinary data set includes 214 countries, covering 50 variables from 2000 to 2013. We will first treat Year as a variable, therefore we will have 214x14=2996 entries and each entry will have 51 variables. The description of the 50 variables is provided by World Bank Open Data and is attached "Definition and Source" sheet in Data_Life Expectancy2_raw.xlsx. However, among the 2996x50=149800 cells, nearly half (70256) are "..", which indicates n.a. value. Due to the high degree of incompleteness of some variables, we can not simply delete data entries that contain n.a. value, which will need to delete all! Therefore, we have to delete some variables with low completeness. The entire process is described in Appendix A, which gives us a data set with no null values (Data_Life Expectency2_no null.csv: 1779 entries and 20 variables including Country.Name and Year). 

## Part I Data Exploring
We will start by exploring the data set produced by Appendix A.
```{r}
data=read.csv("Data_Life Expectancy2_no null.csv", header=TRUE)
dim(data)
sum(is.na(data))
# change names of variables
# we will denote Life Expenctancy by LE for ther rest of this report
colnames(data)[2] <- "LE"
colnames(data)[6] <- "Urban.population.percentage"
colnames(data)[7] <- "Household.consumption.expenditure"
colnames(data)[9] <- "Forest.area.percentage"
colnames(data)[10] <- "Access.to.improved.water.source.percentage"
colnames(data)[12] <- "Arable.land.per.capita"
colnames(data)[13] <- "Health.expenditure.percentage.of.GDP"
colnames(data)[14] <- "Immunization.DPT.percentage.of.children"
colnames(data)[15] <- "Access.to.improved.sanitation.facilities.percentage"
colnames(data)[16] <- "Immunization.measles.percentage.of.children"
colnames(data)[17] <- "Out.of.pocket.health.expenditure.percentage.of.total.health.expenditure"
# summary(data) is attached in Appendix B
```

Here is the basic exploration of each variable and the plots are attached as Appendix C:

most countries have GDP.per.capita below $10000

most countries have Population.density below 500 people per sq km of land area

Urban.population.percentage looks like a normal distribution

most countries have total Household.consumption.expenditure below $500000

Unemployment looks like negatively related to LE

most countries have Forest.area.percentage less than 40%

most countries have Access.to.improved.water.source.percentage more than 95%

most countries have Food.production.index around 100

most countries have Arable.land.per.capita less than 0.5

Health.expenditure.percentage.of.GDP looks like positively related to LE

most countries have Immunization.DPT.percentage.of.children more than 90%

most countries have Access.to.improved.sanitation.facilities.percentage more than 90%

most countries have Immunization.measles.percentage.of.children more than 90%

Out.of.pocket.health.expenditure.percentage.of.total.health.expenditure is rather diperse, but looks like there is a little negative effects on LE

rather flat except high frequency for values lower than 5 for Fixed.telephone.subscriptions.per.100.people, Mobile.cellular.subscriptions.per.100.people and Internet.users.per.100.people

LE is growing slowly each year

## Part II Basic Multiple Regression
As mentioned above, our first analysis is to find a basic multple regression model, treating Year as a normal variable. The goal is to identify significatn factors and their relationship with LE.
```{r}
library(leaps)

# choose the appropriate model using forward method
fit.forward=regsubsets(LE~.-Country.Name, data, nvmax=20, method="forward")
f.f=summary(fit.forward)
par(mfrow=c(3,1))
plot(f.f$cp, xlab="Number of predictors", ylab="cp", col="red", type="p", pch=16)
plot(f.f$bic, xlab="Number of predictors", ylab="bic", col="blue", type="p", pch=16)
plot(f.f$adjr2, xlab="Number of predictors", ylab="adjr2", col="green", type="p", pch=16)
par(mfrow=c(1,1))

# choose the appropriate model using backward method
fit.backward=regsubsets(LE~.-Country.Name, data, nvmax=20, method="backward")
f.b=summary(fit.backward)
par(mfrow=c(3,1))
plot(f.b$cp, xlab="Number of predictors", ylab="cp", col="red", type="p", pch=16)
plot(f.b$bic, xlab="Number of predictors", ylab="bic", col="blue", type="p", pch=16)
plot(f.b$adjr2, xlab="Number of predictors", ylab="adjr2", col="green", type="p", pch=16)
par(mfrow=c(1,1))
coef(fit.forward, 7)
coef(fit.backward, 7)

# compare the two methods
plot(f.f$rsq, ylab="rsq", col="red", type="p", pch=16,
     xlab="Forward Selection")
lines(f.b$rsq, ylab="rsq", col="blue", type="p", pch=16,
      xlab="All Subset Selection")
```

Based on cp, bic and adjusted r2, we focus on the model with 7 variables, which we believe shows a good balance of sophisticatin and explanation power. Here is the detailed anaysis of the model:

#### Urban.population.percentage: 0.09266439
The percentage of the total population living in urban areas is a good measure of the degree of urbanization of a population. Urbanization is relevant to a range of disciplines, including geography, sociology, economics, urban planning, and public health. Generally speaking, people living in urban areas usually have more access to healthcare services and better infrastructure, which leads to higher LE. Although this is in line with our intuition, we might expect some change in the future as most rural areas have similiar level of healthcare services and infrastructure.

#### Unemployment: -0.22491359
As we have seen before, unemployment rate show negative coorelation with LE. This makes sense since higher level of unemployment indicates more people with no constant income, which will lead to worse living condiction both physically and pychologically. 

#### Health.expenditure.percentage.of.GDP: 0.61587790
Health expenditure percentage of GDP is another indicator directly related to healthcare services. We believe that it shows both its importance to a nation and a nation's financial capability to provide such services. Although higher expenditure not always reflects higher LE, a positive relationship is still expected here. 

#### Immunization.DPT.percentage.of.children: 0.12299744
DPT refers to a class of combination vaccines against three infectious diseases in humans: diphtheria, pertussis (whooping cough), and tetanus. A child is considered adequately immunized against these diseases after receiving three doses of vaccine. Therefore, higher perentage of immunization leads to lower percentage of being infected, which leads to higher LE. Thus, the positive relationship makes sense.

#### Access.to.improved.sanitation.facilities.percentage: 0.17457739
According to the description, improved sanitation facilities are likely to ensure hygienic separation of human excreta from human contact. They include flush/pour flush (to piped sewer system, septic tank, pit latrine), ventilated improved pit (VIP) latrine, pit latrine with slab, and composting toilet. Therefore, it is highly likely to improve the LE.

#### Out.of.pocket.health.expenditure.percentage.of.total.health.expenditure: 0.10956803
Total health expenditure is the sum of public and private health expenditure. Out-of-pocket is a major part of private health expenditure, showing the amount of money coming from individuals' pockets. Intuitively, the lower the level of out-of-pocket health expenditure, the higher the country's social wellfare and the lower the financial burden of healthcare on an individual, which will result in longer LE as expected. However, as this indicator is a just percentage, it is possible that a country with higher out-of-pocket health expenditure percentage still has a higher absolute value of health expenditure per capita.

#### Internet.users..per.100.people: 0.05198543
Percentage of Internet users is considered as a indicator of modernization. We believe it is an integrated indicator of a country's infrastruture, techonology and education. Therefore, a high level of Internet users per 100 people usually has a positive coorelation with LE.
    
## Part III Predicting the LE Compared with World Average
As we have seen in Part I, there is an increasing trend in LE versus Year. Thus, we could provide a more meaningful result by comparing the prediction with the world average.

To be more speficic, we construct a new data set using the following steps:

1.Find the variables that have a high correlation with Year (plots attached in Appendix D)

2.Find the world average each year for each of these variables (e.g. world average of LE in 2000 is 81.32978723). These values are compiled in "Data_Life Expectancy3_calculated average.xlsx"

3.Modify the values to be the percentage compared with that year's world average (e.g. in 2012, United States is 1.12 times 2012 world average of LE)
    
Therefore, we have the new data set called "Data_Life Expectancy2_no null_percentage of average.csv"

Then, we could read the date, change some colomns' names and separate the data into train data and test data (randomly select 1000 entries as train data).

```{r}
data1=read.csv("Data_Life Expectancy2_no null_percentage of average.csv", header=TRUE)
data2=data1[,-c(2, 5, 12, 17, 19, 21, 24, 26, 28)]
colnames(data2)[2] <- "LE"
colnames(data2)[4] <- "GDP.per.capita"
colnames(data2)[6] <- "Urban.population.percentage"
colnames(data2)[7] <- "Household.consumption.expenditure"
colnames(data2)[9] <- "Forest.area.percentage"
colnames(data2)[10] <- "Access.to.improved.water.source.percentage"
colnames(data2)[12] <- "Arable.land.per.capita"
colnames(data2)[13] <- "Health.expenditure.percentage.of.GDP"
colnames(data2)[14] <- "Immunization.DPT.percentage.of.children"
colnames(data2)[15] <- "Access.to.improved.sanitation.facilities.percentage"
colnames(data2)[16] <- "Immunization.measles.percentage.of.children"
colnames(data2)[17] <- "Out.of.pocket.health.expenditure.percentage.of.total.health.expenditure"
colnames(data2)[18] <- "Fixed.telephone.subscriptions.per.100.people"
colnames(data2)[19] <- "Mobile.cellular.subscriptions.per.100.people"
colnames(data2)[20] <- "Internet.users.per.100.people"

N=length(data2$LE)
set.seed(10)  # set a random seed so that we will be able to reproduce the random sample
index.train=sample(N, 1000) # Take a random sample of n=1000 from 1 to N=1779
data2.train=data2[index.train,] # Split the 1000 randomly chosen subjects as a training data
data2.test=data2[-index.train,] # The remaining subjects will be reserved for testing
```
  We also divide the LE into four categories: 
"lower than 90% of the world average", 
"between 90% to 100% of the world average", 
"between 100% and 110% of the world average", 
"more than 110% of the world average"
```{r}
data2.train$LE=cut(data2.train$LE, br = c(0, 0.9, 1, 1.1, max(data2.train$LE)))
summary(data2.train$LE)
data2.train$LE=as.factor(data2.train$LE)
levels(data2.train$LE) <- c("lower than 90% of the world average", "between 90% to 100% of the world average", "between 100% and 110% of the world average", "more than 110% of the world average")

data2.test$LE=cut(data2.test$LE, br = c(0, 0.9, 1, 1.1, max(data2.test$LE)))
summary(data2.test$LE)
data2.test$LE=as.factor(data2.test$LE)
levels(data2.test$LE) <- c("lower than 90% of the world average", "between 90% to 100% of the world average", "between 100% and 110% of the world average", "more than 110% of the world average")
```

We then build different classifiers and try to find the best one. Selected models are presented here:
```{r}
library(MASS)
library(pROC)
# The LDA model with variables selected in Part II
lda.fit1=lda(LE~Urban.population.percentage+Unemployment+Health.expenditure.percentage.of.GDP+Immunization.DPT.percentage.of.children+Access.to.improved.sanitation.facilities.percentage+Out.of.pocket.health.expenditure.percentage.of.total.health.expenditure+Internet.users.per.100.people, data2.train) 
lda.fit.predict1=predict(lda.fit1, data2.test)
lda.erro1=sum(data2.test$LE != lda.fit.predict1$class)/length(data2.test$LE)
lda.erro1
lda.fit.roc1=roc(data2.test$LE, lda.fit.predict1$posterior[, 2], plot=T)
auc(lda.fit.roc1)

# The QDA model with variables selected in Part II
qda.fit1=qda(LE~Urban.population.percentage+Unemployment+Health.expenditure.percentage.of.GDP+Immunization.DPT.percentage.of.children+Access.to.improved.sanitation.facilities.percentage+Out.of.pocket.health.expenditure.percentage.of.total.health.expenditure+Internet.users.per.100.people, data2.train) 
qda.fit.predict1=predict(qda.fit1, data2.test)
qda.erro1=sum(data2.test$LE != qda.fit.predict1$class)/length(data2.test$LE)
qda.erro1
qda.fit.roc1=roc(data2.test$LE, qda.fit.predict1$posterior[, 2], plot=T)
auc(qda.fit.roc1)

# The full LDA model
lda.fit=lda(LE~.-Country.Name-Year, data2.train)
lda.fit.predict=predict(lda.fit, data2.test)
lda.erro=sum(data2.test$LE != lda.fit.predict$class)/length(data2.test$LE)
lda.erro
lda.fit.roc=roc(data2.test$LE, lda.fit.predict$posterior[, 2], plot=T)
auc(lda.fit.roc)

# The full QDA model
qda.fit=qda(LE~.-Country.Name-Year, data2.train)
qda.fit.predict=predict(qda.fit, data2.test)
qda.erro=sum(data2.test$LE != qda.fit.predict$class)/length(data2.test$LE)
qda.erro
qda.fit.roc=roc(data2.test$LE, qda.fit.predict$posterior[, 2], plot=T)
auc(qda.fit.roc)

# The KNN model
library(class) 
knn.pred=knn(scale(data2.train[, -c(1,2,3)]), scale(data2.test[, -c(1,2,3)]), data2.train[, 2], k=30, prob=TRUE) 
knn.testing.erro=sum(data2.test$LE != knn.pred)/length(data2.test$LE)
knn.testing.erro
posterior=attributes(knn.pred)$prob
knn.fit.roc=roc(data2.test$LE, posterior, plot=T)
auc(knn.fit.roc)
```
Therefore, we choose the qda model with all variables, which gives us auc of 0.9715 and error rate of 0.1206675 on the test data. The QDA models generally provide better prediction than LDA models due to their flexibility. QDA also performs better than KNN, which has a even higher flexibility, since it assumes a quadratic desision boundry given the limited number of obervations. And the models with variables selected in Part II have auc of 0.8761 and 0.8025 for qda and lda respectively, which are pretty good results as well. Therefore, we believe the variables described above do have high predicting power.

## Part IV Conclusion
In this report, we selected 50 variables from World Bank Open Data and examined 20 of them closely. In the multple regression model, we found that Urban Population Percentage, Unemployment Rate, Health Expenditure Percentage of GDP, Immunization DPT Percentage of Children, Access to Improved Sanitation Facilities Percentage, Out-of-pocket Health Expenditure Percentage of Total Health Expenditure and Internet Users per 100 People are significatnt factors that affect Life Expectancy. When we are predicting a country's life expectancy compared with the world average, we found that a qda model with all variables gives us the highest auc. And 7 variables selected before are also very useful in predicting, giving us an auc of 0.8761.

Generally speaking, life expectancy is higher in a country where level of modernization is higher, public health service (including infrastructure and immunication, etc.) is better, citizens' financial capibility is higher and overall education is better.


## Appendix A - Procedure to Delete N.A balues
Start with 14*214 entries   50 variables    70256 empty values

delete entries with 40 or more empty values   2837 entries left   delete variables with more than 2500 na values

delete entries without life expectancy            delete entries with 30 or more empty values   2641 entries left

delete variables with more than 2000 na values    delete entries with 18 or more empty values   2532 entries left

delete variables with more than 1700 na values    delete entries with 14 or more empty values   2484 entries left

delete variables with more than 1000 na values    delete entries with 5 or more empty values    1924 entries left

delete variables with more than 100 na values     delete entries with empty values              1779 entries left

## Appendix B Summary of Data
```{r}
summary(data)
```

## Appendix C Plots of Data
```{r}
# most countries have GDP.per.capita below $10000
hist(data$GDP.per.capita, breaks=10)
# most countries have Population.density below 500 people per sq km of land area
hist(data$Population.density, breaks=4)
# Urban.population.percentage looks like a normal distribution
hist(data$Urban.population.percentage, breaks=50)
# most countries have total Household.consumption.expenditure below $500000
hist(data$Household.consumption.expenditure, breaks=20)
# Unemployment looks like negatively related to LE
plot(data$LE, data$Unemployment)
# most countries have Forest.area.percentage less than 40%
hist(data$Forest.area.percentage, breaks=10)
# most countries have Access.to.improved.water.source.percentage more than 95%
hist(data$Access.to.improved.water.source.percentage, breaks=10)
# most countries have Food.production.index around 100
hist(data$Food.production.index, breaks=30)
# most countries have Arable.land.per.capita less than 0.5
hist(data$Arable.land.per.capita, breaks=20)
# Health.expenditure.percentage.of.GDP looks like positively related to LE
plot(data$LE, data$Health.expenditure.percentage.of.GDP)
# most countries have Immunization.DPT.percentage.of.children more than 90%
hist(data$Immunization.DPT.percentage.of.children, breaks=20)
# most countries have Access.to.improved.sanitation.facilities.percentage more than 90%
hist(data$Access.to.improved.sanitation.facilities.percentage, breaks=20)
# most countries have Immunization.measles.percentage.of.children more than 90%
hist(data$Immunization.measles.percentage.of.children, breaks=20)
# rather diperse, but looks like there is a little negative effects on LE
plot(data$LE, data$Out.of.pocket.health.expenditure.percentage.of.total.health.expenditure)
# rather flat except high frequency for values lwoer than 5
par(mfrow=c(3,1))
hist(data$Fixed.telephone.subscriptions..per.100.people., breaks=20)
hist(data$Mobile.cellular.subscriptions..per.100.people., breaks=20)
hist(data$Internet.users..per.100.people., breaks=20)
par(mfrow=c(1,1))

# LE is growing slowly each year
boxplot(data$LE~data$Year)
```

## Appendix D Plots of Variables with High Coorelation versus Year
```{r}
plot(data$Year, data$GDP.per.capita)
plot(data$Year, data$Access.to.improved.water.source.percentage)
plot(data$Year, data$Immunization.measles.percentage.of.children)
plot(data$Year, data$Access.to.improved.sanitation.facilities.percentage)
plot(data$Year, data$Immunization.DPT.percentage.of.children)
plot(data$Year, data$Internet.users..per.100.people.)
plot(data$Year, data$Mobile.cellular.subscriptions..per.100.people.)
plot(data$Year, data$Fixed.telephone.subscriptions..per.100.people.)
```