Statistics and Population (2024)


The Lee-Carter Model

We apply the Lee-Carter approach to forecasting U.S. mortality, doing all calculations from first principles. This log is best read in conjunction with the class handout.

Lee and Carter use U.S. mortality rates for conventional 5-year age groups up to 85+ by single calendar years from 1933 to 1987, for both sexes.

The data can be obtained from mortality.org. Register for an account, then select U.S.A., navigate to the row for life tables, total, and the column for 5x1. Save the life table as a text file bltper_5x1.txt. This update uses the version last modified on 02 Jun 2022.

The code below shows the steps I followed after saving the web page as bltper_5x1.txt. We need to code age from the group labels and, more importantly, compute the rate for the open-ended age group 85+. For fitting purposes we select ages up to 85 and the years up to 1987, but later use more ages and years.

. import delimit using bltper_5x1.txt, rowrange(3) varnames(3) ///
>   delimit(" ", collapse) clear
(encoding automatically selected: ISO-8859-1)
(11 vars, 2,112 obs)

. replace age = "110" if age == "110+"
(88 real changes made)

. gen pos = strpos(age, "-")

. replace age = substr(age, 1, pos - 1) if pos > 0
(1,936 real changes made)

. rename age age_s

. gen age = real(age_s)

. save us1x5, replace // for later use
file us1x5.dta saved

. replace mx = lx/tx if age == 85
(88 real changes made)

. keep year age mx

. keep if age <= 85 & year <= 1987
(1,067 observations deleted)
> library(dplyr)
> library(stringr)
> library(tidyr)
> us <- read.table("bltper_5x1.txt", skip = 2, header = TRUE)
> us <- mutate(us, age0 = as.numeric(str_extract(us$Age, "[0-9]+")),
+   age = ifelse(age0 >= 5, age0 + 2.5, ifelse(age0 > 0, 3, 0.5)))
> us85 <- mutate(us, mx = ifelse(Age == "85-89", lx/Tx, mx))
> usmx <- select(us85, Year, age, mx) |> filter(age < 90)
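The key step above is the rate for the open-ended group: everyone alive at exact age 85 eventually dies in the interval, so deaths equal l(85) and person-years of exposure equal T(85), giving m(85+) = l(85)/T(85). A minimal Python sketch of that identity (the life-table numbers here are made up for illustration, not taken from the HMD file):

```python
# Death rate for an open-ended age interval: everyone alive at exact
# age x dies within it, so deaths = l(x) and exposure = T(x).
def open_interval_rate(lx, Tx):
    """m(x+) = l(x) / T(x) for the open-ended age group."""
    return lx / Tx

# Hypothetical values: 5,000 survivors to age 85,
# 30,000 person-years lived above age 85.
m85 = open_interval_rate(5_000, 30_000)  # about 0.167
```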

Next we compute the log of the rates. The calculations that follow are a bit simpler if we reshape the data. We will store them in a matrix of 55 years by 19 age groups.

. gen logm = log(mx)

. drop mx

. quietly reshape wide logm, i(age) j(year)
> rates <- filter(usmx, Year <= 1987)$mx
> M <- matrix(log(rates), 55, 19, byrow = TRUE)

Fitting the Model

The first thing we need is the mean log-rate for each age group. This is easily computed using the rowmean function in egen (Stata) or colMeans() (R). The results are remarkably close to Table 1 in the original paper.

. egen a = rowmean(logm1933 - logm1987)

. list age a

     ┌──────────────────┐
     │ age            a │
     ├──────────────────┤
  1. │   0   -3.6419408 │
  2. │   1   -6.6958293 │
  3. │   5   -7.5135297 │
  4. │  10   -7.5654938 │
  5. │  15   -6.7576327 │
     ├──────────────────┤
  6. │  20   -6.4478058 │
  7. │  25   -6.4054185 │
  8. │  30   -6.2271491 │
  9. │  35   -5.9070615 │
 10. │  40   -5.5138606 │
     ├──────────────────┤
 11. │  45   -5.0874851 │
 12. │  50    -4.652652 │
 13. │  55   -4.2607583 │
 14. │  60   -3.8570917 │
 15. │  65   -3.4747763 │
     ├──────────────────┤
 16. │  70   -3.0590744 │
 17. │  75   -2.6391786 │
 18. │  80   -2.2175468 │
 19. │  85   -1.6198283 │
     └──────────────────┘
> a <- colMeans(M)
> a
 [1] -3.641941 -6.695829 -7.513530 -7.565494 -6.757633 -6.447806 -6.405419
 [8] -6.227149 -5.907062 -5.513861 -5.087485 -4.652652 -4.260758 -3.857092
[15] -3.474776 -3.059074 -2.639179 -2.217547 -1.619828

The next step is to subtract the average age pattern a from all years. To do this we copy the data into Mata.

. mata: a = st_data(1::19, "a")

. mata: Y = st_data(1::19, 2..56)

. mata: Y = Y :- a
> for(j in 1:19) M[,j] <- M[,j] - a[j]

We are now ready to compute the Singular Value Decomposition (SVD), which writes M = U D V', where U and V are orthogonal matrices and D is a diagonal matrix of singular values. Before we call the Mata function we need to define the output matrices. In fact we need just the first left and right singular vectors. The first column of U, times the first singular value D[1,1], times the first row of V' gives the best rank-1 approximation to the input matrix.

. mata: U = d = Vt = J(0, 0, .)

. mata: svd(Y', U, d, Vt)
> d <- svd(M, 1, 1)
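To make the rank-1 step explicit, here is a minimal numpy sketch of the same decomposition. This is not the code used above, just an illustration on a random matrix of the same shape:

```python
import numpy as np

rng = np.random.default_rng(42)
M = rng.normal(size=(55, 19))        # stand-in for the centered log-rates

# Full (thin) SVD: M = U diag(d) Vt, with orthonormal U and V
U, d, Vt = np.linalg.svd(M, full_matrices=False)

# Best rank-1 approximation: first left singular vector, times the
# first singular value, times the first right singular vector
M1 = d[0] * np.outer(U[:, 0], Vt[0, :])

# By the Eckart-Young theorem, the squared Frobenius error of the
# rank-1 fit equals the sum of the remaining squared singular values
err = np.linalg.norm(M - M1)
```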

Lee and Carter normalize the first right singular vector (the first row of V') so it sums to one and call it b. This vector models how the different age groups react to mortality change.

. mata: b = Vt[1,]/sum(Vt[1,])

. mata: b[1..5] // partial list
                  1             2             3             4             5
    ┌─────────────────────────────────────────────────────────────────────────┐
  1 │   .0911639354   .1121398614     .09379187   .0832670274   .0498104034   │
    └─────────────────────────────────────────────────────────────────────────┘
> b <- d$v/sum(d$v)
> head(b, 5)
           [,1]
[1,] 0.09116394
[2,] 0.11213986
[3,] 0.09379187
[4,] 0.08326703
[5,] 0.04981040

These values are remarkably close to the b's published in Table 2. Lee and Carter also take the first column of U, multiply by the first singular value, and multiply by the sum of the first row of V' (to cancel the division) and call the result k. This vector captures overall mortality change over time.

. mata: k = U[,1] * sum(Vt[1,]) * d[1]

. mata: k[1..5]' // partial list
                  1             2             3             4             5
    ┌─────────────────────────────────────────────────────────────────────────┐
  1 │   11.39151719   11.86102892   11.35305831   11.64034529   10.85845282   │
    └─────────────────────────────────────────────────────────────────────────┘
> k <- d$u * sum(d$v) * d$d[1]
> head(k, 5)
         [,1]
[1,] 11.39152
[2,] 11.86103
[3,] 11.35306
[4,] 11.64035
[5,] 10.85845
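The normalization matters only for interpretation: dividing the right singular vector by its sum and multiplying k by the same sum leaves the product k b', and hence the fitted log-rates, unchanged. A quick numpy check of that identity (the vectors and singular value below are made-up stand-ins, not the estimates):

```python
import numpy as np

u = np.array([1.0, -0.5, 2.0])   # first left singular vector (stand-in)
v = np.array([0.4, 1.0, 0.6])    # first right singular vector (stand-in)
d1 = 3.7                         # first singular value (stand-in)

b = v / v.sum()                  # Lee-Carter b: scaled to sum to one
k = u * v.sum() * d1             # Lee-Carter k: absorbs the scale factors

# The outer product k b' equals the original rank-1 term d1 u v'
assert np.allclose(np.outer(k, b), d1 * np.outer(u, v))
```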

Plotting Parameters and Fits

The next task is to compute the Lee-Carter fits to the mortality rates in 1933 and 1987, reproducing part of Figure 2 in the original paper. We are done with Mata, so we copy b and k into Stata.

. mata: st_store(1::19, st_addvar("float","b"), b')

. set obs 55
Number of observations (_N) was 19, now 55.

. mata: st_store(., st_addvar("float","k"), k)

. gen fit1933 = a + b * k[1]
(36 missing values generated)

. gen fit1987 = a + b * k[1987-1932]
(36 missing values generated)

. gen agem = age + 2.5
(36 missing values generated)

. replace agem = 0.5 in 1
(1 real change made)

. replace agem = 3 in 2
(1 real change made)

. twoway (scatter logm1933 agem) (line fit1933 agem) ///
>   (scatter logm1987 agem) (line fit1987 agem) ///
>   , title(Lee-Carter fits for 1933 and 1987) xt(age) legend(off)

. graph export lcfig1.png, width(500) replace
file lcfig1.png saved as PNG format

[Figure 1: Lee-Carter fits for 1933 and 1987 (Stata)]

> library(ggplot2)
> usmx2 <- filter(usmx, Year == 1933 | Year == 1987) |>
+   mutate(year = factor(Year),
+     fit = c(exp(a + b * k[1]), exp(a + b * k[55])))
> ggplot(usmx2, aes(age, mx, color = year)) + geom_point() +
+   geom_line(aes(age, fit, color = year)) + scale_y_log10() +
+   ggtitle("Lee-Carter Fits for 1933 and 1987")
> ggsave("lcfig1r.png", width = 500/72, height = 400/72, dpi = 72)

[Figure 1, R version: Lee-Carter Fits for 1933 and 1987]

Here’s the trajectory of k:

. gen t = 1932 + _n

. line k t, title("Lee-Carter k for 1933-1987") xt(year)

. graph export lcfig2.png, width(500) replace
file lcfig2.png saved as PNG format

[Figure 2: Lee-Carter k for 1933-1987 (Stata)]

> trend <- data.frame(year = 1933:1987, k = k)
> ggplot(trend, aes(year, k)) + geom_line() +
+   ggtitle("Lee-Carter k for 1933-1987")
> ggsave("lcfig2r.png", width = 500/72, height = 400/72, dpi = 72)

[Figure 2, R version: Lee-Carter k for 1933-1987]

A Random Walk

We now simulate the stochastic process driving k, a random walk with drift -0.365 and innovations with standard deviation 0.652. The idea is to generate normally distributed errors, add the drift, and then add to each value the previous one, which is most easily done as a running sum. We do this 50 times. For clarity I use a temporary variable called e for the errors.

. set obs 61
Number of observations (_N) was 55, now 61.

. gen n = _n

. gen e = .
(61 missing values generated)

. forvalues r = 1/50 {
  2.   quietly replace e = rnormal() * 0.652
  3.   quietly gen k`r' = -11.05 - 0.365 + e in 1
  4.   quietly replace k`r' = k`r'[_n-1] - 0.365 + e in 2/L
  5. }

. twoway (line k1-k50 n), legend(off) xt(year) title(50 Random Walks)

. graph export lcfig3.png, width(500) replace
file lcfig3.png saved as PNG format

[Figure 3: 50 Random Walks (Stata)]

> S <- matrix(0, 61, 50)
> for(j in 1:50) S[,j] <- -11.05 + cumsum(-0.365 + rnorm(61, 0, 0.652))
> df <- data.frame(S); df$year <- 1990:2050
> ggplot(gather(df, sim, k, -year), aes(year, k, color = sim)) +
+   geom_line() + guides(color = "none") +
+   ggtitle("Fifty Random Walks")
> ggsave("lcfig3r.png", width = 500/72, height = 400/72, dpi = 72)

[Figure 3, R version: Fifty Random Walks]

This being a simulation, your results will differ from mine. You should note that there is considerable uncertainty about the level of mortality 50 years into the future.
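The random walk with drift can be written k(t) = k(t-1) + d + e(t), so each path is just the starting value plus a cumulative sum of drifted innovations. A numpy sketch paralleling the Stata and R simulations above (same drift and standard deviation, illustrative seed):

```python
import numpy as np

def random_walk(k0, drift, sigma, steps, rng):
    """Simulate k(t) = k(t-1) + drift + N(0, sigma^2) as a running sum."""
    innovations = drift + rng.normal(0.0, sigma, size=steps)
    return k0 + np.cumsum(innovations)

rng = np.random.default_rng(2024)
# Fifty walks of 61 steps using the drift and sd estimated above
paths = np.column_stack([random_walk(-11.05, -0.365, 0.652, 61, rng)
                         for _ in range(50)])
```

Setting sigma to zero recovers the deterministic trend k0 + drift * t, which is a handy sanity check on the cumulative-sum construction.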

Age-Specific Forecasts

Let us forecast age-specific mortality in 2050 starting from 1989. To this end we will use the official values of a and b, which I saved in a text file called leecarter.dat (also a Stata file leecarter.dta) in the datasets section. These values extend the two sets of parameters to age 105+, but are otherwise practically identical to the estimates we just obtained.

Next we compute k starting from -11.05 in 1989. I call the resulting scalar km to avoid conflict with the variable k. We also compute the standard deviation and a 95% confidence interval. At this point we restore the original data to get all ages.

. use us1x5, clear

. keep if age < 110
(88 observations deleted)

. gen logm = log(mx)

. keep age year logm

. merge m:1 age using https://grodri.github.io/datasets/leecarter

    Result                      Number of obs
    ─────────────────────────────────────────
    Not matched                             0
    Matched                             2,024  (_merge==3)
    ─────────────────────────────────────────

. gen agem = age + 2.5

. quietly replace agem = 0.5 if age == 0

. quietly replace agem = 3 if age == 1

. scalar km = -11.05 - 0.365*(2050 - 1989)

. scalar sd = sqrt(2050 - 1989)*0.652

. di km - 1.96*sd, km + 1.96*sd
-43.295874 -23.334126

. gen p2050 = lc_a + lc_b * km

. gen lb2050 = lc_a + lc_b * (km - 1.96*sd)

. gen ub2050 = lc_a + lc_b * (km + 1.96*sd)

. twoway (rarea ub2050 lb2050 agem, color(ltblue)) (line p2050 agem) ///
>   , legend(off) title(Forecast for 2050 with 95% CI) xt(age)

. graph export lcfig4.png, width(500) replace
file lcfig4.png saved as PNG format

[Figure 4: Forecast for 2050 with 95% CI (Stata)]

> lc <- read.table("leecarter.dat", header = TRUE)
> k2050 <- -33.3
> z <- qnorm(0.975)
> se <- sqrt(2050 - 1989) * 0.652
> k2050 + c(-1, 1) * z * se
[1] -43.28069 -23.31931
> forecast <- data.frame(age = lc$age,
+   fit = exp(lc$a + lc$b * k2050),
+   low = exp(lc$a + lc$b * (k2050 - z * se)),
+   hi = exp(lc$a + lc$b * (k2050 + z * se)))
> ggplot(forecast, aes(age, fit)) + scale_y_log10() +
+   geom_ribbon(aes(ymin = low, ymax = hi), fill = "#d0d0d0") +
+   geom_line() + ggtitle("Figure 4: Forecast for 2050 with 95% CI")
> ggsave("lcfig4r.png", width = 500/72, height = 400/72, dpi = 72)

[Figure 4, R version: Forecast for 2050 with 95% CI]

So we are 95% confident that k will be between -43.3 and -23.3 in 2050. Finally we combine k and the a and b schedules to forecast and plot age-specific mortality (in the log scale).
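The interval follows because the innovations are independent: after a horizon of T years the forecast of k has mean k0 + T * drift and standard deviation sqrt(T) * sigma. A small Python sketch of this calculation, using the drift and sd from above:

```python
import math

def k_forecast_interval(k0, drift, sigma, horizon, z=1.96):
    """Point forecast and normal CI for a random walk with drift.

    Mean grows linearly with the horizon; the standard deviation
    grows with its square root, since the T innovations are
    independent and their variances add.
    """
    mean = k0 + drift * horizon
    sd = math.sqrt(horizon) * sigma
    return mean, mean - z * sd, mean + z * sd

# Forecast of k in 2050, starting from k = -11.05 in 1989
km, lo, hi = k_forecast_interval(-11.05, -0.365, 0.652, 2050 - 1989)
# km is about -33.3 with interval (-43.3, -23.3), matching the output above
```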

Corrected Jump-off

The final issue concerns the choice of jump-off. Below we will make a “forecast” for 2013, the latest year for which we have data. The prediction is not too good for young adults because of a slight lack of fit in 1989 which propagates in the forecast. The solution is to “jump off” from the actual 1989 rates.

. keep if year == 1989 | year == 2013
(1,978 observations deleted)

. drop _merge

. reshape wide logm, i(age lc_a lc_b) j(year)
(j = 1989 2013)

Data                               Long   ->   Wide
─────────────────────────────────────────────────────────────────
Number of observations               46   ->   23
Number of variables                   9   ->   9
j variable (2 values)              year   ->   (dropped)
xij variables:
                                   logm   ->   logm1989 logm2013
─────────────────────────────────────────────────────────────────

. scalar drift = -0.365 * (2013 - 1989)

. scalar se = 0.652 * sqrt(2013 - 1989)

. scalar z = invnormal(0.975)

. gen p2013 = lc_a + lc_b * (-11.05 + drift)

. gen j2013 = logm1989 + lc_b * (drift)

. gen ub2013 = logm1989 + lc_b * (drift - z * se)

. gen lb2013 = logm1989 + lc_b * (drift + z * se)

. twoway (rarea ub2013 lb2013 agem, color(ltblue)) ///
>   (scatter logm2013 agem if age < 110) ///
>   (line p2013 agem, lpat(dash)) (line j2013 agem if age < 110) ///
>   , title(Observed and Forecast for 2013) xt(age) ///
>   legend(order(3 "standard" 4 "jumpoff 1989") ring(0) pos(5) cols(1))

. graph export lcfig5.png, width(500) replace
file lcfig5.png saved as PNG format

[Figure 5: Observed and Forecast for 2013 (Stata)]

> later <- select(us, age, Year, mx) |> filter(Year == 1989 | Year == 2013)
> a89 <- log(filter(later, Year == 1989, age < 110)$mx)
> k89 <- -(2013 - 1989) * 0.365
> se <- sqrt(2013 - 1989) * 0.652
> y2013 <- filter(later, Year == 2013, age < 110) |>
+   mutate(standard = exp(lc$a - lc$b * 19.81),
+     jumpoff = exp(a89 + lc$b * k89),
+     low = exp(a89 + lc$b * (k89 - z * se)),
+     hi = exp(a89 + lc$b * (k89 + z * se)))
> ggplot(y2013, aes(age, mx)) + scale_y_log10() +
+   geom_ribbon(aes(ymin = low, ymax = hi), fill = "#d0d0d0") +
+   geom_point() + geom_line(aes(age, standard)) + geom_line(aes(age, jumpoff)) +
+   ggtitle("Observed and Forecast for 2013")
> ggsave("lcfig5r.png", width = 500/72, height = 400/72, dpi = 72)

[Figure 5, R version: Observed and Forecast for 2013]

The performance of the model with the correct jump-off for a 24-year forecast is nothing short of remarkable.
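The corrected jump-off simply replaces the fitted 1989 log-rates with the observed ones and applies the same drift: log m(x, 2013) = log m(x, 1989) + b(x) * drift * 24. A sketch of this step in Python, with made-up a priori values for three age groups (not the actual estimates):

```python
import numpy as np

def jumpoff_forecast(logm_base, b, drift, years):
    """Forecast log-rates from observed base-year rates (the jump-off)
    rather than from the fitted schedule a + b*k."""
    return logm_base + b * drift * years

# Hypothetical observed 1989 log-rates and b's for three age groups
logm_1989 = np.array([-3.6, -5.1, -2.2])
b = np.array([0.09, 0.05, 0.02])

logm_2013 = jumpoff_forecast(logm_1989, b, -0.365, 2013 - 1989)
```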

Update. Now that we have more data we can test the model further. It turns out the model doesn’t do so well for later years, because there was an unexpected rise in adult mortality at the working ages starting around 2010, not to mention the pandemic and the subsequent increase in mortality in 2020. You may wish to explore these issues using the available data.

Reference

HMD. Human Mortality Database. Max Planck Institute for Demographic Research (Germany), University of California, Berkeley (USA), and French Institute for Demographic Studies (France). Available at https://www.mortality.org.
