Bolin Wu

Causal Inference 2: Propensity Score and Matching

2021-05-05 · 20 min read
R causal inference

In the Part 1 we talked about the basic concepts of causal effect and confounding. In this post we will proceeed with discussing about how to control the confounders with matching.

The main content consists of:

  • The concept of matching.
  • The concept of propensity score.
  • Demonstrate how to match on propensity score in R.
  • Analyze the outcome after matching.


Randomized trials

In a randomized trial, treatment assignment A would be determined by a coin toss, in order to erase the effect of X to A. It is a good way to get rid of confounding. However, the researchers may not always use randomized trials for the following reasons:

  • Randomized trials are expensive.
  • Sometimes randomizing treatment/exposure is unethical.
  • Some people may refuse to participate in trials.
  • Randomized trials take time.
    And that brings us the observational studies.

Observational studies are planned, prospective, observational studies with active data collection:

  • Like trials: data collected on a common set of variables at planned times; outcomes carefully measured; study protocols.
  • Unlike trials: regulations much weaker, since not intervening; broader population eligible for the study.

What is matching?

Matching is a method that attempts to make an observational study more like a randomized trial.
Main idea: Match individuals in the treated group A = 1 to individuals in the control group A = 0 on the covariates X.
Advantages of matching:

  • Controlling for confounders is achieved at the design phase.
  • Matching will reveal lack of overlap in covariate distribution. The positivity assumption will hold in the population that can be matched.
  • Once data are matched, essentially treated as if from a randomized trial.

Fine balance

Sometimes it is difficult to find great matches. We may be willing to accept some non-ideal matches if treated and control groups have same distribution of covariates. For example:

  • Match 1: treated, male, age 40 and control, female, age 45
  • Match 2: treated, female, age 45 and control, male, age 40

We can see that the neighter match 1 nor match 2 is a good match. However, if we look at them together, we will find that the average age and percent of male/female are the same. Therefore it is a fine balance. There is a lot of software that can impose fine balance.

How to match?

Since in real data, we typically can not match exactly, we first need to choose some metric of closeness. There are two common options:

  1. Mahalanobis distance
  2. Robust Mahalanobis distance

Mahalanobis distance is defined as follows.
Denote the value of a vector of covariates for subject j by XjX_{j} and the covariance matrix by S. D(Xi,Xj)=(XiXj)TS1(XiXj)D(X_{i}, X_{j}) = \sqrt{(X_{i} - X_{j})^{T}S^{-1}(X_{i} - X_{j})}

Robust Mahalanobis distance is aiming at solving the outliers problem. Outliers can creat large distances between subjects, even if the covariates are otherwise similar. Its application is as follows.

  • Replace each covariate value with its rank. For example, we set the rank of the youngest age as 1, the second youngest as 2, etc. So the difference between the youngest and the second youngest is only 1.
  • Constant diagonal on covariance matrix. Now ranks should all be on the same scale now, so that we do not want to weigh one variable more than the other.
  • Calculate the usual Mahalanobis distance on the ranks.

Greedy (nearest-neighbor) matching


  1. Randomly order list of treated subjects and control subjects.
  2. Start with the first treated subject. Match to the control with the smallest distance. This is called greedy.
  3. Remove the matched control from the list of available matches.
  4. Move on to the next treated subject. Match to the control with the smallest distance.
  5. Repeat steps 3 and 4 until you have matched all treated subjects.
    In R, this can be done by R package "MatchIt".


We might prefer to exclude treated subject for whom there does not exist a good match. A bad match can be defined by using a caliper --- maximum acceptable distance. Main idea:

  • Only matcha treated subject if the best match has distance less than the caliper.
  • If no matches within caliper, it is a sign that positivity assumption would be violated. Excluding these subjects makes assumption more realistic.

Optimal matching

Why greedy matching is not optimal? Greedy matching does not lead to the smallest total distance.

The optimal matching minizes global distance measure. It is conputationally demanding. Whether it is feasible to perform depends on the size of the problem, which is measured by the number of posdsible treatment-control pairings. R package "optmatch" and "rcbalance" can do the work.

Consatraints can be imposed to make optimal matching computationally feasible for larger data sets. And that brings us to the sparse optimal matching.

Sparce optimal matching

  1. We could match treated and control subjects within primary disease category.
  2. Or we focus on achieving optimal matching withing a subgroup of the treated or control subjects.
  3. Also, we can tolarate mismatches as long as fine balance can still be achieved.

Assessing balance

After we have matched, we should access whether matching seemed to work. We can look at a couple of things.

  • Covariate balance
    • standardied differemces. smd=XˉtreatmentXˉcontrol(streatment2+scontrol2)/2smd = \frac{\bar{X}_{treatment} - \bar{X}_{control}}{\sqrt{(s_{treatment}^{2} + s_{control}^{2}) / 2 }}
    • Wheter the means between the two groups are similar?
  • Hypothesis tests: Test for a difference in means between treated and controls for each convariate.
    • Two sample t-tests
    • Report p-value for each test. Please note that the p-value is dependent on the sample size.

Analyzing data after matching

After successfully matching and achieving adequate balance, proceed with outcome analysis.

Randomization tests

The main idea is as follows:

  1. Compute test statistics from observed data
  2. Assume null hypothesis of no treatment effect is true
  3. Randomly permute treatment assignment within pairs and recompute test statistic.
  4. Repeat many times and see how unusual observed statistic is.

This procedure is equivalent to the McNemar test for paried data or paired t-test for continuous data.

Sensitivity analysis

The previous mentioned matching aims to achieve balance on observed covariates. If there was imbalance on observed covariates, we call it overt bias. However, if there was imbalance on the variables that we did not match on (including unobserved variables), then we have hidden bias. Then the ignorability assumption violated. The main idea of sensitivity analysis is that if there is hidden bias, determine how severe it would have to be change conclusions.

Let πj\pi_{j} be the probability that person j receives treatment. πk\pi_{k} be be the probability that person k receives treatment. Suppose person j and k are perfectly matched, so that their observed covariates XjX_{j} and XkX_{k} are the same. If πj\pi_{j} = πk\pi_{k}, then there is no hidden bias. Consider the follow following inequality 1Γπj/(1πj)πk/(1πk)Γ\frac{1}{\Gamma} \leq \frac{\pi_{j}/(1 - \pi_{j})} {\pi_{k} / (1 - \pi_{k})} \leq \Gamma. Γ\Gamma is the odds ratio. If Γ\Gamma = 1, then no overt bias. If Γ\Gamma >1 implies hidden bias.

Propensity score

The propensity score is the probability of receiving treatment, rather than control, given covariates X. If we define A =1 for treatment and A = 0 for control. We will denote the propensity score for subject i by πi\pi_{i}.

πi=P(A=1Xi)\pi_{i} = P(A = 1|X_{i})

If a person i has a propensity score value of 0.3, that means, given their particular covariate values, there is a 30% chance they will be treated.

Balancing score

Suppose two subjects have the same value of the propensity score, but they possibly have different covariate values of X. Despite the different covariate values, they were both equally likely to have been treated. This means that both subjects X is just as likely to be found in the treatment group. If you restrict to a subpopulation of subjects who have the same value of the propensity score, there sould be balance in the two treatment groups. Then the propensity score is a balancing score. More formally,

P(X=xπ(X)=p,A=1)=P(X=xπ(X)=p,A=0)P(X=x|\pi(X) = p, A = 1) = P(X=x|\pi(X) = p, A = 0)

Implication: if we match on the propensity score, we should achieve balance. Another way of thinking this is that, considering we assumed ignorability - that treatment is randomized given X. Then conditioning on the propensity score is the same as conditioning on an allocation probability.

Estimated propensity score

In a randomized trial, the propensity score is generally known. However, in an observational study, it will be unknown. Notice that the propensity score involves observed data: A and X. Therefore we can estimate it. Typically when people talk about a propensity score, they are referring to the estimated propensity score. We need to estimate $$P(A=1|X)$$ with outcome to be A. If A is binary, we can estimate the propensity score by using logistic regression.

Propensity score matching

Trimming tails

If there is a lack of overlap, trimming the tails is an option. This means removing subjects who have extreme values of the propensity score. For example, removing control subjects whose propensity score is less than the minimum in the treatment group. Trimming can make the positivity assumption more reasonable.


We can proceed by computing a distance between the propensity score for each treated subject with every control. Then we use the nearest neighbor or optimal matching as before. In practice, logit of the propensity score is often used, rather than the propensity score itself. The reason is that the propensity score is bounded between 0 and 1, making many values similar. However, the logit of the propensity score is unbounded. This transformation essentially stretches the distribution, while preserving ranks.

To ensure that we do not accept any bad matches, a caliper can be used. In practice, a common choice for a caliper is the 0.2 times the standard deviation of logit of the propensity score. The procedure will be like as follows:

  1. Estiamte the propensity score (e.g. using logistic regression).
  2. Logit-transform the propensity score.
  3. Take the standard deviation of this transformed variable.
  4. Set the caliper to 0.2 times the value from step 3.

Please note that the 0.2 can be set to an arbitrary number. This propensity score matching can be done by R package MatchIt.

Analyze data in R using propensity score matching

we will use data from Lalonde (1986), that aimed to evaluate the impact of National Supported Work (NSW) Demonstration, which is a labor training program, on post-intervention income levels. Interest is in estimating the causal effect of this training program on income.

Load package and data

First load the packages tableone and Matching:

# install.packages("tableone")
# install.packages("Matching")
# install.packages("MatchIt")


The data have n=614 subjects and 9 variables:

  • age: age in years.
  • educ: years of schooling.
  • race: indicator variable for races.
  • married: indicator variable for marital status.
  • nodegree: indicator variable for high school diploma.
  • re74: real earnings in 1974.
  • re75: real earnings in 1975.
  • re78 real earnings in 1978.
  • treat: an indicator variable for treatment status.

The outcome is re78 – post intervention income. The treatment is treat – which is equal to 1 if the subject received the labor training and equal to 0 otherwise.

The potential confounding variables are: age, educ, black, hispan, married, nodegree, re74, re75.

Pre-matching examination

Before mathching, let us find the standardized differences for all of the confounding variables.

xvars = colnames(lalonde %>% select(-c("treat","re78")))
tabel1 = CreateTableOne(vars = xvars, strata = "treat", data = lalonde, test = F)
print(tabel1,smd = T)

                      Stratified by treat
                       0                 1                 SMD   
  n                        429               185                 
  age (mean (SD))        28.03 (10.79)     25.82 (7.16)     0.242
  educ (mean (SD))       10.24 (2.86)      10.35 (2.01)     0.045
  race (%)                                                  1.701
     black                  87 (20.3)        156 (84.3)          
     hispan                 61 (14.2)         11 ( 5.9)          
     white                 281 (65.5)         18 ( 9.7)          
  married (mean (SD))     0.51 (0.50)       0.19 (0.39)     0.719
  nodegree (mean (SD))    0.60 (0.49)       0.71 (0.46)     0.235
  re74 (mean (SD))     5619.24 (6788.75) 2095.57 (4886.62)  0.596
  re75 (mean (SD))     2466.48 (3292.00) 1532.06 (3219.25)  0.287

Q1. What is the standardized difference for married (to nearest hundredth)?
Answer: From the printed table above we can see that the standardized difference for married is nearly 0.719.

Q2. What is the raw (unadjusted) mean of real earnings in 1978 for treated subjects minus the mean of real earnings in 1978 for untreated subjects?

re78_mean = lalonde %>% group_by(treat) %>% summarise_at("re78", .funs = mean)
(re78_mean %>% filter(treat ==1) - re78_mean %>% filter(treat ==0))$re78
# The difference is -635

Matching on propensity score (without caliper)

Now let us estimated a propensity score model by using the logistic regression mdoel, where the outcome is treatment. Obtain the propensity score for each subject.

psmodel = glm(treat~.-re78, family = binomial(),data = lalonde)
# find the propneisty scrore
pscore = psmodel$fitted.values

Carry out propensity score by using the Match function

psmatch = Match(Tr = lalonde$treat, M = 1, X = pscore, replace = F)
matched = lalonde[unlist(psmatch[c("index.treated","index.control")]),]
matchedtab1 = CreateTableOne(vars = xvars, strata = "treat", data = matched, test = F)
print(matchedtab1,smd = T)

                      Stratified by treat
                       0                 1                 SMD   
  n                        185               185                 
  age (mean (SD))        25.22 (10.57)     25.82 (7.16)     0.066
  educ (mean (SD))       10.54 (2.72)      10.35 (2.01)     0.079
  race (%)                                                  0.855
     black                  87 (47.0)        156 (84.3)          
     hispan                 41 (22.2)         11 ( 5.9)          
     white                  57 (30.8)         18 ( 9.7)          
  married (mean (SD))     0.21 (0.41)       0.19 (0.39)     0.041
  nodegree (mean (SD))    0.64 (0.48)       0.71 (0.46)     0.139
  re74 (mean (SD))     2383.67 (4287.87) 2095.57 (4886.62)  0.063
  re75 (mean (SD))     1651.45 (2671.39) 1532.06 (3219.25)  0.040

From the column of SMD, we can see that the values are smaller. This is exactly what we want to see. Usually a boudnary value is 0.1. We would pay attention to the variates with SMD > 0.1. If it is too big, then maybe there is imbalance.

Q3. What is the standardized difference for married?
Answer: It is reduced to 0.041. We end up with better matching.

We can also try carrying out the propensity score matching using the matchit function. And with this function we can easily find the propensity overlapping. The default measure is logistic regression propensity scores.

# use matchit for propensity score, nearest neighbor matching
# the default measure is logistic regression propensity scores.
m.out = matchit(treat~. - re78, data = lalonde, method = "nearest")

# propensity score plots
plot(m.out, type = "hist")
plot(m.out, type = "jitter")

Check overlap by jitter plot

Check overlap by histogram

Matching on propensity score (with caliper)

Re-do the matching and set the caliper = 0.1.

# re-do matching using a caliper
psmatch = Match(Tr = lalonde$treat, M = 1, X = pscore, replace = F,caliper = 0.1)
matched = lalonde[unlist(psmatch[c("index.treated","index.control")]),]
matchedtab1 = CreateTableOne(vars = xvars, strata = "treat", data = matched, test = F)
print(matchedtab1,smd = T)

                      Stratified by treat
                       0                 1                 SMD   
  n                        111               111                 
  age (mean (SD))        26.46 (11.39)     26.22 (7.18)     0.026
  educ (mean (SD))       10.39 (2.70)      10.25 (2.31)     0.054
  race (%)                                                  0.048
     black                  80 (72.1)         82 (73.9)          
     hispan                 11 ( 9.9)         11 ( 9.9)          
     white                  20 (18.0)         18 (16.2)          
  married (mean (SD))     0.22 (0.41)       0.24 (0.43)     0.064
  nodegree (mean (SD))    0.63 (0.48)       0.65 (0.48)     0.037
  re74 (mean (SD))     2545.63 (4347.60) 2250.49 (5746.14)  0.058
  re75 (mean (SD))     1740.06 (2725.04) 1222.25 (3081.19)  0.178

Q4. For the matched data, what is the mean of real earnings in 1978 for treated subjects minus the mean of real earnings in 1978 for untreated subjects?

y_trt = matched$re78[matched$treat==1]
y_con = matched$re78[matched$treat==0]
mean(y_trt) - mean(y_con)
[1] 893.4831

The difference is around 893.

Q5. Carry out a paired t-test for the effect of treatment on earnings. What are the values of the 95% confidence interval?

diffy = y_trt - y_con

# paired t-test

One Sample t-test

data:  diffy
t = 1.0297, df = 110, p-value = 0.3054
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -826.1391 2613.1052
sample estimates:
mean of x

From the result we can not reject the null that there is no effect of treatment on earnings.


In this post we have introduced the powerful method of controling confounders, matching on propensity score. We also have demonstrated how to implement the analysis in R. In general, propensity score matching involves the following steps:

  1. Estimate propensity score;
  2. Match on propensity score;
  3. Check propensity score overlap;
  4. Check covariate balance

My personal biggest take away after learning this is that, it is not difficult to perform matching, however, we need to be careful when evaluating the outcomes of matching. How does the overlapping of matched propensity score looks like? Whether the covariates are balanced? What does the treatment difference test tells us? These questions need to be throughly considered.

Next we will continue to share how to estimate the causal effect by inverse probability of treatment weighting. Thank you for reading.

Prudence is a fountain of life to the prudent.