Expectation Maximization (EM)
The MSI Statistical Training Series offers tutorials on a range of topics that are commonly applied in data analysis. The objective is to i) refresh readers on the derivation and application of methods from a first statistics course, ii) provide underlying theory that goes beyond an introductory statistics course, iii) introduce readers to new methods that would not be covered in an introductory statistics course, and iv) provide practical, hands-on coding examples so that users of statistical analysis software gain the proficiency not just to run statistical procedures, but also to derive them from the underlying theory.
Exploration of each topic should include the underlying theory, a demonstration using toy or fake data, and a real-world application in data that MSI has generated for a client deliverable.
The Expectation Maximization (EM) algorithm is a popular approach to maximum likelihood estimation with missing data. The seminal reference is (Dempster, Laird, and Rubin 1977), which is among the most frequently cited articles in the field of statistics.
The EM algorithm applies to problems of missing data, which extend beyond the most direct use case of missing data imputation. For example, one may suspect that there is hidden structure in the data, and that the missing data are labels identifying this structure (sex in a data set of heights; seasons in a data set of temperatures).
We may have data on an outcome Y that shows a bimodal distribution. We may suspect that this bimodal shape reflects the missing variable of sex - a missing label that designates which sex each outcome observation belongs to. Or we may have a set of covariate values for an outcome, with particular patterns in those covariates that suggest the presence of a hidden, or latent, variable. This leads to the use of EM for identifying these missing or latent variables, with one approach being Latent Class Analysis (LCA).
The EM algorithm comprises two steps. First, in the expectation (E) step, we compute the expected value of the missing data, given the observed data and ‘current’ parameter values.1 For a latent class model, we compute \(P(X=c|y_1, \dots, y_j)\) using the current parameter estimates. Then, in the maximization (M) step, we re-estimate the model probabilities \(P(X=c)\) and \(P(y_j|X=c)\), treating class membership as if it were observed and using the \(P(X=c|y_1, \dots, y_j)\) as weights. Somehow, each iteration of this process improves the estimates until the algorithm converges to a solution.2
1 ‘Current’ only denoting that we first start off with a guess of parameter values, then subsequently use the updated parameter values generated from the previous iteration of the EM algorithm.
2 A proof exists, but I don’t yet understand intuitively how this is possible.
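Written out for a latent class model like the one used below, one iteration of these two steps can be expressed as follows (a sketch of the standard updates, with \(c\) indexing classes, \(j\) indexing items, and \(i\) indexing respondents):

\[ P(X=c|y^{(i)}) = \frac{P(X=c)\prod_j P(y^{(i)}_j|X=c)}{\sum_{c'} P(X=c')\prod_j P(y^{(i)}_j|X=c')} \qquad \text{(E step)} \]

\[ P(X=c) \leftarrow \frac{1}{N}\sum_i P(X=c|y^{(i)}), \qquad P(y_j=1|X=c) \leftarrow \frac{\sum_i \mathbf{1}[y^{(i)}_j=1]\,P(X=c|y^{(i)})}{\sum_i P(X=c|y^{(i)})} \qquad \text{(M step)} \]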
We will demonstrate an application of the EM algorithm using Latent Class Analysis.
Consider a survey asking respondents about their attitude towards the limits of tolerance of anti-religious sentiment.3 The questions ask whether anti-religionists should be allowed to speak (y1), whether anti-religionists should be allowed to teach (y2), and whether anti-religious books ought to be allowed in public libraries (y3). For each item, 1 indicates agreement (allowed to speak, allowed to teach, allowed to keep books in the library), interpreted as high tolerance for anti-religious sentiment, and 2 indicates disagreement, interpreted as low tolerance for anti-religious sentiment.
3 These data come from the 1987 General Social Survey (GSS). See the gssr package for data on all years.
d <- read_csv("gss87 antirel.csv")
flextable(d[1:5,])

| y1 | y2 | y3 | cum |
|---|---|---|---|
| 1 | 1 | 1 | 1-1-1 |
| 1 | 1 | 1 | 1-1-1 |
| 1 | 2 | 2 | 1-2-2 |
| 1 | 1 | 1 | 1-1-1 |
| 1 | 1 | 1 | 1-1-1 |
#| output: asis
tab_stackfrq(d[,1:3],
             digits=1)

|   | 1 | 2 |
|---|---|---|
| y1 | 68.2 % | 31.8 % |
| y2 | 47.7 % | 52.3 % |
| y3 | 66.0 % | 34.0 % |
Roughly two-thirds of respondents agreed with items one and three, while item two was about evenly split. Perhaps there are two broad groups of respondents: those who are generally tolerant of anti-religious sentiment, and those who are not. We will now go through a series of steps in implementing the EM algorithm to identify these latent classes.
We are interested in using the patterns of responses on the y1, y2, y3 variables to identify these two broad classes of respondents, so we create a new variable (cum) that captures each respondent’s pattern of responses. We can then tabulate the frequencies of these patterns.
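The cum variable is already present in the csv; as a minimal sketch (assuming the data frame d loaded above), such a pattern variable could be built by pasting the three items together:

```r
# Hedged sketch: build the response-pattern variable by pasting the three items
# together, e.g. responses 1, 2, 2 become the string "1-2-2".
d <- d %>%
  mutate(cum = paste(y1, y2, y3, sep = "-"))
```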
freqs <- data.frame(frq(d$cum)) %>%
  .[1:8, c(2,4,5)] %>%
  mutate(perc=round(raw.prc/100,3)) %>%
  dplyr::select(-raw.prc)
flextable(freqs)

| val | frq | perc |
|---|---|---|
| 1-1-1 | 696 | 0.406 |
| 1-1-2 | 68 | 0.040 |
| 1-2-1 | 275 | 0.160 |
| 1-2-2 | 130 | 0.076 |
| 2-1-1 | 34 | 0.020 |
| 2-1-2 | 19 | 0.011 |
| 2-2-1 | 125 | 0.073 |
| 2-2-2 | 366 | 0.214 |
We suspect that the patterns of responses on the three indicator variables arise from the existence of two components, or latent classes, in the data. We will call this latent variable X, which can take on the values of component 1 (X=1) or component 2 (X=2). We will refer to the observed indicator variables as \(y_1, y_2, \dots, y_j\).
We want to use the observed frequencies of the response patterns to infer the existence of the components X. To do that, we posit a data-generating process, a model of \(P(y_1, y_2, y_3)\), the pattern probabilities tabulated above. There are two key assumptions of this model:
The observed frequencies of the response patterns \(y_1, y_2, y_3\) are the result of a mixture of two separate components, each with its own distribution. If this were true, then the observed frequency could be expressed as
\[ P(y_1, y_2, y_3)=P(X=1)P(y_1, y_2, y_3|X=1) + P(X=2)P(y_1, y_2, y_3|X=2) \tag{1}\]
Note that the probabilities of the indicator variables for each component4, as well as the share of each component in the sample5, are not observed - they are the parameters of the model.
Within each component, responses are independent (local independence): knowing \(y_1\) offers no information about \(y_2\). If this is true, then within a component the probability of observing a given pattern of responses is simply the product of the item probabilities.
\[ P(y_1,y_2,y_3|X=1)=P(y_1|X=1)P(y_2|X=1)P(y_3|X=1) \tag{2}\]
Again note that the notation above refers not to observed probabilities, but to the parameter probabilities of the model.
4 \(P(y_i|X=c)\)
5 \(P(X=c)\)
The above model assumptions identify two sets of parameters that generate our observed information. The first set is the proportion of the sample belonging to each component: \(P(X=1)\) and \(P(X=2)\) from model assumption (1) above. The second is the probability of agreement on each indicator variable (y1, y2, y3) within each component: \(P(y_1=1|X=c), P(y_2=1|X=c), P(y_3=1|X=c)\) for \(c=1,2\). Knowing these parameter values allows us to generate the joint probability of any pattern of responses on the indicator variables, as described in model assumption (2).
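As a small sketch of how these two assumptions combine (a hypothetical helper, not part of the walkthrough below), the model-implied probability of a single response pattern is the class-share-weighted sum of the within-class products:

```r
# Hedged sketch: model-implied probability of one response pattern (1 = agree,
# 2 = disagree), combining assumption (1) (mixture over components) and
# assumption (2) (local independence within a component).
# `p_class` holds P(X=1) and P(X=2); `p_item` is a 2 x 3 matrix of P(y_j=1|X=c).
pattern_prob <- function(y, p_class, p_item) {
  within_class <- apply(p_item, 1, function(p) prod(ifelse(y == 1, p, 1 - p)))
  sum(p_class * within_class)
}

# With the initial guesses used below: .5*.6^3 + .5*.4^3 = 0.14
pattern_prob(c(1, 1, 1),
             p_class = c(.5, .5),
             p_item  = rbind(c(.6, .6, .6), c(.4, .4, .4)))
```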
We will start by arbitrarily setting initial values for these parameters, and then use these values to estimate the probabilities that each observation belongs to each component. Let’s start off by saying that components 1 and 2 are evenly apportioned in the sample, that within component 1 the probability of a 1 on each indicator variable is 60 percent, and that within component 2 it is 40 percent.
p_x1 <- .5 # share of component 1, iteration 1
p_x2 <- .5
y1_x1 <- .6 # probability of y1 for component 1, iteration 1
y2_x1 <- .6
y3_x1 <- .6
y1_x2 <- .4 # probability of y1 for component 2, iteration 1
y2_x2 <- .4
y3_x2 <- .4
params1 <- data.frame(estimate="current",
probs=c("P(X)","P(y1=1|X)","P(y2=1|X)", "P(y3=1|X)"),
iteration=1,
X1=c(.5,.6,.6,.6),
X2=c(.5,.4,.4,.4))
flextable(params1) %>%
set_header_labels(X1="X=1",
                  X2="X=2")

| estimate | probs | iteration | X=1 | X=2 |
|---|---|---|---|---|
| current | P(X) | 1 | 0.5 | 0.5 |
| current | P(y1=1|X) | 1 | 0.6 | 0.4 |
| current | P(y2=1|X) | 1 | 0.6 | 0.4 |
| current | P(y3=1|X) | 1 | 0.6 | 0.4 |
We now use the initial indicator parameters to estimate a probability of observing a given response pattern \(P(y_1, y_2, y_3|X=x)\), where the \(y_1, y_2, y_3\) are the observed patterns of responses in the sample, and the Xs are the current parameter estimates of the probability of each indicator variable for each component. If you’re trying to follow along, here are the observed indicator response pattern frequencies and initial parameter estimates.
freqs %>%
kable("html", align = 'clc', caption = 'Observed pattern of responses') %>%
kable_styling(full_width = F, position = "float_left")
params1 %>%
kable("html", align = 'clc', caption = 'Initial parameter values') %>%
kable_styling(full_width = F, position = "float_right")

| val | frq | perc |
|---|---|---|
| 1-1-1 | 696 | 0.406 |
| 1-1-2 | 68 | 0.040 |
| 1-2-1 | 275 | 0.160 |
| 1-2-2 | 130 | 0.076 |
| 2-1-1 | 34 | 0.020 |
| 2-1-2 | 19 | 0.011 |
| 2-2-1 | 125 | 0.073 |
| 2-2-2 | 366 | 0.214 |
| estimate | probs | iteration | X1 | X2 |
|---|---|---|---|---|
| current | P(X) | 1 | 0.5 | 0.5 |
| current | P(y1=1|X) | 1 | 0.6 | 0.4 |
| current | P(y2=1|X) | 1 | 0.6 | 0.4 |
| current | P(y3=1|X) | 1 | 0.6 | 0.4 |
The most common response pattern is 1-1-1, given by 41 percent of respondents.
Given our initial model parameters, under which each indicator has a 60 percent probability of being 1 within component 1 and a 40 percent probability within component 2, we would expect to see the pattern 1-1-1 \(.6*.6*.6\) = 21.6% of the time in component 1, and \(.4*.4*.4\) = 6.4% of the time in component 2. These values represent \(P(y_1, y_2, y_3|X=c)\).
Probability of response patterns, if they belonged to component 1.
x1_111 <- y1_x1*y2_x1*y3_x1
# probability of 1-1-1 belonging to component 1
x1_112 <- y1_x1*y2_x1*(1-y3_x1)
# probability of 1-1-2 belonging to component 1
x1_121 <- y1_x1*(1-y2_x1)*y3_x1
# probability of 1-2-1 belonging to component 1
x1_122 <- y1_x1*(1-y2_x1)*(1-y3_x1)
x1_211 <- (1-y1_x1)*y2_x1*y3_x1
x1_212 <- (1-y1_x1)*y2_x1*(1-y3_x1)
x1_221 <- (1-y1_x1)*(1-y2_x1)*y3_x1
x1_222 <- (1-y1_x1)*(1-y2_x1)*(1-y3_x1)
jointy_x1 <- c(x1_111, x1_112, x1_121, x1_122, x1_211, x1_212, x1_221, x1_222)

Probability of response patterns, if they belonged to component 2.
x2_111 <- y1_x2*y2_x2*y3_x2
# probability of 1-1-1 belonging to component 2
x2_112 <- y1_x2*y2_x2*(1-y3_x2)
# probability of 1-1-2 belonging to component 2
x2_121 <- y1_x2*(1-y2_x2)*y3_x2
# probability of 1-2-1 belonging to component 2
x2_122 <- y1_x2*(1-y2_x2)*(1-y3_x2)
x2_211 <- (1-y1_x2)*y2_x2*y3_x2
x2_212 <- (1-y1_x2)*y2_x2*(1-y3_x2)
x2_221 <- (1-y1_x2)*(1-y2_x2)*y3_x2
x2_222 <- (1-y1_x2)*(1-y2_x2)*(1-y3_x2)
jointy_x2 <- c(x2_111, x2_112, x2_121, x2_122, x2_211, x2_212, x2_221, x2_222)
freqs1 <- freqs %>%
mutate(iteration=1,
joint_x1=jointy_x1,
joint_x2=jointy_x2)
flextable(freqs1)

| val | frq | perc | iteration | joint_x1 | joint_x2 |
|---|---|---|---|---|---|
| 1-1-1 | 696 | 0.406 | 1 | 0.216 | 0.064 |
| 1-1-2 | 68 | 0.040 | 1 | 0.144 | 0.096 |
| 1-2-1 | 275 | 0.160 | 1 | 0.144 | 0.096 |
| 1-2-2 | 130 | 0.076 | 1 | 0.096 | 0.144 |
| 2-1-1 | 34 | 0.020 | 1 | 0.144 | 0.096 |
| 2-1-2 | 19 | 0.011 | 1 | 0.096 | 0.144 |
| 2-2-1 | 125 | 0.073 | 1 | 0.096 | 0.144 |
| 2-2-2 | 366 | 0.214 | 1 | 0.064 | 0.216 |
For each response pattern, we multiply the posited item probabilities together to get the joint probability of observing that pattern if the posited parameter values were true. We do this separately for each posited component.
We then invoke Assumption 16 to estimate the observed frequency of each response pattern as a weighted mixture of the two components.
6 \[ P(y_1, y_2, y_3)=P(X=1)P(y_1, y_2, y_3|X=1) + P(X=2)P(y_1, y_2, y_3|X=2) \]
freqs1 <- freqs1 %>%
mutate(joint_y=joint_x1*p_x1 + joint_x2*p_x2)
flextable(freqs1)

| val | frq | perc | iteration | joint_x1 | joint_x2 | joint_y |
|---|---|---|---|---|---|---|
| 1-1-1 | 696 | 0.406 | 1 | 0.216 | 0.064 | 0.14 |
| 1-1-2 | 68 | 0.040 | 1 | 0.144 | 0.096 | 0.12 |
| 1-2-1 | 275 | 0.160 | 1 | 0.144 | 0.096 | 0.12 |
| 1-2-2 | 130 | 0.076 | 1 | 0.096 | 0.144 | 0.12 |
| 2-1-1 | 34 | 0.020 | 1 | 0.144 | 0.096 | 0.12 |
| 2-1-2 | 19 | 0.011 | 1 | 0.096 | 0.144 | 0.12 |
| 2-2-1 | 125 | 0.073 | 1 | 0.096 | 0.144 | 0.12 |
| 2-2-2 | 366 | 0.214 | 1 | 0.064 | 0.216 | 0.14 |
If the parameter estimates are true, then the generated joint probability of the indicator variables (joint_y) should match the observed joint probabilities of the indicator variables (perc). After this first iteration, they are not close.
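As a rough check on that gap (a small sketch, not part of the original walkthrough), we can sum the absolute differences between the model-implied and observed pattern probabilities; a smaller value indicates a closer fit.

```r
# Total absolute gap between model-implied (joint_y) and observed (perc)
# pattern probabilities after iteration 1.
sum(abs(freqs1$joint_y - freqs1$perc))
```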
The steps we have taken so far give us the predicted joint probability of observing a pattern of indicator variables if the posited parameter values are true, which may be expressed as \(P(y|X=c)\). Recall from the introduction of the EM algorithm that what we are seeking to estimate in the E step is \(P(X=c|y)\): the probability of belonging to each component c, given the observed response pattern of the indicator variables.
We can apply Bayes’ theorem to recover our desired quantity.
\[ P(X=c|y_1, .., y_j)=\frac{P(X=c)P(y_1, .., y_j|X=c)}{P(y_1, .., y_j)} \]
freqs1 <- freqs1 %>%
mutate(x1_y=(p_x1*joint_x1)/joint_y,
x2_y=(p_x2*joint_x2)/joint_y)
flextable(freqs1)

| val | frq | perc | iteration | joint_x1 | joint_x2 | joint_y | x1_y | x2_y |
|---|---|---|---|---|---|---|---|---|
| 1-1-1 | 696 | 0.406 | 1 | 0.216 | 0.064 | 0.14 | 0.771 | 0.229 |
| 1-1-2 | 68 | 0.040 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 |
| 1-2-1 | 275 | 0.160 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 |
| 1-2-2 | 130 | 0.076 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 |
| 2-1-1 | 34 | 0.020 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 |
| 2-1-2 | 19 | 0.011 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 |
| 2-2-1 | 125 | 0.073 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 |
| 2-2-2 | 366 | 0.214 | 1 | 0.064 | 0.216 | 0.14 | 0.229 | 0.771 |
Using the values of our initial parameter estimates, we have generated estimates of the probability of the respondent belonging to a given component, given their response pattern on the indicator variables. We can now use these estimated probabilities to estimate the number of respondents we expect to belong to each component.
freqs1 <- freqs1 %>%
mutate(frq_x1=x1_y*frq,
frq_x2=x2_y*frq)
flextable(freqs1)

| val | frq | perc | iteration | joint_x1 | joint_x2 | joint_y | x1_y | x2_y | frq_x1 | frq_x2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1-1-1 | 696 | 0.406 | 1 | 0.216 | 0.064 | 0.14 | 0.771 | 0.229 | 536.9 | 159.1 |
| 1-1-2 | 68 | 0.040 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 | 40.8 | 27.2 |
| 1-2-1 | 275 | 0.160 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 | 165.0 | 110.0 |
| 1-2-2 | 130 | 0.076 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 | 52.0 | 78.0 |
| 2-1-1 | 34 | 0.020 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 | 20.4 | 13.6 |
| 2-1-2 | 19 | 0.011 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 | 7.6 | 11.4 |
| 2-2-1 | 125 | 0.073 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 | 50.0 | 75.0 |
| 2-2-2 | 366 | 0.214 | 1 | 0.064 | 0.216 | 0.14 | 0.229 | 0.771 | 83.7 | 282.3 |
Now that we have the expected frequencies of cases exhibiting each response pattern within each component, we sum these expected counts to obtain the totals needed to re-estimate the parameters.
x1_freqs <- sum(freqs1$frq_x1)   # expected total count in component 1
x2_freqs <- sum(freqs1$frq_x2)   # expected total count in component 2
# columns 10 and 11 are frq_x1 and frq_x2; rows 1:4 are the patterns with y1 = 1
f_y1_c1 <- sum(freqs1[1:4,10])
f_y1_c2 <- sum(freqs1[1:4,11])
# rows 1:2 and 5:6 are the patterns with y2 = 1
f_y2_c1 <- sum(freqs1[c(1:2, 5:6),10])
f_y2_c2 <- sum(freqs1[c(1:2, 5:6), 11])
# rows 1, 3, 5, 7 are the patterns with y3 = 1
f_y3_c1 <- sum(freqs1[c(1,3,5,7), 10])
f_y3_c2 <- sum(freqs1[c(1,3,5,7), 11])
newfreqs <- data.frame(estimate="new",
probs=params1$probs,
iteration=1) %>%
mutate(X1=c(x1_freqs,
f_y1_c1,
f_y2_c1,
f_y3_c1),
X2=c(x2_freqs,
f_y1_c2,
f_y2_c2,
f_y3_c2))
newfreqs_flex <- newfreqs %>%
flextable()
newfreqs_flex

| estimate | probs | iteration | X1 | X2 |
|---|---|---|---|---|
| new | P(X) | 1 | 956 | 757 |
| new | P(y1=1|X) | 1 | 795 | 374 |
| new | P(y2=1|X) | 1 | 606 | 211 |
| new | P(y3=1|X) | 1 | 772 | 358 |
Now we use these predicted frequency counts to generate new parameter probabilities.
pnew_x1 <- x1_freqs / (x1_freqs + x2_freqs)   # new estimate of P(X=1)
pnew_x2 <- x2_freqs / (x1_freqs + x2_freqs)   # new estimate of P(X=2)
# item probabilities for component 1
y1_x1 <- f_y1_c1 / x1_freqs
y2_x1 <- f_y2_c1 / x1_freqs
y3_x1 <- f_y3_c1 / x1_freqs
# item probabilities for component 2
y1_x2 <- f_y1_c2 / x2_freqs
y2_x2 <- f_y2_c2 / x2_freqs
y3_x2 <- f_y3_c2 / x2_freqs

Arrange new model parameters into a table.
new <- data.frame(estimate="new",
probs=params1$probs,
iteration=1) %>%
mutate(X1=c(pnew_x1,
y1_x1,
y2_x1,
y3_x1),
X2=c(pnew_x2,
y1_x2,
y2_x2,
y3_x2))
#flextable(new)
params1 <- bind_rows(params1, new)
flextable(params1)

| estimate | probs | iteration | X1 | X2 |
|---|---|---|---|---|
| current | P(X) | 1 | 0.500 | 0.500 |
| current | P(y1=1|X) | 1 | 0.600 | 0.400 |
| current | P(y2=1|X) | 1 | 0.600 | 0.400 |
| current | P(y3=1|X) | 1 | 0.600 | 0.400 |
| new | P(X) | 1 | 0.558 | 0.442 |
| new | P(y1=1|X) | 1 | 0.831 | 0.495 |
| new | P(y2=1|X) | 1 | 0.633 | 0.279 |
| new | P(y3=1|X) | 1 | 0.808 | 0.473 |
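To see where these new values come from, each entry in the new rows is the corresponding expected frequency count from the previous table divided by the relevant total, for example

\[ P(X=1)=\frac{956}{956+757} \approx 0.558, \qquad P(y_1=1|X=1)=\frac{795}{956} \approx 0.831 \]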
Comparing the initial and new estimates of the model parameters, we see that the estimated share of component 1 in the sample increases to 56 percent, while the estimated share of component 2 decreases to 44 percent. The estimated probabilities of agreement increased for all three indicators in component 1, while in component 2 they moved less uniformly, with the probability of agreement on y2 falling sharply.
What are we to make of this? There exists a proof that these new estimates describe the observed data better. We could verify this directly if we had known the hidden labels beforehand and blinded them for this analysis; but without that hidden knowledge, how do we know?
We can compute the log likelihood of the observed response patterns at each iteration to see whether the fit to the data is improving. It can be shown that the log likelihood is guaranteed not to decrease from one iteration to the next.
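Concretely, with \(n_r\) the observed count of response pattern \(r\) and \(P(y^{(r)})\) its model-implied probability (the joint_y column above), the quantity computed below is

\[ \ell = \sum_r n_r \log P(y^{(r)}) \]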
freqs1 <- freqs1 %>%
mutate(ll=frq*log(joint_y))
flextable(freqs1)

| val | frq | perc | iteration | joint_x1 | joint_x2 | joint_y | x1_y | x2_y | frq_x1 | frq_x2 | ll |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-1-1 | 696 | 0.406 | 1 | 0.216 | 0.064 | 0.14 | 0.771 | 0.229 | 536.9 | 159.1 | -1,368.4 |
| 1-1-2 | 68 | 0.040 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 | 40.8 | 27.2 | -144.2 |
| 1-2-1 | 275 | 0.160 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 | 165.0 | 110.0 | -583.1 |
| 1-2-2 | 130 | 0.076 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 | 52.0 | 78.0 | -275.6 |
| 2-1-1 | 34 | 0.020 | 1 | 0.144 | 0.096 | 0.12 | 0.600 | 0.400 | 20.4 | 13.6 | -72.1 |
| 2-1-2 | 19 | 0.011 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 | 7.6 | 11.4 | -40.3 |
| 2-2-1 | 125 | 0.073 | 1 | 0.096 | 0.144 | 0.12 | 0.400 | 0.600 | 50.0 | 75.0 | -265.0 |
| 2-2-2 | 366 | 0.214 | 1 | 0.064 | 0.216 | 0.14 | 0.229 | 0.771 | 83.7 | 282.3 | -719.6 |
ll_1 <- sum(freqs1$ll)
ll_1
[1] -3468
Let’s now go through the iteration again, using the new model parameter estimates.
params2 <- new %>%
mutate(estimate="current",
iteration=2)
flextable(params2)

| estimate | probs | iteration | X1 | X2 |
|---|---|---|---|---|
| current | P(X) | 2 | 0.558 | 0.442 |
| current | P(y1=1|X) | 2 | 0.831 | 0.495 |
| current | P(y2=1|X) | 2 | 0.633 | 0.279 |
| current | P(y3=1|X) | 2 | 0.808 | 0.473 |
p_x1 <- params2[1,4] # share of component 1
p_x2 <- params2[1,5]
y1_x1 <- params2[2,4] # probability of y1 for component 1
y2_x1 <- params2[3,4]
y3_x1 <- params2[4,4]
y1_x2 <- params2[2,5] # probability of y1 for component 2
y2_x2 <- params2[3,5]
y3_x2 <- params2[4,5]

Probability of response patterns, if they belonged to component 1.
x1_111 <- y1_x1*y2_x1*y3_x1
# probability of 1-1-1 belonging to component 1
x1_112 <- y1_x1*y2_x1*(1-y3_x1)
# probability of 1-1-2 belonging to component 1
x1_121 <- y1_x1*(1-y2_x1)*y3_x1
# probability of 1-2-1 belonging to component 1
x1_122 <- y1_x1*(1-y2_x1)*(1-y3_x1)
x1_211 <- (1-y1_x1)*y2_x1*y3_x1
x1_212 <- (1-y1_x1)*y2_x1*(1-y3_x1)
x1_221 <- (1-y1_x1)*(1-y2_x1)*y3_x1
x1_222 <- (1-y1_x1)*(1-y2_x1)*(1-y3_x1)
jointy_x1 <- c(x1_111, x1_112, x1_121, x1_122, x1_211, x1_212, x1_221, x1_222)

Probability of response patterns, if they belonged to component 2.
x2_111 <- y1_x2*y2_x2*y3_x2
# probability of 1-1-1 belonging to component 2
x2_112 <- y1_x2*y2_x2*(1-y3_x2)
# probability of 1-1-2 belonging to component 2
x2_121 <- y1_x2*(1-y2_x2)*y3_x2
# probability of 1-2-1 belonging to component 2
x2_122 <- y1_x2*(1-y2_x2)*(1-y3_x2)
x2_211 <- (1-y1_x2)*y2_x2*y3_x2
x2_212 <- (1-y1_x2)*y2_x2*(1-y3_x2)
x2_221 <- (1-y1_x2)*(1-y2_x2)*y3_x2
x2_222 <- (1-y1_x2)*(1-y2_x2)*(1-y3_x2)
jointy_x2 <- c(x2_111, x2_112, x2_121, x2_122, x2_211, x2_212, x2_221, x2_222)
freqs2 <- freqs %>%
mutate(iteration=2,
joint_x1=jointy_x1,
joint_x2=jointy_x2,
joint_y=joint_x1*p_x1 + joint_x2*p_x2,
x1_y=(p_x1*joint_x1)/joint_y,
x2_y=(p_x2*joint_x2)/joint_y,
frq_x1=x1_y*frq,
frq_x2=x2_y*frq)
flextable(freqs2)

| val | frq | perc | iteration | joint_x1 | joint_x2 | joint_y | x1_y | x2_y | frq_x1 | frq_x2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1-1-1 | 696 | 0.406 | 2 | 0.4250 | 0.0653 | 0.2661 | 0.8916 | 0.108 | 620.56 | 75.4 |
| 1-1-2 | 68 | 0.040 | 2 | 0.1329 | 0.1290 | 0.1312 | 0.5656 | 0.434 | 38.46 | 29.5 |
| 1-2-1 | 275 | 0.160 | 2 | 0.2460 | 0.1685 | 0.2118 | 0.6485 | 0.351 | 178.34 | 96.7 |
| 1-2-2 | 130 | 0.076 | 2 | 0.0586 | 0.1880 | 0.1158 | 0.2828 | 0.717 | 36.76 | 93.2 |
| 2-1-1 | 34 | 0.020 | 2 | 0.0865 | 0.0667 | 0.0777 | 0.6209 | 0.379 | 21.11 | 12.9 |
| 2-1-2 | 19 | 0.011 | 2 | 0.0206 | 0.0744 | 0.0444 | 0.2593 | 0.741 | 4.93 | 14.1 |
| 2-2-1 | 125 | 0.073 | 2 | 0.0500 | 0.1722 | 0.1040 | 0.2687 | 0.731 | 33.59 | 91.4 |
| 2-2-2 | 366 | 0.214 | 2 | 0.0119 | 0.1920 | 0.0915 | 0.0728 | 0.927 | 26.64 | 339.4 |
x1_freqs <- sum(freqs2$frq_x1)   # expected total count in component 1
x2_freqs <- sum(freqs2$frq_x2)   # expected total count in component 2
# columns 10 and 11 are frq_x1 and frq_x2; rows 1:4 are the patterns with y1 = 1
f_y1_c1 <- sum(freqs2[1:4,10])
f_y1_c2 <- sum(freqs2[1:4,11])
# rows 1:2 and 5:6 are the patterns with y2 = 1
f_y2_c1 <- sum(freqs2[c(1:2, 5:6),10])
f_y2_c2 <- sum(freqs2[c(1:2, 5:6), 11])
# rows 1, 3, 5, 7 are the patterns with y3 = 1
f_y3_c1 <- sum(freqs2[c(1,3,5,7), 10])
f_y3_c2 <- sum(freqs2[c(1,3,5,7), 11])
newfreqs <- data.frame(estimate="new",
probs=params2$probs,
iteration=2) %>%
mutate(X1=c(x1_freqs,
f_y1_c1,
f_y2_c1,
f_y3_c1),
X2=c(x2_freqs,
f_y1_c2,
f_y2_c2,
f_y3_c2))
newfreqs_flex <- newfreqs %>%
flextable()
newfreqs_flex

| estimate | probs | iteration | X1 | X2 |
|---|---|---|---|---|
| new | P(X) | 2 | 960 | 753 |
| new | P(y1=1|X) | 2 | 874 | 295 |
| new | P(y2=1|X) | 2 | 685 | 132 |
| new | P(y3=1|X) | 2 | 854 | 276 |
pnew_x1 <- x1_freqs / (x1_freqs + x2_freqs)   # new estimate of P(X=1)
pnew_x2 <- x2_freqs / (x1_freqs + x2_freqs)   # new estimate of P(X=2)
# item probabilities for component 1
y1_x1 <- f_y1_c1 / x1_freqs
y2_x1 <- f_y2_c1 / x1_freqs
y3_x1 <- f_y3_c1 / x1_freqs
# item probabilities for component 2
y1_x2 <- f_y1_c2 / x2_freqs
y2_x2 <- f_y2_c2 / x2_freqs
y3_x2 <- f_y3_c2 / x2_freqs

Arrange new model parameters into a table.
new <- data.frame(estimate="new",
probs=params1$probs[1:4],
iteration=2) %>%
mutate(X1=c(pnew_x1,
y1_x1,
y2_x1,
y3_x1),
X2=c(pnew_x2,
y1_x2,
y2_x2,
y3_x2))
#flextable(new)
params2 <- bind_rows(params2, new)
flextable(params2)

| estimate | probs | iteration | X1 | X2 |
|---|---|---|---|---|
| current | P(X) | 2 | 0.558 | 0.442 |
| current | P(y1=1|X) | 2 | 0.831 | 0.495 |
| current | P(y2=1|X) | 2 | 0.633 | 0.279 |
| current | P(y3=1|X) | 2 | 0.808 | 0.473 |
| new | P(X) | 2 | 0.561 | 0.439 |
| new | P(y1=1|X) | 2 | 0.910 | 0.392 |
| new | P(y2=1|X) | 2 | 0.713 | 0.175 |
| new | P(y3=1|X) | 2 | 0.889 | 0.367 |
As with iteration 1, the estimated share of component 1 has increased, while the share of component 2 has decreased. Agreement on the items is more strongly associated with component 1 than with component 2.
Now let’s look at the log likelihood.
freqs2 <- freqs2 %>%
mutate(ll=frq*log(joint_y))
flextable(freqs2)

| val | frq | perc | iteration | joint_x1 | joint_x2 | joint_y | x1_y | x2_y | frq_x1 | frq_x2 | ll |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-1-1 | 696 | 0.406 | 2 | 0.4250 | 0.0653 | 0.2661 | 0.8916 | 0.108 | 620.56 | 75.4 | -921.4 |
| 1-1-2 | 68 | 0.040 | 2 | 0.1329 | 0.1290 | 0.1312 | 0.5656 | 0.434 | 38.46 | 29.5 | -138.1 |
| 1-2-1 | 275 | 0.160 | 2 | 0.2460 | 0.1685 | 0.2118 | 0.6485 | 0.351 | 178.34 | 96.7 | -426.8 |
| 1-2-2 | 130 | 0.076 | 2 | 0.0586 | 0.1880 | 0.1158 | 0.2828 | 0.717 | 36.76 | 93.2 | -280.3 |
| 2-1-1 | 34 | 0.020 | 2 | 0.0865 | 0.0667 | 0.0777 | 0.6209 | 0.379 | 21.11 | 12.9 | -86.9 |
| 2-1-2 | 19 | 0.011 | 2 | 0.0206 | 0.0744 | 0.0444 | 0.2593 | 0.741 | 4.93 | 14.1 | -59.2 |
| 2-2-1 | 125 | 0.073 | 2 | 0.0500 | 0.1722 | 0.1040 | 0.2687 | 0.731 | 33.59 | 91.4 | -282.9 |
| 2-2-2 | 366 | 0.214 | 2 | 0.0119 | 0.1920 | 0.0915 | 0.0728 | 0.927 | 26.64 | 339.4 | -875.3 |
ll_2 <- sum(freqs2$ll)
ll_1
[1] -3468
ll_2
[1] -3071
The overall log likelihood has increased, which means we are closer to convergence (maximization of the likelihood of the data).
These iterations continue, with each iteration guaranteed to increase the log likelihood of the model. When the iterations result in negligible increases to the log likelihood, the algorithm is said to have converged and we have arrived at a solution.
It’s possible that the algorithm has identified a local maximum rather than the global maximum. Run the algorithm from a few different random starting values to see whether you reach the same solution; if you do not, take the solution with the highest log likelihood.
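For reference, the manual steps above can be wrapped into a single loop and run from several random starting values. The sketch below is a compact re-implementation under the same assumptions (not the code used in the walkthrough); it works from the freqs table of pattern counts built earlier, and the function and argument names are my own.

```r
# Hedged sketch: two-class latent class EM on the pattern counts in `freqs`,
# run from several random starts; returns the solution with the highest log
# likelihood. Assumes freqs$val holds patterns like "1-1-1" and freqs$frq counts.
run_em <- function(freqs, n_starts = 5, max_iter = 500, tol = 1e-8, seed = 1) {
  set.seed(seed)
  pat <- matrix(as.numeric(do.call(rbind, strsplit(as.character(freqs$val), "-"))), ncol = 3)
  n   <- freqs$frq
  best <- NULL
  for (s in seq_len(n_starts)) {
    p_class <- runif(1, .3, .7); p_class <- c(p_class, 1 - p_class)
    p_item  <- matrix(runif(6, .2, .8), nrow = 2)        # P(y_j = 1 | X = c), 2 x 3
    ll_old  <- -Inf
    for (iter in seq_len(max_iter)) {
      # within-class joint probability of each pattern (local independence)
      joint <- sapply(1:2, function(k)
        apply(pat, 1, function(r) prod(ifelse(r == 1, p_item[k, ], 1 - p_item[k, ]))))
      mix  <- as.vector(joint %*% p_class)               # model-implied P(pattern)
      post <- sweep(joint, 2, p_class, "*") / mix        # E step: P(X = c | pattern)
      ll   <- sum(n * log(mix))
      if (ll - ll_old < tol) break
      ll_old <- ll
      # M step: expected pattern counts by class, then re-estimate the parameters
      frq_c   <- sweep(post, 1, n, "*")
      p_class <- colSums(frq_c) / sum(n)
      p_item  <- t(sapply(1:2, function(k) colSums(frq_c[, k] * (pat == 1)) / sum(frq_c[, k])))
    }
    if (is.null(best) || ll > best$ll)
      best <- list(ll = ll, p_class = p_class, p_item = p_item, iterations = iter)
  }
  best
}

# Usage sketch:
# fit <- run_em(freqs)
# fit$ll        # maximized log likelihood
# fit$p_class   # estimated P(X=1), P(X=2)
# fit$p_item    # estimated P(y_j=1|X=c); rows are components, columns are y1, y2, y3
```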