MSI Statistical Training Series

Expectation Maximization (EM)

Published

April 1, 2023

Background

The MSI Statistical Training Series offers tutorials on a range of topics that are commonly applied in data analysis. The objective is to i) refresh readers on the derivation and application of methods from a first statistics course, ii) provide underlying theory that goes beyond an introductory statistics course, iii) introduce readers to new methods that would not be covered in an introductory statistics course, and iv) provide practical, hands-on coding examples so that users of statistical analysis software can not only run statistical procedures, but also derive those procedures from the underlying theory.

Exploration of each topic should include the underlying theory, a demonstration using toy or fake data, and a real-world application in data that MSI has generated for a client deliverable.

Introduction to Expectation Maximization

The Expectation Maximization (EM) algorithm is a popular approach to maximum likelihood estimation with missing data. The seminal reference is (Dempster, Laird, and Rubin 1977), which is among the most frequently cited articles in the field of statistics.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data Via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.

The EM algorithm applies to problems of missing data, which has applications beyond the most direct use case of missing data imputation. For example, one may suspect that there is hidden structure in the data, and the missing data are labels identifying this structure (sex in a data set of heights; seasons in a data set of temperatures).

We may have data on an outcome Y which shows a bimodal distribution. We may suspect that this bimodal shape reflects the missing data of sex - or in this case a missing variable, or label, that designates which sex each outcome observation belongs to. Or, we may have a set of covariate values for an outcome, and particular patterns in the values of these covariates suggest the presence of a hidden, or latent, variable. This leads to the use of EM for the purpose of identifying these missing or latent variables, with one approach being that of Latent Class Analysis (LCA).
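As a quick illustration, here is a small simulation of the heights example, in which the hidden label (sex) generates a bimodal pooled distribution. The sample size, means, and standard deviations are invented for illustration:

set.seed(1)
n   <- 1000
sex <- rbinom(n, 1, 0.5)                      # the hidden label
height <- ifelse(sex == 1,
                 rnorm(n, mean = 178, sd = 7),  # illustrative values
                 rnorm(n, mean = 164, sd = 7))
hist(height, breaks = 40)                     # pooled data look bimodal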

The EM algorithm comprises two steps. First, we compute the expected value of the missing data, given the values of the observed data and the ‘current’ parameter values.1 For a latent class model, this means computing \(P(X=c|y_1, \ldots, y_J)\) using the current parameter estimates. Second, we re-estimate the model probabilities \(P(X=c)\) and \(P(y_j|X=c)\), treating class membership as if it were observed and using the \(P(X=c|y_1, \ldots, y_J)\) as weights. Remarkably, each iteration of this process improves the estimates until the algorithm converges to a solution.2

1 ‘Current’ here denotes that we start with a guess of the parameter values, then subsequently use the updated values generated by the previous iteration of the EM algorithm.

2 A proof exists, but I don’t yet understand intuitively how this is possible.
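To make the E step concrete, here is the computation for a single response pattern, using toy values that will reappear for the 1-1-1 pattern in the demonstration below:

p_class <- c(0.5, 0.5)        # current guesses for P(X=c)
lik     <- c(0.216, 0.064)    # current P(pattern | X=c) for one pattern
post    <- p_class * lik / sum(p_class * lik)   # Bayes' rule
post                          # approx. 0.771 0.229: posterior membership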

EM Demonstration Using LCA

We will demonstrate an application of the EM algorithm using Latent Class Analysis.

Consider a survey asking respondents about their attitudes toward the limits of tolerance of anti-religious sentiment.3 The questions ask whether anti-religionists should be allowed to speak (y1), whether anti-religionists should be allowed to teach (y2), and whether anti-religious books ought to be allowed in public libraries (y3). For each item, 1 indicates agreement (allowed to speak, allowed to teach, allowed to keep books in the library) and is interpreted as high tolerance of anti-religious sentiment, while 2 indicates disagreement and is interpreted as low tolerance of anti-religious sentiment.

3 These data come from the 1987 General Social Survey (GSS). See the gssr package for data on all years.

library(tidyverse)    # read_csv() and dplyr verbs
library(flextable)    # formatted tables
library(sjmisc)       # frq()
library(sjPlot)       # tab_stackfrq()
library(knitr)        # kable()
library(kableExtra)   # kable_styling()

d <- read_csv("gss87 antirel.csv")
flextable(d[1:5,])

y1   y2   y3   cum
1    1    1    1-1-1
1    1    1    1-1-1
1    2    2    1-2-2
1    1    1    1-1-1
1    1    1    1-1-1

    #| output: asis
    
    tab_stackfrq(d[,1:3],
                 digits=1)
         1        2
y1   68.2 %   31.8 %
y2   47.7 %   52.3 %
y3   66.0 %   34.0 %

About two-thirds of respondents agreed with items one and three, while item two was roughly evenly split. Perhaps there are two broad groups of respondents: those who are generally tolerant of anti-religious sentiment, and those who are not. We will now go through a series of steps to implement the EM algorithm and identify these latent classes.

    1. Tabulate the response patterns

We are interested in using the patterns of responses on the y1, y2, y3 variables to identify these two broad classes of respondents, so we create a new variable that captures each respondent's response pattern. We can then tabulate the frequencies of these patterns.

freqs <- data.frame(frq(d$cum)) %>%
  .[1:8, c(2,4,5)] %>%              # keep the 8 response patterns and the needed columns
  mutate(perc=round(raw.prc/100,3)) %>%
  dplyr::select(-raw.prc)
    
    flextable(freqs)

val     frq   perc
1-1-1   696   0.406
1-1-2    68   0.040
1-2-1   275   0.160
1-2-2   130   0.076
2-1-1    34   0.020
2-1-2    19   0.011
2-2-1   125   0.073
2-2-2   366   0.214

We suspect that the pattern of responses on the three indicator variables arises from the existence of two components, or latent classes, in the data. We will call this latent variable X, which can take on values of component 1 (X=1) or component 2 (X=2). We will refer to the observed indicator variables as \(y_1, y_2, \ldots, y_J\).

We want to use the observed frequencies of the response patterns to infer the existence of the components X. To do that, we posit a data-generating process for the patterns tabulated above, a model of \(P(y_1, y_2, y_3)\). There are two key assumptions in this model:

1. The observed frequencies of the response patterns \(y_1, y_2, y_3\) are the result of a mixture of two separate components, each with its own distribution. If this were true, then the observed frequency could be expressed as

      \[ P(y_1, y_2, y_3)=P(X=1)P(y_1, y_2, y_3|X=1) + P(X=2)P(y_1, y_2, y_3|X=2) \tag{1}\]

Note that the probabilities of the indicator variables within each component,4 as well as the share of each component in the sample,5 are not observed - they are the parameters of the model.

2. Within each component, responses are independent (local independence): knowing \(y_1\) offers no information about \(y_2\). If this is true, then the probability of observing a given pattern of responses is simply the product of the individual item probabilities.

      \[ P(y_1,y_2,y_3|X=1)=P(y_1|X=1)P(y_2|X=1)P(y_3|X=1) \tag{2}\]

Again, note that the notation above refers not to observed probabilities, but to the parameter probabilities of the model.

4 \(P(y_j|X=c)\)

5 \(P(X=c)\)

The above model assumptions identify two sets of parameters that generate our observed information. The first is the proportion of the sample belonging to each component: \(P(X=1)\) and \(P(X=2)\) from model assumption (1) above. The second is the probability of each indicator variable (the y1, y2, y3) within each component: \(P(y_1=1|X=c)\), \(P(y_2=1|X=c)\), and \(P(y_3=1|X=c)\) for \(c=1,2\). Knowing these parameter values allows us to generate the joint probability of any pattern of responses on the indicator variables, as described in model assumption (2).

    2. Set initial values of model parameters

We will start by arbitrarily setting initial values for these parameters, and then use these values to estimate the probability that each observation belongs to each component. Let's say that components 1 and 2 are evenly apportioned in the sample, and that the probability of a 1 response on each indicator variable is 60 percent within component 1 and 40 percent within component 2.

    p_x1 <- .5 # share of component 1, iteration 1
    p_x2 <- .5
    
    y1_x1 <- .6 # probability of y1 for component 1, iteration 1
    y2_x1 <- .6
    y3_x1 <- .6
    
    y1_x2 <- .4 # probability of y1 for component 2, iteration 1
    y2_x2 <- .4
    y3_x2 <- .4
    
    params1 <- data.frame(estimate="current",
                          probs=c("P(X)","P(y1=1|X)","P(y2=1|X)", "P(y3=1|X)"),
                          iteration=1,
                          X1=c(.5,.6,.6,.6),
                          X2=c(.5,.4,.4,.4))
    
    flextable(params1) %>%
      set_header_labels(X1="X=1",
                        X2="X=2")

estimate   probs       iteration   X=1   X=2
current    P(X)            1       0.5   0.5
current    P(y1=1|X)       1       0.6   0.4
current    P(y2=1|X)       1       0.6   0.4
current    P(y3=1|X)       1       0.6   0.4

    3. Estimate probabilities of response patterns

We now use the initial indicator parameters to estimate the probability of observing a given response pattern, \(P(y_1, y_2, y_3|X=c)\), where \(y_1, y_2, y_3\) are the observed response patterns in the sample and the conditional probabilities come from the current parameter estimates for each component. If you're trying to follow along, here are the observed indicator response pattern frequencies and the initial parameter estimates.

    freqs %>%
      kable("html", align = 'clc', caption = 'Observed pattern of responses') %>%
        kable_styling(full_width = F, position = "float_left")
    
    params1 %>%
      kable("html", align = 'clc', caption = 'Initial parameter values') %>%
        kable_styling(full_width = F, position = "float_right")
Observed pattern of responses

val     frq   perc
1-1-1   696   0.406
1-1-2    68   0.040
1-2-1   275   0.160
1-2-2   130   0.076
2-1-1    34   0.020
2-1-2    19   0.011
2-2-1   125   0.073
2-2-2   366   0.214

Initial parameter values

estimate   probs       iteration   X1    X2
current    P(X)            1       0.5   0.5
current    P(y1=1|X)       1       0.6   0.4
current    P(y2=1|X)       1       0.6   0.4
current    P(y3=1|X)       1       0.6   0.4

    The most common response pattern is 1-1-1, which occurred in 41 percent of respondents.

Given our initial model parameters, where each indicator takes the value 1 with probability 60 percent within component 1 and 40 percent within component 2, we would expect to see the pattern 1-1-1 \(.6*.6*.6\) = 21.6% of the time in component 1, and \(.4*.4*.4\) = 6.4% of the time in component 2. These estimates represent \(P(y_1, y_2, y_3|X=c)\).

    Probability of response patterns, if they belonged to component 1.

    x1_111 <- y1_x1*y2_x1*y3_x1 
    # probability of 1-1-1 belonging to component 1
x1_112 <- y1_x1*y2_x1*(1-y3_x1)
    # probability of 1-1-2 belonging to component 1
    x1_121 <- y1_x1*(1-y2_x1)*y3_x1
    # probability of 1-2-1 belonging to component 1
    x1_122 <- y1_x1*(1-y2_x1)*(1-y3_x1)
    x1_211 <- (1-y1_x1)*y2_x1*y3_x1
    x1_212 <- (1-y1_x1)*y2_x1*(1-y3_x1)
    x1_221 <- (1-y1_x1)*(1-y2_x1)*(y3_x1)
    x1_222 <- (1-y1_x1)*(1-y2_x1)*(1-y3_x1)
    
    jointy_x1 <- c(x1_111, x1_112, x1_121, x1_122, x1_211, x1_212, x1_221, x1_222)

    Probability of response patterns, if they belonged to component 2.

x2_111 <- y1_x2*y2_x2*y3_x2 
# probability of 1-1-1 belonging to component 2
x2_112 <- y1_x2*y2_x2*(1-y3_x2)
# probability of 1-1-2 belonging to component 2
x2_121 <- y1_x2*(1-y2_x2)*y3_x2
# probability of 1-2-1 belonging to component 2
    x2_122 <- y1_x2*(1-y2_x2)*(1-y3_x2)
    x2_211 <- (1-y1_x2)*y2_x2*y3_x2
    x2_212 <- (1-y1_x2)*y2_x2*(1-y3_x2)
    x2_221 <- (1-y1_x2)*(1-y2_x2)*(y3_x2)
    x2_222 <- (1-y1_x2)*(1-y2_x2)*(1-y3_x2)
    
    jointy_x2 <- c(x2_111, x2_112, x2_121, x2_122, x2_211, x2_212, x2_221, x2_222)
    
    freqs1 <- freqs %>%
      mutate(iteration=1,
             joint_x1=jointy_x1,
             joint_x2=jointy_x2)
    
    flextable(freqs1)

val     frq   perc    iteration   joint_x1   joint_x2
1-1-1   696   0.406       1         0.216      0.064
1-1-2    68   0.040       1         0.144      0.096
1-2-1   275   0.160       1         0.144      0.096
1-2-2   130   0.076       1         0.096      0.144
2-1-1    34   0.020       1         0.144      0.096
2-1-2    19   0.011       1         0.096      0.144
2-2-1   125   0.073       1         0.096      0.144
2-2-2   366   0.214       1         0.064      0.216

For each response pattern, we multiply the posited model probabilities together to get the joint probability of observing that pattern, if the posited parameter values were true. We do that for each posited component.

We then invoke Assumption 1 to estimate the observed frequency of each response pattern as a mixture of the two components.6

6 \[ P(y_1, y_2, y_3)=P(X=1)P(y_1, y_2, y_3|X=1) + P(X=2)P(y_1, y_2, y_3|X=2) \]

freqs1 <- freqs1 %>%
  mutate(joint_y=joint_x1*p_x1 + joint_x2*p_x2)
    
    flextable(freqs1)

val     frq   perc    iteration   joint_x1   joint_x2   joint_y
1-1-1   696   0.406       1         0.216      0.064      0.14
1-1-2    68   0.040       1         0.144      0.096      0.12
1-2-1   275   0.160       1         0.144      0.096      0.12
1-2-2   130   0.076       1         0.096      0.144      0.12
2-1-1    34   0.020       1         0.144      0.096      0.12
2-1-2    19   0.011       1         0.096      0.144      0.12
2-2-1   125   0.073       1         0.096      0.144      0.12
2-2-2   366   0.214       1         0.064      0.216      0.14

If the parameter estimates were correct, the model-generated joint probability of the indicator variables (joint_y) would match the observed joint probability of the indicator variables (perc). After this first iteration, they are not close.
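One simple way to quantify this gap is to total the absolute differences between the model-implied and observed pattern probabilities:

# a crude summary of model fit: total absolute gap between model-implied
# (joint_y) and observed (perc) pattern probabilities
sum(abs(freqs1$joint_y - freqs1$perc))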

4. Estimate component membership probabilities for each response pattern

The steps we have taken so far have given us the predicted joint probability of observing a pattern of indicator variables, if the posited parameter values are true. This may be expressed as \(P(y|X=c)\). Recall from the introduction that what we seek in the E step is \(P(X=c|y)\): an assignment of a probability of belonging to each component c, given the observed response pattern on the indicator variables.

We can apply Bayes’ theorem to recover our desired quantity.

\[ P(X=c|y_1, \ldots, y_J)=\frac{P(X=c)P(y_1, \ldots, y_J|X=c)}{P(y_1, \ldots, y_J)} \]

For reference, here are the current parameter values:

estimate   probs       iteration   X1    X2
current    P(X)            1       0.5   0.5
current    P(y1=1|X)       1       0.6   0.4
current    P(y2=1|X)       1       0.6   0.4
current    P(y3=1|X)       1       0.6   0.4

    freqs1 <- freqs1 %>%
      mutate(x1_y=(p_x1*joint_x1)/joint_y,
             x2_y=(p_x2*joint_x2)/joint_y)
    
    flextable(freqs1)

val     frq   perc    iteration   joint_x1   joint_x2   joint_y   x1_y    x2_y
1-1-1   696   0.406       1         0.216      0.064      0.14    0.771   0.229
1-1-2    68   0.040       1         0.144      0.096      0.12    0.600   0.400
1-2-1   275   0.160       1         0.144      0.096      0.12    0.600   0.400
1-2-2   130   0.076       1         0.096      0.144      0.12    0.400   0.600
2-1-1    34   0.020       1         0.144      0.096      0.12    0.600   0.400
2-1-2    19   0.011       1         0.096      0.144      0.12    0.400   0.600
2-2-1   125   0.073       1         0.096      0.144      0.12    0.400   0.600
2-2-2   366   0.214       1         0.064      0.216      0.14    0.229   0.771

    5. Estimate frequency counts of estimated components

    Using the values of our initial parameter estimates, we have generated estimates of the probability of the respondent belonging to a given component, given their response pattern on the indicator variables. We can now use these estimated probabilities to estimate the number of respondents we expect to belong to each component.

    freqs1 <- freqs1 %>%
      mutate(frq_x1=x1_y*frq,
             frq_x2=x2_y*frq)
    
    flextable(freqs1)

val     frq   perc    iteration   joint_x1   joint_x2   joint_y   x1_y    x2_y    frq_x1   frq_x2
1-1-1   696   0.406       1         0.216      0.064      0.14    0.771   0.229    536.9    159.1
1-1-2    68   0.040       1         0.144      0.096      0.12    0.600   0.400     40.8     27.2
1-2-1   275   0.160       1         0.144      0.096      0.12    0.600   0.400    165.0    110.0
1-2-2   130   0.076       1         0.096      0.144      0.12    0.400   0.600     52.0     78.0
2-1-1    34   0.020       1         0.144      0.096      0.12    0.600   0.400     20.4     13.6
2-1-2    19   0.011       1         0.096      0.144      0.12    0.400   0.600      7.6     11.4
2-2-1   125   0.073       1         0.096      0.144      0.12    0.400   0.600     50.0     75.0
2-2-2   366   0.214       1         0.064      0.216      0.14    0.229   0.771     83.7    282.3

Now that we have the predicted frequencies of cases exhibiting each response pattern within each component, we can aggregate these counts into the overall frequency counts needed to update the parameters.

x1_freqs <- sum(freqs1$frq_x1)   # total expected count in component 1
x2_freqs <- sum(freqs1$frq_x2)   # total expected count in component 2

# columns 10 and 11 of freqs1 are frq_x1 and frq_x2

f_y1_c1 <- sum(freqs1[1:4,10])          # rows with y1=1 (patterns 1-x-x)
f_y1_c2 <- sum(freqs1[1:4,11])

f_y2_c1 <- sum(freqs1[c(1:2, 5:6),10])  # rows with y2=1 (patterns x-1-x)
f_y2_c2 <- sum(freqs1[c(1:2, 5:6), 11])

f_y3_c1 <- sum(freqs1[c(1,3,5,7), 10])  # rows with y3=1 (patterns x-x-1)
f_y3_c2 <- sum(freqs1[c(1,3,5,7), 11])
    
    newfreqs <- data.frame(estimate="new",
                      probs=params1$probs,
                      iteration=1) %>%
                      mutate(X1=c(x1_freqs,
                      f_y1_c1,
                      f_y2_c1,
                      f_y3_c1),
                      X2=c(x2_freqs,
                      f_y1_c2,
                      f_y2_c2,
                      f_y3_c2))
                  
    newfreqs_flex <- newfreqs %>%
    flextable() 
    
    newfreqs_flex

estimate   probs       iteration   X1    X2
new        P(X)            1       956   757
new        P(y1=1|X)       1       795   374
new        P(y2=1|X)       1       606   211
new        P(y3=1|X)       1       772   358

    6. Estimate new model parameters

Now we use these predicted frequency counts to generate new parameter probabilities.

    pnew_x1 <- x1_freqs / (x1_freqs + x2_freqs)
    #pnew_x1
    
    pnew_x2 <- x2_freqs / (x1_freqs + x2_freqs)
    #pnew_x2
    
    # y's for component 1
    
    y1_x1 <- f_y1_c1 / x1_freqs
    #y1_x1
    
    y2_x1 <- f_y2_c1 / x1_freqs
    #y2_x1
    
    y3_x1 <- f_y3_c1 / x1_freqs
    #y3_x1
    
    # y's for component 2
    
    y1_x2 <- f_y1_c2 / x2_freqs
    #y1_x2
    
    y2_x2 <- f_y2_c2 / x2_freqs
    #y2_x2
    
    y3_x2 <- f_y3_c2 / x2_freqs
    #y3_x2

    Arrange new model parameters into a table.

    new <- data.frame(estimate="new",
                      probs=params1$probs,
                      iteration=1) %>%
                      mutate(X1=c(pnew_x1,
                      y1_x1,
                      y2_x1,
                      y3_x1),
                      X2=c(pnew_x2,
                      y1_x2,
                      y2_x2,
                      y3_x2))
                  
    #flextable(new)
    
    params1 <- bind_rows(params1, new)
    
    flextable(params1) 

estimate   probs       iteration    X1      X2
current    P(X)            1       0.500   0.500
current    P(y1=1|X)       1       0.600   0.400
current    P(y2=1|X)       1       0.600   0.400
current    P(y3=1|X)       1       0.600   0.400
new        P(X)            1       0.558   0.442
new        P(y1=1|X)       1       0.831   0.495
new        P(y2=1|X)       1       0.633   0.279
new        P(y3=1|X)       1       0.808   0.473

Comparing initial to new estimates of the model parameters, we see that the estimated share of component 1 in the sample increases to 56 percent, while the estimated share of component 2 decreases to 44 percent. The estimated probability of a 1 response on each indicator variable increased for component 1 and decreased for component 2.

What are we to make of this? There exists a proof that these new estimates better describe the observed data. We could verify this directly if we had known the hidden labels beforehand and blinded them for this analysis. But without that hidden knowledge, how do we know?

We can compute the log likelihood of the response patterns at each iteration to see whether the fit to the observed data improves. Each pattern contributes its observed count times the log of its model-implied probability, and it can be shown that the log likelihood is guaranteed never to decrease from one iteration of the algorithm to the next.

    freqs1 <- freqs1 %>%
      mutate(ll=frq*log(joint_y))
    
    flextable(freqs1)

val     frq   perc    iteration   joint_x1   joint_x2   joint_y   x1_y    x2_y    frq_x1   frq_x2        ll
1-1-1   696   0.406       1         0.216      0.064      0.14    0.771   0.229    536.9    159.1   -1,368.4
1-1-2    68   0.040       1         0.144      0.096      0.12    0.600   0.400     40.8     27.2     -144.2
1-2-1   275   0.160       1         0.144      0.096      0.12    0.600   0.400    165.0    110.0     -583.1
1-2-2   130   0.076       1         0.096      0.144      0.12    0.400   0.600     52.0     78.0     -275.6
2-1-1    34   0.020       1         0.144      0.096      0.12    0.600   0.400     20.4     13.6      -72.1
2-1-2    19   0.011       1         0.096      0.144      0.12    0.400   0.600      7.6     11.4      -40.3
2-2-1   125   0.073       1         0.096      0.144      0.12    0.400   0.600     50.0     75.0     -265.0
2-2-2   366   0.214       1         0.064      0.216      0.14    0.229   0.771     83.7    282.3     -719.6

    ll_1 <- sum(freqs1$ll)
    ll_1
    [1] -3468

    Iterate the model

    Let’s now go through the iteration again, using the new model parameter estimates.

    params2 <- new %>%
      mutate(estimate="current",
             iteration=2)
    
    flextable(params2)

estimate   probs       iteration    X1      X2
current    P(X)            2       0.558   0.442
current    P(y1=1|X)       2       0.831   0.495
current    P(y2=1|X)       2       0.633   0.279
current    P(y3=1|X)       2       0.808   0.473

    p_x1 <- params2[1,4] # share of component 1
    p_x2 <- params2[1,5]
    
    y1_x1 <- params2[2,4] # probability of y1 for component 1
    y2_x1 <- params2[3,4]
    y3_x1 <- params2[4,4]
    
    y1_x2 <- params2[2,5] # probability of y1 for component 2
    y2_x2 <- params2[3,5]
    y3_x2 <- params2[4,5]

    Probability of response patterns, if they belonged to component 1.

    x1_111 <- y1_x1*y2_x1*y3_x1 
    # probability of 1-1-1 belonging to component 1
x1_112 <- y1_x1*y2_x1*(1-y3_x1)
    # probability of 1-1-2 belonging to component 1
    x1_121 <- y1_x1*(1-y2_x1)*y3_x1
    # probability of 1-2-1 belonging to component 1
    x1_122 <- y1_x1*(1-y2_x1)*(1-y3_x1)
    x1_211 <- (1-y1_x1)*y2_x1*y3_x1
    x1_212 <- (1-y1_x1)*y2_x1*(1-y3_x1)
    x1_221 <- (1-y1_x1)*(1-y2_x1)*(y3_x1)
    x1_222 <- (1-y1_x1)*(1-y2_x1)*(1-y3_x1)
    
    jointy_x1 <- c(x1_111, x1_112, x1_121, x1_122, x1_211, x1_212, x1_221, x1_222)

    Probability of response patterns, if they belonged to component 2.

x2_111 <- y1_x2*y2_x2*y3_x2 
# probability of 1-1-1 belonging to component 2
x2_112 <- y1_x2*y2_x2*(1-y3_x2)
# probability of 1-1-2 belonging to component 2
x2_121 <- y1_x2*(1-y2_x2)*y3_x2
# probability of 1-2-1 belonging to component 2
    x2_122 <- y1_x2*(1-y2_x2)*(1-y3_x2)
    x2_211 <- (1-y1_x2)*y2_x2*y3_x2
    x2_212 <- (1-y1_x2)*y2_x2*(1-y3_x2)
    x2_221 <- (1-y1_x2)*(1-y2_x2)*(y3_x2)
    x2_222 <- (1-y1_x2)*(1-y2_x2)*(1-y3_x2)
    
    jointy_x2 <- c(x2_111, x2_112, x2_121, x2_122, x2_211, x2_212, x2_221, x2_222)
    
    freqs2 <- freqs %>%
      mutate(iteration=2,
             joint_x1=jointy_x1,
             joint_x2=jointy_x2,
             joint_y=joint_x1*p_x1 + joint_x2*p_x2,
             x1_y=(p_x1*joint_x1)/joint_y,
             x2_y=(p_x2*joint_x2)/joint_y,
             frq_x1=x1_y*frq,
             frq_x2=x2_y*frq)
    
    flextable(freqs2)

val     frq   perc    iteration   joint_x1   joint_x2   joint_y    x1_y    x2_y    frq_x1   frq_x2
1-1-1   696   0.406       2        0.4250     0.0653     0.2661   0.8916   0.108   620.56     75.4
1-1-2    68   0.040       2        0.1013     0.0728     0.0887   0.6374   0.363    43.34     24.7
1-2-1   275   0.160       2        0.2460     0.1685     0.2118   0.6485   0.351   178.34     96.7
1-2-2   130   0.076       2        0.0586     0.1880     0.1158   0.2828   0.717    36.76     93.2
2-1-1    34   0.020       2        0.0865     0.0667     0.0777   0.6209   0.379    21.11     12.9
2-1-2    19   0.011       2        0.0206     0.0744     0.0444   0.2593   0.741     4.93     14.1
2-2-1   125   0.073       2        0.0500     0.1722     0.1040   0.2687   0.731    33.59     91.4
2-2-2   366   0.214       2        0.0119     0.1920     0.0915   0.0728   0.927    26.64    339.4

x1_freqs <- sum(freqs2$frq_x1)   # total expected count in component 1
x2_freqs <- sum(freqs2$frq_x2)   # total expected count in component 2

# columns 10 and 11 of freqs2 are frq_x1 and frq_x2

f_y1_c1 <- sum(freqs2[1:4,10])          # rows with y1=1
f_y1_c2 <- sum(freqs2[1:4,11])

f_y2_c1 <- sum(freqs2[c(1:2, 5:6),10])  # rows with y2=1
f_y2_c2 <- sum(freqs2[c(1:2, 5:6), 11])

f_y3_c1 <- sum(freqs2[c(1,3,5,7), 10])  # rows with y3=1
f_y3_c2 <- sum(freqs2[c(1,3,5,7), 11])
    
    newfreqs <- data.frame(estimate="new",
                      probs=params2$probs,
                      iteration=2) %>%
                      mutate(X1=c(x1_freqs,
                      f_y1_c1,
                      f_y2_c1,
                      f_y3_c1),
                      X2=c(x2_freqs,
                      f_y1_c2,
                      f_y2_c2,
                      f_y3_c2))
                  
    newfreqs_flex <- newfreqs %>%
    flextable() 
    
    newfreqs_flex

estimate   probs       iteration   X1    X2
new        P(X)            2       965   748
new        P(y1=1|X)       2       879   290
new        P(y2=1|X)       2       690   127
new        P(y3=1|X)       2       854   276

    pnew_x1 <- x1_freqs / (x1_freqs + x2_freqs)
    #pnew_x1
    
    pnew_x2 <- x2_freqs / (x1_freqs + x2_freqs)
    #pnew_x2
    
    # y's for component 1
    
    y1_x1 <- f_y1_c1 / x1_freqs
    #y1_x1
    
    y2_x1 <- f_y2_c1 / x1_freqs
    #y2_x1
    
    y3_x1 <- f_y3_c1 / x1_freqs
    #y3_x1
    
    # y's for component 2
    
    y1_x2 <- f_y1_c2 / x2_freqs
    #y1_x2
    
    y2_x2 <- f_y2_c2 / x2_freqs
    #y2_x2
    
    y3_x2 <- f_y3_c2 / x2_freqs
    #y3_x2

    Arrange new model parameters into a table.

    new <- data.frame(estimate="new",
                      probs=params1$probs[1:4],
                      iteration=2) %>%
                      mutate(X1=c(pnew_x1,
                      y1_x1,
                      y2_x1,
                      y3_x1),
                      X2=c(pnew_x2,
                      y1_x2,
                      y2_x2,
                      y3_x2))
                  
    #flextable(new)
    
    params2 <- bind_rows(params2, new)
    
    flextable(params2) 

estimate   probs       iteration    X1      X2
current    P(X)            2       0.558   0.442
current    P(y1=1|X)       2       0.831   0.495
current    P(y2=1|X)       2       0.633   0.279
current    P(y3=1|X)       2       0.808   0.473
new        P(X)            2       0.564   0.436
new        P(y1=1|X)       2       0.911   0.388
new        P(y2=1|X)       2       0.715   0.170
new        P(y3=1|X)       2       0.884   0.370

As with iteration 1, the estimated share of component 1 has increased, while the estimated share of component 2 has decreased. Item agreement remains more strongly associated with component 1 than with component 2.

    Now let’s look at the log likelihood.

    freqs2 <- freqs2 %>%
      mutate(ll=frq*log(joint_y))
    
    flextable(freqs2)

val     frq   perc    iteration   joint_x1   joint_x2   joint_y    x1_y    x2_y    frq_x1   frq_x2       ll
1-1-1   696   0.406       2        0.4250     0.0653     0.2661   0.8916   0.108   620.56     75.4   -921.4
1-1-2    68   0.040       2        0.1013     0.0728     0.0887   0.6374   0.363    43.34     24.7   -164.7
1-2-1   275   0.160       2        0.2460     0.1685     0.2118   0.6485   0.351   178.34     96.7   -426.8
1-2-2   130   0.076       2        0.0586     0.1880     0.1158   0.2828   0.717    36.76     93.2   -280.3
2-1-1    34   0.020       2        0.0865     0.0667     0.0777   0.6209   0.379    21.11     12.9    -86.9
2-1-2    19   0.011       2        0.0206     0.0744     0.0444   0.2593   0.741     4.93     14.1    -59.2
2-2-1   125   0.073       2        0.0500     0.1722     0.1040   0.2687   0.731    33.59     91.4   -282.9
2-2-2   366   0.214       2        0.0119     0.1920     0.0915   0.0728   0.927    26.64    339.4   -875.3

ll_2 <- sum(freqs2$ll)
ll_1
[1] -3468
ll_2
[1] -3097

The overall log likelihood has increased, which means we are closer to convergence (the maximization of the likelihood of the data).

These iterations continue, with each iteration guaranteed not to decrease the log likelihood of the model. When an iteration yields a negligible increase in the log likelihood, the algorithm is said to have converged and we have arrived at a solution.
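To see the whole procedure in one place, here is a minimal sketch that wraps the steps above into a single function. It assumes the freqs table built earlier (columns val and frq); the function name, defaults, and convergence tolerance are illustrative choices, not a packaged implementation.

em_lca <- function(freqs, p1 = .5, py_1 = c(.6, .6, .6), py_2 = c(.4, .4, .4),
                   tol = 1e-8, max_iter = 1000) {
  # response patterns as a numeric matrix (entries are 1 or 2)
  Y <- do.call(rbind, lapply(strsplit(as.character(freqs$val), "-"), as.numeric))
  n <- freqs$frq
  ll_old <- -Inf
  for (i in seq_len(max_iter)) {
    # E step: P(pattern | X=c) under local independence, then posteriors
    lik1 <- apply(Y, 1, function(y) prod(ifelse(y == 1, py_1, 1 - py_1)))
    lik2 <- apply(Y, 1, function(y) prod(ifelse(y == 1, py_2, 1 - py_2)))
    joint_y <- p1 * lik1 + (1 - p1) * lik2
    w1 <- p1 * lik1 / joint_y                # P(X=1 | pattern)
    # M step: re-estimate parameters from posterior-weighted counts
    n1 <- w1 * n
    n2 <- (1 - w1) * n
    p1   <- sum(n1) / sum(n)
    py_1 <- colSums(n1 * (Y == 1)) / sum(n1)
    py_2 <- colSums(n2 * (Y == 1)) / sum(n2)
    ll <- sum(n * log(joint_y))
    if (ll - ll_old < tol) break             # converged
    ll_old <- ll
  }
  list(p1 = p1, py_1 = py_1, py_2 = py_2, loglik = ll, iterations = i)
}

fit <- em_lca(freqs)
round(c(fit$p1, fit$py_1, fit$py_2), 3)
fit$loglik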

It’s possible that the algorithm has identified a local maximum, rather than the global maximum. Run the algorithm from a few different random starts to see whether you reach the same solution. If you do not, take the solution with the highest log likelihood.
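Here is one way to do that, using the em_lca() sketch above; the number of starts and the ranges of the random initial values are arbitrary:

set.seed(42)
fits <- lapply(1:10, function(s) {
  em_lca(freqs,
         p1   = runif(1, .2, .8),   # random initial component share
         py_1 = runif(3),           # random initial item probabilities
         py_2 = runif(3))
})
# keep the fit with the highest log likelihood
best <- fits[[which.max(sapply(fits, function(f) f$loglik))]]
best$loglik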