107: Latent Class Models

In this example, we will replicate the latent class example model from Biogeme.

import larch
import pandas as pd
from larch.roles import P,X

The swissmetro dataset used in this example is conveniently bundled with Larch, accessible using the data_warehouse module. We’ll load this file using the pandas read_csv command.

from larch import data_warehouse
raw = pd.read_csv(data_warehouse.example_file('swissmetro.csv.gz'))

We can inspect a few rows of data to see what we have using the head method.

raw.head()
GROUP SURVEY SP ID PURPOSE FIRST TICKET WHO LUGGAGE AGE ... TRAIN_TT TRAIN_CO TRAIN_HE SM_TT SM_CO SM_HE SM_SEATS CAR_TT CAR_CO CHOICE
0 2 0 1 1 1 0 1 1 0 3 ... 112 48 120 63 52 20 0 117 65 2
1 2 0 1 1 1 0 1 1 0 3 ... 103 48 30 60 49 10 0 117 84 2
2 2 0 1 1 1 0 1 1 0 3 ... 130 48 60 67 58 30 0 117 52 2
3 2 0 1 1 1 0 1 1 0 3 ... 103 40 30 63 52 20 0 72 52 2
4 2 0 1 1 1 0 1 1 0 3 ... 130 36 60 63 42 20 0 90 84 2

5 rows × 28 columns

The Biogeme code includes a variety of commands to manipulate the data and create new variables. Because Larch sits on top of pandas, a reasonable method to create new variables is to just create new columns in the source pandas.DataFrame in the usual manner for any DataFrame.

raw['SM_COST'] = raw['SM_CO'] * (raw["GA"]==0) 

You can also use the eval method of pandas DataFrames. This method takes an expression as a string and evaluates it within a namespace that has already loaded the column names as variables.

raw['TRAIN_COST'] = raw.eval("TRAIN_CO * (GA == 0)") 

This can allow for writing data expressions more succinctly, as long as all your column names are also valid Python identifiers. If that isn't the case (e.g., if any column names contain spaces), you'll generally be better off staying away from this feature.
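
As an aside, recent versions of pandas do allow eval expressions to refer to such columns by wrapping the name in backticks. A minimal hypothetical sketch (the DataFrame and column name here are invented for illustration, not part of this example):

tmp = pd.DataFrame({"trip cost": [1.0, 2.0]})
tmp.eval("`trip cost` * 100")  # backticks handle names that aren't valid identifiers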

We can mix and match between these two methods to create new columns in any DataFrame as needed.

raw['TRAIN_COST_SCALED'] = raw['TRAIN_COST'] / 100
raw['TRAIN_TT_SCALED'] = raw['TRAIN_TT'] / 100

raw['SM_COST_SCALED'] = raw.eval('SM_COST / 100')
raw['SM_TT_SCALED'] = raw['SM_TT'] / 100

raw['CAR_CO_SCALED'] = raw['CAR_CO'] / 100
raw['CAR_TT_SCALED'] = raw['CAR_TT'] / 100
raw['CAR_AV_SP'] = raw.eval("CAR_AV * (SP!=0)")
raw['TRAIN_AV_SP'] = raw.eval("TRAIN_AV * (SP!=0)")

Removing some observations can also be done directly using pandas. Here we identify a subset of observations that we want to keep.

keep = raw.eval("PURPOSE in (1,3) and CHOICE != 0")

You may note that we don’t assign this value to a column within the raw DataFrame. This is perfectly acceptable, as the output from the eval method is just a normal pandas.Series, like any other single column output you might expect to get from a pandas method.
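
Because keep is a boolean Series, we can, for instance, sum it to count how many observations pass the filter; this count should match the number of cases reported below.

keep.sum()  # 6768 observations retained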

When you’ve created the data you need, you can pass the dataframe to the larch.DataFrames constructor. Since the swissmetro data is in idco format, we’ll need to explicitly identify the alternative codes as well.

dfs = larch.DataFrames(raw[keep], alt_codes=[1,2,3])

The info method of the DataFrames object gives a short summary of the contents.

dfs.info()
larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co: 38 variables

A longer summary is available by setting verbose to True.

dfs.info(verbose=True)
larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co:
    - GROUP             (6768 non-null int64)
    - SURVEY            (6768 non-null int64)
    - SP                (6768 non-null int64)
    - ID                (6768 non-null int64)
    - PURPOSE           (6768 non-null int64)
    - FIRST             (6768 non-null int64)
    - TICKET            (6768 non-null int64)
    - WHO               (6768 non-null int64)
    - LUGGAGE           (6768 non-null int64)
    - AGE               (6768 non-null int64)
    - MALE              (6768 non-null int64)
    - INCOME            (6768 non-null int64)
    - GA                (6768 non-null int64)
    - ORIGIN            (6768 non-null int64)
    - DEST              (6768 non-null int64)
    - TRAIN_AV          (6768 non-null int64)
    - CAR_AV            (6768 non-null int64)
    - SM_AV             (6768 non-null int64)
    - TRAIN_TT          (6768 non-null int64)
    - TRAIN_CO          (6768 non-null int64)
    - TRAIN_HE          (6768 non-null int64)
    - SM_TT             (6768 non-null int64)
    - SM_CO             (6768 non-null int64)
    - SM_HE             (6768 non-null int64)
    - SM_SEATS          (6768 non-null int64)
    - CAR_TT            (6768 non-null int64)
    - CAR_CO            (6768 non-null int64)
    - CHOICE            (6768 non-null int64)
    - SM_COST           (6768 non-null int64)
    - TRAIN_COST        (6768 non-null int64)
    - TRAIN_COST_SCALED (6768 non-null float64)
    - TRAIN_TT_SCALED   (6768 non-null float64)
    - SM_COST_SCALED    (6768 non-null float64)
    - SM_TT_SCALED      (6768 non-null float64)
    - CAR_CO_SCALED     (6768 non-null float64)
    - CAR_TT_SCALED     (6768 non-null float64)
    - CAR_AV_SP         (6768 non-null int64)
    - TRAIN_AV_SP       (6768 non-null int64)

You may have noticed that the info summary notes that this data is “not computation-ready”. That’s because some of the data columns are stored as integers, which can be observed by inspecting the info on the data_co dataframe.

dfs.data_co.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6768 entries, 0 to 8450
Data columns (total 38 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   GROUP              6768 non-null   int64  
 1   SURVEY             6768 non-null   int64  
 2   SP                 6768 non-null   int64  
 3   ID                 6768 non-null   int64  
 4   PURPOSE            6768 non-null   int64  
 5   FIRST              6768 non-null   int64  
 6   TICKET             6768 non-null   int64  
 7   WHO                6768 non-null   int64  
 8   LUGGAGE            6768 non-null   int64  
 9   AGE                6768 non-null   int64  
 10  MALE               6768 non-null   int64  
 11  INCOME             6768 non-null   int64  
 12  GA                 6768 non-null   int64  
 13  ORIGIN             6768 non-null   int64  
 14  DEST               6768 non-null   int64  
 15  TRAIN_AV           6768 non-null   int64  
 16  CAR_AV             6768 non-null   int64  
 17  SM_AV              6768 non-null   int64  
 18  TRAIN_TT           6768 non-null   int64  
 19  TRAIN_CO           6768 non-null   int64  
 20  TRAIN_HE           6768 non-null   int64  
 21  SM_TT              6768 non-null   int64  
 22  SM_CO              6768 non-null   int64  
 23  SM_HE              6768 non-null   int64  
 24  SM_SEATS           6768 non-null   int64  
 25  CAR_TT             6768 non-null   int64  
 26  CAR_CO             6768 non-null   int64  
 27  CHOICE             6768 non-null   int64  
 28  SM_COST            6768 non-null   int64  
 29  TRAIN_COST         6768 non-null   int64  
 30  TRAIN_COST_SCALED  6768 non-null   float64
 31  TRAIN_TT_SCALED    6768 non-null   float64
 32  SM_COST_SCALED     6768 non-null   float64
 33  SM_TT_SCALED       6768 non-null   float64
 34  CAR_CO_SCALED      6768 non-null   float64
 35  CAR_TT_SCALED      6768 non-null   float64
 36  CAR_AV_SP          6768 non-null   int64  
 37  TRAIN_AV_SP        6768 non-null   int64  
dtypes: float64(6), int64(32)
memory usage: 2.0 MB

When computations are run, we’ll need all the data to be in float format, but Larch knows this and will handle it for you later.
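
If you prefer, you could also make this conversion explicitly up front with an ordinary pandas astype call; a sketch of that optional step (we don't actually use this below, since Larch handles the conversion itself, and all the columns here are numeric so a blanket conversion is safe):

raw_float = raw.astype('float64')  # hypothetical manual conversion; not needed here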

Class Model Setup

Having prepped our data, we're ready to set up the discrete choice models for each class in the latent class model. We'll reproduce the Biogeme example exactly here, as a technology demonstration. Each of the two classes will be set up as a simple MNL model.

m1 = larch.Model(dataservice=dfs)
m1.availability_co_vars = {
    1: "TRAIN_AV_SP",
    2: "SM_AV",
    3: "CAR_AV_SP",
}
m1.choice_co_code = 'CHOICE'

m1.utility_co[1] = P("ASC_TRAIN") + X("TRAIN_COST_SCALED") * P("B_COST")
m1.utility_co[2] = X("SM_COST_SCALED") * P("B_COST")
m1.utility_co[3] = P("ASC_CAR") + X("CAR_CO_SCALED") * P("B_COST")
m2 = larch.Model(dataservice=dfs)
m2.availability_co_vars = {
    1: "TRAIN_AV_SP",
    2: "SM_AV",
    3: "CAR_AV_SP",
}
m2.choice_co_code = 'CHOICE'

m2.utility_co[1] = P("ASC_TRAIN") + X("TRAIN_TT_SCALED") * P("B_TIME") + X("TRAIN_COST_SCALED") * P("B_COST")
m2.utility_co[2] = X("SM_TT_SCALED") * P("B_TIME") + X("SM_COST_SCALED") * P("B_COST")
m2.utility_co[3] = P("ASC_CAR") + X("CAR_TT_SCALED") * P("B_TIME") + X("CAR_CO_SCALED") * P("B_COST")

Class Membership Model

For Larch, the class membership model will be set up as yet another discrete choice model. In this case, the choices are not the ultimate choices, but instead are the latent classes. To remain consistent with the Biogeme example, we’ll set up this model with only a single constant that determines class membership. Unlike Biogeme, this class membership will be represented with an MNL model, not a simple direct probability.

mk = larch.Model()
mk.utility_co[2] = P("W_OTHER")

The utility function of the first class isn't written here, which means it will implicitly be set to 0.
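
Since class 1's utility is fixed at 0 and class 2 has only the constant W_OTHER, the class membership probabilities implied by this MNL reduce to a binary logit. A quick sketch of that formula (using numpy; not part of the original example):

import numpy as np

def class_shares(w_other):
    # With class 1 utility fixed at 0, the MNL membership model
    # collapses to a binary logit in the single constant.
    p2 = np.exp(w_other) / (1 + np.exp(w_other))
    return 1 - p2, p2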

Latent Class Model

Now we’re ready to create the latent class model itself, by assembling the components we created above. The constructor for the LatentClassModel takes two arguments, a class membership model, and a dictionary of class models, where the keys in the dictionary correspond to the identifying codes from the utility functions we wrote for the class membership model.

from larch.model.latentclass import LatentClassModel
m = LatentClassModel(mk, {1:m1, 2:m2})

Then we’ll load the data needed for our models using the load_data method. This step will assemble the data needed and convert it to floating point format as required.

m.load_data()
m.dataframes.info(verbose=1)
larch.DataFrames:
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co:
    - CAR_CO_SCALED     (6768 non-null float64)
    - CAR_TT_SCALED     (6768 non-null float64)
    - SM_COST_SCALED    (6768 non-null float64)
    - SM_TT_SCALED      (6768 non-null float64)
    - TRAIN_COST_SCALED (6768 non-null float64)
    - TRAIN_TT_SCALED   (6768 non-null float64)
  data_av: <populated>
  data_ch: <populated>

Only the data actually needed by the models has been converted, which may help keep memory usage down on larger models. You may also note that the loaded DataFrames object no longer reports that it is “not computation-ready”.

To estimate the model, we’ll use the maximize_loglike method. When run in Jupyter, a live-view report of the parameters and log likelihood is displayed.

result = m.maximize_loglike()

Iteration 056 [Optimization terminated successfully.]

Best LL = -5208.498065961453

value initvalue nullvalue minimum maximum holdfast note best
ASC_CAR 0.124611 0.0 0.0 -inf inf 0 0.124611
ASC_TRAIN -0.397611 0.0 0.0 -inf inf 0 -0.397611
B_COST -1.263567 0.0 0.0 -inf inf 0 -1.263567
B_TIME -2.797798 0.0 0.0 -inf inf 0 -2.797798
W_OTHER 1.094291 0.0 0.0 -inf inf 0 1.094291
/home/runner/work/larch/larch/larch/larch/linalg/__init__.py:18: UserWarning: minimum eig 0.0 in general_inverse
  warnings.warn(f"minimum eig {min_eig} in general_inverse")
result
key                 value
loglike             -5208.498065961453
x                   ASC_CAR      0.124611
                    ASC_TRAIN   -0.397611
                    B_COST      -1.263567
                    B_TIME      -2.797798
                    W_OTHER      1.094291
tolerance           8.736059377236486e-06
steps               array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                           1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                           1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                           1., 1., 1., 1., 1.])
message             'Optimization terminated successfully.'
elapsed_time        0:00:04.948052
method              'bhhh'
n_cases             6768
iteration_number    56
logloss             0.769577137405652
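
As a side note, plugging the estimated W_OTHER into the binary logit sketched earlier implies class shares of roughly 0.25 for class 1 and 0.75 for class 2:

import numpy as np
np.exp(1.094291) / (1 + np.exp(1.094291))  # implied class 2 share, about 0.749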

To complete our analysis, we can compute the log likelihood at “null” parameters.

m.loglike_null()
-6964.662979191462
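
From these two log likelihood values we can also double-check the rho-squared statistic that appears in the estimation statistics below:

1 - (-5208.498065961453 / -6964.662979191462)  # rho squared w.r.t. null, about 0.252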

And the parameter covariance matrices.

m.calculate_parameter_covariance()
m.covariance_matrix
ASC_CAR ASC_TRAIN B_COST B_TIME W_OTHER
ASC_CAR 0.002549 0.001935 0.000252 -0.004538 -0.000080
ASC_TRAIN 0.001935 0.003702 -0.000143 -0.004348 0.001439
B_COST 0.000252 -0.000143 0.003742 0.002839 0.000620
B_TIME -0.004538 -0.004348 0.002839 0.030840 0.012869
W_OTHER -0.000080 0.001439 0.000620 0.012869 0.013563
m.robust_covariance_matrix
ASC_CAR ASC_TRAIN B_COST B_TIME W_OTHER
ASC_CAR 0.010295 0.008356 0.001083 -0.018769 -0.001150
ASC_TRAIN 0.008356 0.015392 -0.001147 -0.017526 0.006384
B_COST 0.001083 -0.001147 0.029295 0.014484 -0.000933
B_TIME -0.018770 -0.017526 0.014483 0.117864 0.047962
W_OTHER -0.001150 0.006384 -0.000933 0.047962 0.053560

Reporting Results

Finally, we can generate a report of the estimation statistics. Larch includes a Reporter class to help you assemble a report containing the relevant output you want.

report = larch.Reporter("Latent Class Example")

Pipe section headers into the report in markdown format (use one hash for top-level headings, two hashes for lower levels, etc.)

report << "# Parameter Estimates"

Parameter Estimates

You can also pipe in dataframes directly, including the pf parameter frame from the model.

report << m.pf
value initvalue nullvalue minimum maximum holdfast note best std_err t_stat robust_std_err robust_t_stat
ASC_CAR 0.124611 0.0 0.0 -inf inf 0 0.124611 0.050483 2.468380 0.101465 1.228109
ASC_TRAIN -0.397611 0.0 0.0 -inf inf 0 -0.397611 0.060847 -6.534638 0.124063 -3.204911
B_COST -1.263567 0.0 0.0 -inf inf 0 -1.263567 0.061170 -20.656769 0.171159 -7.382413
B_TIME -2.797798 0.0 0.0 -inf inf 0 -2.797798 0.175613 -15.931613 0.343313 -8.149404
W_OTHER 1.094291 0.0 0.0 -inf inf 0 1.094291 0.116461 9.396198 0.231431 4.728378

And a selection of pre-formatted summary sections.

report << "# Estimation Statistics"
report << m.estimation_statistics()

Estimation Statistics

Statistic                             Aggregate    Per Case
Number of Cases                       6768
Log Likelihood at Convergence         -5208.50     -0.77
Log Likelihood at Null Parameters     -6964.66     -1.03
Rho Squared w.r.t. Null Parameters    0.252
report << "# Parameter Covariance"
report << "## Typical Parameter Covariance"
report << m.covariance_matrix
report << "## Robust Parameter Covariance"
report << m.robust_covariance_matrix

Parameter Covariance

Typical Parameter Covariance

ASC_CAR ASC_TRAIN B_COST B_TIME W_OTHER
ASC_CAR 0.002549 0.001935 0.000252 -0.004538 -0.000080
ASC_TRAIN 0.001935 0.003702 -0.000143 -0.004348 0.001439
B_COST 0.000252 -0.000143 0.003742 0.002839 0.000620
B_TIME -0.004538 -0.004348 0.002839 0.030840 0.012869
W_OTHER -0.000080 0.001439 0.000620 0.012869 0.013563

Robust Parameter Covariance

ASC_CAR ASC_TRAIN B_COST B_TIME W_OTHER
ASC_CAR 0.010295 0.008356 0.001083 -0.018769 -0.001150
ASC_TRAIN 0.008356 0.015392 -0.001147 -0.017526 0.006384
B_COST 0.001083 -0.001147 0.029295 0.014484 -0.000933
B_TIME -0.018770 -0.017526 0.014483 0.117864 0.047962
W_OTHER -0.001150 0.006384 -0.000933 0.047962 0.053560
report << "# Utility Functions"
report << "## Class 1"
report << "### Formulae"
report << m1.utility_functions(resolve_parameters=False)
report << "### Final Estimated Values"
report << m1.utility_functions(resolve_parameters=True)

Utility Functions

Class 1

Formulae

alt  formula
1
+ P.ASC_TRAIN
+ P.B_COST * X.TRAIN_COST_SCALED
2
+ P.B_COST * X.SM_COST_SCALED
3
+ P.ASC_CAR
+ P.B_COST * X.CAR_CO_SCALED

Final Estimated Values

alt  formula
1
+ -0.398
+ -1.26 * X.TRAIN_COST_SCALED
2
+ -1.26 * X.SM_COST_SCALED
3
+ 0.125
+ -1.26 * X.CAR_CO_SCALED
report << "## Class 2"
report << "### Formulae"
report << m1.utility_functions(resolve_parameters=False)
report << "### Final Estimated Values"
report << m1.utility_functions(resolve_parameters=True)

Class 2

Formulae

alt  formula
1
+ P.ASC_TRAIN
+ P.B_TIME * X.TRAIN_TT_SCALED
+ P.B_COST * X.TRAIN_COST_SCALED
2
+ P.B_TIME * X.SM_TT_SCALED
+ P.B_COST * X.SM_COST_SCALED
3
+ P.ASC_CAR
+ P.B_TIME * X.CAR_TT_SCALED
+ P.B_COST * X.CAR_CO_SCALED

Final Estimated Values

alt  formula
1
+ -0.398
+ -2.80 * X.TRAIN_TT_SCALED
+ -1.26 * X.TRAIN_COST_SCALED
2
+ -2.80 * X.SM_TT_SCALED
+ -1.26 * X.SM_COST_SCALED
3
+ 0.125
+ -2.80 * X.CAR_TT_SCALED
+ -1.26 * X.CAR_CO_SCALED

In addition to reviewing report sections in a Jupyter notebook, the entire report can be saved to an HTML file.

report.save('latent-class-example-report.html', overwrite=True)
'latent-class-example-report.html'