107: Latent Class Models
In this example, we will replicate the latent class example model from Biogeme.
import larch
import pandas as pd
from larch.roles import P, X
The swissmetro dataset used in this example is conveniently bundled with Larch, accessible using the data_warehouse module. We'll load this file using the pandas read_csv function.
from larch import data_warehouse
raw = pd.read_csv(data_warehouse.example_file('swissmetro.csv.gz'))
We can inspect a few rows of data to see what we have using the head method.
raw.head()
| | GROUP | SURVEY | SP | ID | PURPOSE | FIRST | TICKET | WHO | LUGGAGE | AGE | ... | TRAIN_TT | TRAIN_CO | TRAIN_HE | SM_TT | SM_CO | SM_HE | SM_SEATS | CAR_TT | CAR_CO | CHOICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 112 | 48 | 120 | 63 | 52 | 20 | 0 | 117 | 65 | 2 |
| 1 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 103 | 48 | 30 | 60 | 49 | 10 | 0 | 117 | 84 | 2 |
| 2 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 130 | 48 | 60 | 67 | 58 | 30 | 0 | 117 | 52 | 2 |
| 3 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 103 | 40 | 30 | 63 | 52 | 20 | 0 | 72 | 52 | 2 |
| 4 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 130 | 36 | 60 | 63 | 42 | 20 | 0 | 90 | 84 | 2 |

5 rows × 28 columns
The Biogeme code includes a variety of commands to manipulate the data and create new variables. Because Larch sits on top of pandas, a reasonable way to create new variables is simply to create new columns in the source pandas.DataFrame, in the usual manner for any DataFrame.
raw['SM_COST'] = raw['SM_CO'] * (raw["GA"]==0)
You can also use the eval method of pandas DataFrames. This method takes an expression as a string and evaluates it within a namespace that includes the column names as variables.
raw['TRAIN_COST'] = raw.eval("TRAIN_CO * (GA == 0)")
This allows data expressions to be written more succinctly, as long as all your column names are also valid Python identifiers. If they aren't (e.g., if a column name contains spaces), you're better off avoiding this feature.
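If you do need to reference such a column, newer versions of pandas (0.25 and later) also let you quote non-identifier names with backticks inside eval expressions. A minimal sketch, using a hypothetical column whose name contains a space:

df_demo = pd.DataFrame({"base fare": [48, 40], "GA": [0, 1]})  # hypothetical data
df_demo["cost"] = df_demo.eval("`base fare` * (GA == 0)")      # backticks quote the awkward name
df_demo["cost"] = df_demo["base fare"] * (df_demo["GA"] == 0)  # equivalent ordinary column arithmetic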
We can mix and match between these two methods to create new columns in any DataFrame as needed.
raw['TRAIN_COST_SCALED'] = raw['TRAIN_COST'] / 100
raw['TRAIN_TT_SCALED'] = raw['TRAIN_TT'] / 100
raw['SM_COST_SCALED'] = raw.eval('SM_COST / 100')
raw['SM_TT_SCALED'] = raw['SM_TT'] / 100
raw['CAR_CO_SCALED'] = raw['CAR_CO'] / 100
raw['CAR_TT_SCALED'] = raw['CAR_TT'] / 100
raw['CAR_AV_SP'] = raw.eval("CAR_AV * (SP!=0)")
raw['TRAIN_AV_SP'] = raw.eval("TRAIN_AV * (SP!=0)")
Removing some observations can also be done directly using pandas. Here we identify a subset of observations that we want to keep.
keep = raw.eval("PURPOSE in (1,3) and CHOICE != 0")
You may note that we don't assign this value to a column within the raw DataFrame. This is perfectly acceptable, as the output from the eval method is just a normal pandas.Series, like any other single-column output you might expect to get from a pandas method.
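If you prefer ordinary boolean indexing over eval strings, the same mask can be built directly; a quick sketch:

keep = raw['PURPOSE'].isin([1, 3]) & (raw['CHOICE'] != 0)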
When you've created the data you need, you can pass the dataframe to the larch.DataFrames constructor. Since the swissmetro data is in idco format, we'll need to explicitly identify the alternative codes as well.
dfs = larch.DataFrames(raw[keep], alt_codes=[1,2,3])
The info method of the DataFrames object gives a short summary of the contents.
dfs.info()
larch.DataFrames: (not computation-ready)
n_cases: 6768
n_alts: 3
data_ca: <not populated>
data_co: 38 variables
A longer summary is available by setting verbose to True.
dfs.info(verbose=True)
larch.DataFrames: (not computation-ready)
n_cases: 6768
n_alts: 3
data_ca: <not populated>
data_co:
- GROUP (6768 non-null int64)
- SURVEY (6768 non-null int64)
- SP (6768 non-null int64)
- ID (6768 non-null int64)
- PURPOSE (6768 non-null int64)
- FIRST (6768 non-null int64)
- TICKET (6768 non-null int64)
- WHO (6768 non-null int64)
- LUGGAGE (6768 non-null int64)
- AGE (6768 non-null int64)
- MALE (6768 non-null int64)
- INCOME (6768 non-null int64)
- GA (6768 non-null int64)
- ORIGIN (6768 non-null int64)
- DEST (6768 non-null int64)
- TRAIN_AV (6768 non-null int64)
- CAR_AV (6768 non-null int64)
- SM_AV (6768 non-null int64)
- TRAIN_TT (6768 non-null int64)
- TRAIN_CO (6768 non-null int64)
- TRAIN_HE (6768 non-null int64)
- SM_TT (6768 non-null int64)
- SM_CO (6768 non-null int64)
- SM_HE (6768 non-null int64)
- SM_SEATS (6768 non-null int64)
- CAR_TT (6768 non-null int64)
- CAR_CO (6768 non-null int64)
- CHOICE (6768 non-null int64)
- SM_COST (6768 non-null int64)
- TRAIN_COST (6768 non-null int64)
- TRAIN_COST_SCALED (6768 non-null float64)
- TRAIN_TT_SCALED (6768 non-null float64)
- SM_COST_SCALED (6768 non-null float64)
- SM_TT_SCALED (6768 non-null float64)
- CAR_CO_SCALED (6768 non-null float64)
- CAR_TT_SCALED (6768 non-null float64)
- CAR_AV_SP (6768 non-null int64)
- TRAIN_AV_SP (6768 non-null int64)
You may have noticed that the info summary notes that this data is “not computation-ready”. That’s because some of the data columns are stored as integers, which can be observed by inspecting the info on the data_co dataframe.
dfs.data_co.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6768 entries, 0 to 8450
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GROUP 6768 non-null int64
1 SURVEY 6768 non-null int64
2 SP 6768 non-null int64
3 ID 6768 non-null int64
4 PURPOSE 6768 non-null int64
5 FIRST 6768 non-null int64
6 TICKET 6768 non-null int64
7 WHO 6768 non-null int64
8 LUGGAGE 6768 non-null int64
9 AGE 6768 non-null int64
10 MALE 6768 non-null int64
11 INCOME 6768 non-null int64
12 GA 6768 non-null int64
13 ORIGIN 6768 non-null int64
14 DEST 6768 non-null int64
15 TRAIN_AV 6768 non-null int64
16 CAR_AV 6768 non-null int64
17 SM_AV 6768 non-null int64
18 TRAIN_TT 6768 non-null int64
19 TRAIN_CO 6768 non-null int64
20 TRAIN_HE 6768 non-null int64
21 SM_TT 6768 non-null int64
22 SM_CO 6768 non-null int64
23 SM_HE 6768 non-null int64
24 SM_SEATS 6768 non-null int64
25 CAR_TT 6768 non-null int64
26 CAR_CO 6768 non-null int64
27 CHOICE 6768 non-null int64
28 SM_COST 6768 non-null int64
29 TRAIN_COST 6768 non-null int64
30 TRAIN_COST_SCALED 6768 non-null float64
31 TRAIN_TT_SCALED 6768 non-null float64
32 SM_COST_SCALED 6768 non-null float64
33 SM_TT_SCALED 6768 non-null float64
34 CAR_CO_SCALED 6768 non-null float64
35 CAR_TT_SCALED 6768 non-null float64
36 CAR_AV_SP 6768 non-null int64
37 TRAIN_AV_SP 6768 non-null int64
dtypes: float64(6), int64(32)
memory usage: 2.0 MB
When computations are run, we’ll need all the data to be in float format, but Larch knows this and will handle it for you later.
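If you would rather make that conversion explicit up front, a plain pandas cast works here, since every column in this dataframe is numeric; a minimal sketch:

raw_float = raw.astype('float64')  # cast every column to float; Larch would otherwise convert as needed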
Class Model Setup
Having prepped our data, we're ready to set up a discrete choice model for each class in the latent class model. We'll reproduce the Biogeme example exactly here, as a technology demonstration. Each of the two classes will be set up with a simple MNL model. Note that parameters with the same name (the alternative specific constants and B_COST) are shared across the two class models; only B_TIME is unique to the second class.
m1 = larch.Model(dataservice=dfs)
m1.availability_co_vars = {
1: "TRAIN_AV_SP",
2: "SM_AV",
3: "CAR_AV_SP",
}
m1.choice_co_code = 'CHOICE'
m1.utility_co[1] = P("ASC_TRAIN") + X("TRAIN_COST_SCALED") * P("B_COST")
m1.utility_co[2] = X("SM_COST_SCALED") * P("B_COST")
m1.utility_co[3] = P("ASC_CAR") + X("CAR_CO_SCALED") * P("B_COST")
m2 = larch.Model(dataservice=dfs)
m2.availability_co_vars = {
1: "TRAIN_AV_SP",
2: "SM_AV",
3: "CAR_AV_SP",
}
m2.choice_co_code = 'CHOICE'
m2.utility_co[1] = P("ASC_TRAIN") + X("TRAIN_TT_SCALED") * P("B_TIME") + X("TRAIN_COST_SCALED") * P("B_COST")
m2.utility_co[2] = X("SM_TT_SCALED") * P("B_TIME") + X("SM_COST_SCALED") * P("B_COST")
m2.utility_co[3] = P("ASC_CAR") + X("CAR_TT_SCALED") * P("B_TIME") + X("CAR_CO_SCALED") * P("B_COST")
Class Membership Model
For Larch, the class membership model will be set up as yet another discrete choice model. In this case, the choices are not the ultimate choices, but instead are the latent classes. To remain consistent with the Biogeme example, we’ll set up this model with only a single constant that determines class membership. Unlike Biogeme, this class membership will be represented with an MNL model, not a simple direct probability.
mk = larch.Model()
mk.utility_co[2] = P("W_OTHER")
The utility function of the first class isn’t written here, which means it will implicitly be set as 0.
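Since the utility of the first class is fixed at zero, this membership model is simply a binary logit, and the implied class shares follow the usual formula:

$$ \Pr(\text{class } 2) = \frac{e^{W_{\text{OTHER}}}}{1 + e^{W_{\text{OTHER}}}}, \qquad \Pr(\text{class } 1) = 1 - \Pr(\text{class } 2). $$

With the estimate of W_OTHER ≈ 1.094 found below, class 2 carries roughly 75 percent of the membership weight.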
Latent Class Model
Now we're ready to create the latent class model itself, by assembling the components we created above. The constructor for the LatentClassModel takes two arguments: a class membership model, and a dictionary of class models, where the keys in the dictionary correspond to the identifying codes from the utility functions we wrote for the class membership model.
from larch.model.latentclass import LatentClassModel
m = LatentClassModel(mk, {1:m1, 2:m2})
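Structurally, the latent class model mixes the class-specific choice probabilities, weighting each class model by its membership probability:

$$ \Pr(i) = \sum_{k} \Pr(\text{class } k) \, \Pr(i \mid \text{class } k), $$

where the membership probabilities come from mk and the conditional choice probabilities come from m1 and m2.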
Then we'll load the data needed for our models using the load_data method. This step will assemble the data needed, and convert it to floating point format as required.
m.load_data()
m.dataframes.info(verbose=1)
larch.DataFrames:
n_cases: 6768
n_alts: 3
data_ca: <not populated>
data_co:
- CAR_CO_SCALED (6768 non-null float64)
- CAR_TT_SCALED (6768 non-null float64)
- SM_COST_SCALED (6768 non-null float64)
- SM_TT_SCALED (6768 non-null float64)
- TRAIN_COST_SCALED (6768 non-null float64)
- TRAIN_TT_SCALED (6768 non-null float64)
data_av: <populated>
data_ch: <populated>
Only the data actually needed by the models has been converted, which may help keep memory usage down on larger models. You may also note that the loaded DataFrames object no longer reports that it is "not computation-ready".
To estimate the model, we'll use the maximize_loglike method. When run in Jupyter, a live-view report of the parameters and log likelihood is displayed.
result = m.maximize_loglike()
Iteration 056 [Optimization terminated successfully.]
Best LL = -5208.498065961453
| | value | initvalue | nullvalue | minimum | maximum | holdfast | note | best |
|---|---|---|---|---|---|---|---|---|
| ASC_CAR | 0.124611 | 0.0 | 0.0 | -inf | inf | 0 | | 0.124611 |
| ASC_TRAIN | -0.397611 | 0.0 | 0.0 | -inf | inf | 0 | | -0.397611 |
| B_COST | -1.263567 | 0.0 | 0.0 | -inf | inf | 0 | | -1.263567 |
| B_TIME | -2.797798 | 0.0 | 0.0 | -inf | inf | 0 | | -2.797798 |
| W_OTHER | 1.094291 | 0.0 | 0.0 | -inf | inf | 0 | | 1.094291 |
/home/runner/work/larch/larch/larch/larch/linalg/__init__.py:18: UserWarning: minimum eig 0.0 in general_inverse
warnings.warn(f"minimum eig {min_eig} in general_inverse")
result

| key | value |
|---|---|
| loglike | -5208.498065961453 |
| x | ASC_CAR 0.124611, ASC_TRAIN -0.397611, B_COST -1.263567, B_TIME -2.797798, W_OTHER 1.094291 |
| tolerance | 8.736059377236486e-06 |
| steps | array([1., 1., 1., ..., 1.]) (56 unit steps) |
| message | 'Optimization terminated successfully.' |
| elapsed_time | 0:00:04.948052 |
| method | 'bhhh' |
| n_cases | 6768 |
| iteration_number | 56 |
| logloss | 0.769577137405652 |
To complete our analysis, we can compute the log likelihood at “null” parameters.
m.loglike_null()
-6964.662979191462
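From these two values we can compute the usual rho-squared with respect to the null model; a quick sketch using the numbers above:

ll = -5208.498065961453   # converged log likelihood, from the result above
ll0 = -6964.662979191462  # null log likelihood, from m.loglike_null()
rho_squared = 1 - ll / ll0  # ≈ 0.2522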
And the parameter covariance matrices.
m.calculate_parameter_covariance()
m.covariance_matrix
| | ASC_CAR | ASC_TRAIN | B_COST | B_TIME | W_OTHER |
|---|---|---|---|---|---|
| ASC_CAR | 0.002549 | 0.001935 | 0.000252 | -0.004538 | -0.000080 |
| ASC_TRAIN | 0.001935 | 0.003702 | -0.000143 | -0.004348 | 0.001439 |
| B_COST | 0.000252 | -0.000143 | 0.003742 | 0.002839 | 0.000620 |
| B_TIME | -0.004538 | -0.004348 | 0.002839 | 0.030840 | 0.012869 |
| W_OTHER | -0.000080 | 0.001439 | 0.000620 | 0.012869 | 0.013563 |
m.robust_covariance_matrix
| | ASC_CAR | ASC_TRAIN | B_COST | B_TIME | W_OTHER |
|---|---|---|---|---|---|
| ASC_CAR | 0.010295 | 0.008356 | 0.001083 | -0.018769 | -0.001150 |
| ASC_TRAIN | 0.008356 | 0.015392 | -0.001147 | -0.017526 | 0.006384 |
| B_COST | 0.001083 | -0.001147 | 0.029295 | 0.014484 | -0.000933 |
| B_TIME | -0.018770 | -0.017526 | 0.014483 | 0.117864 | 0.047962 |
| W_OTHER | -0.001150 | 0.006384 | -0.000933 | 0.047962 | 0.053560 |
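The standard errors and robust standard errors reported in the parameter frame below are just the square roots of the diagonals of these two matrices; a sketch (both matrices are pandas DataFrames, so numpy can take their diagonals directly):

import numpy as np
std_err = np.sqrt(np.diag(m.covariance_matrix))                # e.g. ≈ 0.050483 for ASC_CAR
robust_std_err = np.sqrt(np.diag(m.robust_covariance_matrix))  # e.g. ≈ 0.101465 for ASC_CAR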
Reporting Results
And then we can generate a report of the estimation statistics. Larch includes a Reporter class to help you assemble a report containing the relevant output you want.
report = larch.Reporter("Latent Class Example")
Pipe section headers into the report in markdown format (one hash for top-level headings, two hashes for lower levels, and so on), as sketched below.
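For example (the heading text here is illustrative; the same pattern appears in the sections below):

report << "# Parameter Summary"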
You can also pipe in dataframes directly, including the pf parameter frame from the model.
report << m.pf
| | value | initvalue | nullvalue | minimum | maximum | holdfast | note | best | std_err | t_stat | robust_std_err | robust_t_stat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASC_CAR | 0.124611 | 0.0 | 0.0 | -inf | inf | 0 | | 0.124611 | 0.050483 | 2.468380 | 0.101465 | 1.228109 |
| ASC_TRAIN | -0.397611 | 0.0 | 0.0 | -inf | inf | 0 | | -0.397611 | 0.060847 | -6.534638 | 0.124063 | -3.204911 |
| B_COST | -1.263567 | 0.0 | 0.0 | -inf | inf | 0 | | -1.263567 | 0.061170 | -20.656769 | 0.171159 | -7.382413 |
| B_TIME | -2.797798 | 0.0 | 0.0 | -inf | inf | 0 | | -2.797798 | 0.175613 | -15.931613 | 0.343313 | -8.149404 |
| W_OTHER | 1.094291 | 0.0 | 0.0 | -inf | inf | 0 | | 1.094291 | 0.116461 | 9.396198 | 0.231431 | 4.728378 |
And a selection of pre-formatted summary sections.
report << "# Estimation Statistics"
report << m.estimation_statistics()
report << "# Parameter Covariance"
report << "## Typical Parameter Covariance"
report << m.covariance_matrix
report << "## Robust Parameter Covariance"
report << m.robust_covariance_matrix
Parameter Covariance

Typical Parameter Covariance

| | ASC_CAR | ASC_TRAIN | B_COST | B_TIME | W_OTHER |
|---|---|---|---|---|---|
| ASC_CAR | 0.002549 | 0.001935 | 0.000252 | -0.004538 | -0.000080 |
| ASC_TRAIN | 0.001935 | 0.003702 | -0.000143 | -0.004348 | 0.001439 |
| B_COST | 0.000252 | -0.000143 | 0.003742 | 0.002839 | 0.000620 |
| B_TIME | -0.004538 | -0.004348 | 0.002839 | 0.030840 | 0.012869 |
| W_OTHER | -0.000080 | 0.001439 | 0.000620 | 0.012869 | 0.013563 |

Robust Parameter Covariance

| | ASC_CAR | ASC_TRAIN | B_COST | B_TIME | W_OTHER |
|---|---|---|---|---|---|
| ASC_CAR | 0.010295 | 0.008356 | 0.001083 | -0.018769 | -0.001150 |
| ASC_TRAIN | 0.008356 | 0.015392 | -0.001147 | -0.017526 | 0.006384 |
| B_COST | 0.001083 | -0.001147 | 0.029295 | 0.014484 | -0.000933 |
| B_TIME | -0.018770 | -0.017526 | 0.014483 | 0.117864 | 0.047962 |
| W_OTHER | -0.001150 | 0.006384 | -0.000933 | 0.047962 | 0.053560 |
report << "# Utility Functions"
report << "## Class 1"
report << "### Formulae"
report << m1.utility_functions(resolve_parameters=False)
report << "### Final Estimated Values"
report << m1.utility_functions(resolve_parameters=True)
report << "## Class 2"
report << "### Formulae"
report << m1.utility_functions(resolve_parameters=False)
report << "### Final Estimated Values"
report << m1.utility_functions(resolve_parameters=True)
In addition to reviewing report sections in a Jupyter notebook, the entire report can be saved to an HTML file.
report.save('latent-class-example-report.html', overwrite=True)
'latent-class-example-report.html'