Machine Learning

Larch is (mostly) compatible with the scikit-learn structure for machine learning. Within this structure, the larch.Model object can be used as an estimator and as a predictor.

Note: this page applies to the legacy interface for Larch. Updates to enable these features in the numba-based version are planned but not yet available.

Using Larch within Scikit-Learn

import larch
import pandas as pd
from larch import PX, P, X
from larch.data_warehouse import example_file
df = pd.read_csv(example_file("MTCwork.csv.gz"))
df.set_index(['casenum','altnum'], inplace=True, drop=False)

To use the scikit-learn interface, we’ll need to define our model based exclusively on idca or idco format data. We do so here, although we don’t need to actually connect the model to the data yet.

m = larch.Model()

m.utility_ca = (
    PX('tottime') 
    + PX('totcost') 
    + sum(P(f'ASC_{i}') * X(f'altnum=={i}') for i in [2,3,4,5,6])
    + sum(P(f'HHINC#{i}') * X(f'(altnum=={i})*hhinc') for i in [2,3,4,5,6])
)
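
To make the data expressions in this utility function concrete, the following sketch uses plain pandas on a hypothetical three-row table to show the values that expressions such as altnum==2 and (altnum==2)*hhinc stand for. It illustrates the arithmetic only; it is not how Larch evaluates expressions internally, and the toy values are made up.

```python
import pandas as pd

# Hypothetical toy rows in idca format: one row per (case, alternative).
toy = pd.DataFrame({
    "casenum": [0, 0, 0],
    "altnum": [1, 2, 3],
    "hhinc": [42, 42, 42],
})

# Parameter names generated by the f-strings in the utility function.
asc_names = [f"ASC_{i}" for i in [2, 3, 4, 5, 6]]

# 'altnum==2' acts as an indicator: 1 on rows for alternative 2, 0 elsewhere.
is_alt2 = (toy["altnum"] == 2).astype(int)

# '(altnum==2)*hhinc' is household income on alternative 2 rows, 0 elsewhere.
hhinc_alt2 = is_alt2 * toy["hhinc"]

print(asc_names)
print(is_alt2.tolist())
print(hhinc_alt2.tolist())
```

Because the sums run over alternatives 2 through 6 only, alternative 1 gets no constant or income term and serves as the reference alternative.
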

Because the larch.Model object is an estimator, it offers a fit method to estimate the fitted (likelihood-maximizing) parameters. This method takes a plain pandas.DataFrame as the X input. Because this is a regular DataFrame, the data does not internally identify which column(s) contain the observed choice values, so that data must be explicitly identified in the method call:

m.fit(df, y=df.chose)
req_data does not request avail_ca or avail_co but it is set and being provided

Iteration 010 [Optimization terminated successfully.]

Best LL = -3626.1862555129305

             value  initvalue  nullvalue  minimum  maximum  holdfast  note       best
ASC_2    -2.178014        0.0        0.0     -inf      inf         0        -2.178014
ASC_3    -3.725078        0.0        0.0     -inf      inf         0        -3.725078
ASC_4    -0.670861        0.0        0.0     -inf      inf         0        -0.670861
ASC_5    -2.376328        0.0        0.0     -inf      inf         0        -2.376328
ASC_6    -0.206775        0.0        0.0     -inf      inf         0        -0.206775
HHINC#2  -0.002170        0.0        0.0     -inf      inf         0        -0.002170
HHINC#3   0.000358        0.0        0.0     -inf      inf         0         0.000358
HHINC#4  -0.005286        0.0        0.0     -inf      inf         0        -0.005286
HHINC#5  -0.012808        0.0        0.0     -inf      inf         0        -0.012808
HHINC#6  -0.009686        0.0        0.0     -inf      inf         0        -0.009686
totcost  -0.004920        0.0        0.0     -inf      inf         0        -0.004920
tottime  -0.051342        0.0        0.0     -inf      inf         0        -0.051342
<larch.Model (MNL)>

Unlike most scikit-learn estimators, the fit method cannot accept a numpy ndarray, because Larch needs the column names to be able to match up the data to the pre-defined utility function. But we can use the predict, predict_proba and score functions with dataframe inputs.
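
To see why column names matter here, consider the sketch below. The helper missing_columns is hypothetical (it is not part of Larch or scikit-learn), and the toy table is made up. It only illustrates that a DataFrame can be matched against the variable names used in a utility function, while the same values as a bare ndarray cannot.

```python
import pandas as pd

# Hypothetical toy table with two of the columns named in the utility function.
toy = pd.DataFrame({"tottime": [10.0, 25.0], "totcost": [2.0, 1.0]})

def missing_columns(data, required):
    """Return the required column names that `data` cannot supply."""
    # A numpy ndarray has no .columns attribute, so nothing can be matched.
    available = getattr(data, "columns", [])
    return [name for name in required if name not in available]

required = ["tottime", "totcost"]
from_frame = missing_columns(toy, required)
from_array = missing_columns(toy.to_numpy(), required)

print(from_frame)
print(from_array)
```
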

m.predict(df)
req_data does not request avail_ca or avail_co but it is set and being provided
      altnum
0     1         1.0
      2         0.0
      3         0.0
      4         0.0
      5         0.0
               ... 
5028  2         0.0
      3         0.0
      4         0.0
      5         0.0
      6         0.0
Length: 22033, dtype: float64

proba = m.predict_proba(df)
proba.head(10)
req_data does not request avail_ca or avail_co but it is set and being provided
   altnum
0  1         0.817458
   2         0.077710
   3         0.017906
   4         0.071428
   5         0.015497
1  1         0.336928
   2         0.074339
   3         0.052072
   4         0.498117
   5         0.038545
dtype: float64
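
As a quick sanity check on this output, the sketch below copies the ten probabilities printed above into a pandas Series and confirms that they sum to one within each case, up to rounding of the displayed digits. The index level names are labels chosen for this sketch. It also finds the most likely alternative per case; for case 0 this is alternative 1, which is the alternative marked in the predict output above.

```python
import pandas as pd

# Probabilities copied from the predict_proba output above (cases 0 and 1).
# The level names "casenum" and "altnum" are labels chosen for this sketch.
index = pd.MultiIndex.from_product(
    [[0, 1], [1, 2, 3, 4, 5]], names=["casenum", "altnum"]
)
proba = pd.Series(
    [0.817458, 0.077710, 0.017906, 0.071428, 0.015497,
     0.336928, 0.074339, 0.052072, 0.498117, 0.038545],
    index=index,
)

# Within each case the probabilities should sum to one (up to rounding).
totals = proba.groupby(level="casenum").sum()

# The (case, alternative) label with the highest probability in each case.
best = proba.groupby(level="casenum").idxmax()

print(totals)
print(best)
```
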
score = m.score(df, y=df.chose)
score
req_data does not request avail_ca or avail_co but it is set and being provided
-0.7210551313408093

score * m.dataframes.n_cases
-3626.18625551293

Using Scikit-Learn within Larch

It is also possible to use machine learning methods in a chained model with Larch. This is implemented through a “prelearning” step, which builds a predictor using some other machine learning method and then adds the result of that prediction as an input to the discrete choice model.
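
The general pattern of such a chained model can be sketched without Larch at all. In the example below, ordinary least squares (via numpy) stands in for the machine learning method, and the toy data is hypothetical. It is only meant to show the two steps: train a predictor, then attach its prediction as a new column (named prelearned_utility here) that a later choice model can use like any other variable. It says nothing about how a Larch prelearner computes its output.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in idca format: one row per (case, alternative).
toy = pd.DataFrame({
    "casenum": [0, 0, 1, 1, 2, 2],
    "altnum": [1, 2, 1, 2, 1, 2],
    "tottime": [10.0, 25.0, 30.0, 12.0, 15.0, 15.0],
    "totcost": [2.0, 1.0, 4.0, 1.5, 3.0, 2.5],
    "chose": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})

# Step 1: "prelearn" a predictor of the choice indicator. Ordinary least
# squares is a stand-in for a real learner and is an assumption of this sketch.
features = toy[["tottime", "totcost"]].to_numpy()
design = np.column_stack([np.ones(len(toy)), features])
coef, *_ = np.linalg.lstsq(design, toy["chose"].to_numpy(), rcond=None)

# Step 2: attach the prediction as a new column for the downstream model.
toy["prelearned_utility"] = design @ coef

print(toy[["casenum", "altnum", "prelearned_utility"]])
```

Note that the predictor here is trained and applied on the same rows, which is exactly the situation where over-fitting can creep in.
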

Use this power with great care! Applying a prelearner can result in over-fitting, spoil the interpretability of some or all of the model parameters, and create other challenging problems. Achieving an amazingly good log likelihood is not necessarily a sign that you have a good model.

import larch.prelearning
dfs = larch.DataFrames(df.drop(columns=['casenum','altnum']), ch='chose', crack=True)
prelearned = larch.prelearning.XGBoostPrelearner(
    dfs,
    ca_columns=['totcost', 'tottime'],
    co_columns=['numveh', 'hhsize', 'hhinc', 'famtype', 'age'],
    eval_metric='logloss',
)
ModuleNotFoundError: No module named 'xgboost'

XGBoostPrelearner imports xgboost when it is constructed, so that package must be installed before running this step. It was not installed in the environment that rendered this page, so this cell and the ones that follow show errors rather than results.

dfs1 = prelearned.apply(dfs)
NameError: name 'prelearned' is not defined

m = larch.Model(dfs1)

m.utility_ca = (
    PX('tottime') 
    + PX('totcost') 
    + PX('prelearned_utility') 
)
m.utility_co[2] = P("ASC_SR2")  + P("hhinc#2") * X("hhinc")
m.utility_co[3] = P("ASC_SR3P") + P("hhinc#3") * X("hhinc")
m.utility_co[4] = P("ASC_TRAN") + P("hhinc#4") * X("hhinc")
m.utility_co[5] = P("ASC_BIKE") + P("hhinc#5") * X("hhinc")
m.utility_co[6] = P("ASC_WALK") + P("hhinc#6") * X("hhinc")
NameError: name 'dfs1' is not defined

m.load_data()
m.loglike()
ValueError: dataservice is not defined

m.maximize_loglike()

Iteration 000 [Exception]

MissingDataError: model.dataframes does not define data_ch