Machine Learning¶

Larch is (mostly) compatible with the scikit-learn structure for machine learning. Within this structure, the larch.Model object can be used as both an estimator and a predictor.

Note: this page applies to the legacy interface for Larch. Support for these features in the numba-based version is planned but not yet available.

Using Larch within Scikit-Learn¶

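import larch
import pandas as pd
from larch import PX, P, X

# Load the MTCwork example file that ships with Larch
from larch.data_warehouse import example_file
df = pd.read_csv(example_file("MTCwork.csv.gz"))
# Index rows by (case, alternative), keeping both as regular columns as well
df.set_index(['casenum','altnum'], inplace=True, drop=False)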

To use the scikit-learn interface, we’ll need to define our model based exclusively on idca or idco format data. We do so here, although we don’t need to actually connect the model to the data yet.

m = larch.Model()

m.utility_ca = (
    PX('tottime') 
    + PX('totcost') 
    + sum(P(f'ASC_{i}') * X(f'altnum=={i}') for i in [2,3,4,5,6])
    + sum(P(f'HHINC#{i}') * X(f'(altnum=={i})*hhinc') for i in [2,3,4,5,6])
)
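
As a rough illustration of the idca layout and of what the indicator terms in the utility function compute, the sketch below builds a tiny made-up table (the values are invented, not taken from MTCwork.csv.gz) and evaluates the same kind of expressions with plain pandas. How Larch evaluates the X(...) strings internally is not shown on this page, so treat this as an approximation of the idea rather than a description of Larch itself.

```python
import pandas as pd

# Made-up idca-style rows: one row per (case, alternative) pair.
toy = pd.DataFrame(
    {
        "casenum": [1, 1, 1, 2, 2],
        "altnum": [1, 2, 3, 1, 3],
        "tottime": [15.0, 20.0, 40.0, 10.0, 35.0],
        "totcost": [150.0, 80.0, 60.0, 120.0, 50.0],
        "hhinc": [42.5, 42.5, 42.5, 17.5, 17.5],
        "chose": [1, 0, 0, 0, 1],
    }
)
# drop=False keeps 'altnum' available as a column for expressions like 'altnum==2'.
toy.set_index(["casenum", "altnum"], inplace=True, drop=False)

# Analogue of X('altnum==2'): 1.0 on rows for alternative 2, 0.0 elsewhere.
asc_2 = (toy["altnum"] == 2).astype(float)

# Analogue of X('(altnum==2)*hhinc'): household income, but only on alternative 2 rows.
hhinc_2 = (toy["altnum"] == 2) * toy["hhinc"]

# Sanity check on the choice column: exactly one chosen alternative per case.
chosen_per_case = toy["chose"].groupby(level="casenum").sum()
```

Each alternative-specific constant and income term in the utility function above is therefore a column that is nonzero only on the rows of one alternative.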

Because the larch.Model object is an estimator, it offers a fit method to estimate the fitted (likelihood-maximizing) parameters. This method takes a plain pandas.DataFrame as the X input. Because this is a regular DataFrame, the data does not internally identify which column or columns contain the observed choice values, so that data must be explicitly identified in the method call:

m.fit(df, y=df.chose)

req_data does not request avail_ca or avail_co but it is set and being provided

Iteration 010 [Optimization terminated successfully.]

Best LL = -3626.1862555129305

	value	minimum	maximum	best
ASC_2	-2.178014	-inf	inf	-2.178014
ASC_3	-3.725078	-inf	inf	-3.725078
ASC_4	-0.670861	-inf	inf	-0.670861
ASC_5	-2.376328	-inf	inf	-2.376328
ASC_6	-0.206775	-inf	inf	-0.206775
HHINC#2	-0.002170	-inf	inf	-0.002170
HHINC#3	0.000358	-inf	inf	0.000358
HHINC#4	-0.005286	-inf	inf	-0.005286
HHINC#5	-0.012808	-inf	inf	-0.012808
HHINC#6	-0.009686	-inf	inf	-0.009686
totcost	-0.004920	-inf	inf	-0.004920
tottime	-0.051342	-inf	inf	-0.051342

<larch.Model (MNL)>

Unlike most scikit-learn estimators, the fit method cannot accept a numpy ndarray, because Larch needs the column names to match the data to the predefined utility function. But we can use the predict, predict_proba, and score methods with DataFrame inputs.

m.predict(df)

req_data does not request avail_ca or avail_co but it is set and being provided

      altnum
   1         1.0
       0.0
       0.0
       0.0
       0.0
               ... 
2         0.0
       0.0
       0.0
       0.0
       0.0
Length: 22033, dtype: float64

proba = m.predict_proba(df)
proba.head(10)

req_data does not request avail_ca or avail_co but it is set and being provided

   altnum
1         0.817458
       0.077710
       0.017906
       0.071428
       0.015497
1         0.336928
       0.074339
       0.052072
       0.498117
       0.038545
dtype: float64
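
The two outputs above suggest that predict marks, within each case, the alternative with the highest predicted probability. The sketch below reproduces that relationship with plain pandas, using the ten probabilities printed above. The case ids (1, 2) and alternative numbers (1 to 5) are assumed labels, since the index did not survive in the rendered output, and the argmax rule is inferred from the outputs rather than from Larch's source.

```python
import pandas as pd

# Assumed labels: two cases with five alternatives each.
index = pd.MultiIndex.from_product(
    [[1, 2], [1, 2, 3, 4, 5]], names=["casenum", "altnum"]
)
proba = pd.Series(
    [0.817458, 0.077710, 0.017906, 0.071428, 0.015497,
     0.336928, 0.074339, 0.052072, 0.498117, 0.038545],
    index=index,
)

# Probabilities should sum to (approximately) one within each case.
case_totals = proba.groupby(level="casenum").sum()

# Flag the most likely alternative in each case with 1.0, all others with 0.0.
case_max = proba.groupby(level="casenum").transform("max")
predicted = (proba == case_max).astype(float)
```

Here the first case is assigned to its first alternative and the second case to its fourth.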

score = m.score(df, y=df.chose)
score

req_data does not request avail_ca or avail_co but it is set and being provided

-0.7210551313408093

score * m.dataframes.n_cases

-3626.18625551293
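
The product above matches the "Best LL" reported by fit, which indicates that score is the mean log likelihood per case. A short arithmetic check using only the numbers printed on this page:

```python
# Values copied from the outputs above.
score = -0.7210551313408093      # result of m.score(df, y=df.chose)
best_ll = -3626.1862555129305    # "Best LL" reported by m.fit

# If score is the per-case mean, the ratio recovers the number of cases.
n_cases = round(best_ll / score)

# Multiplying back should reproduce the total log likelihood.
reconstructed_ll = score * n_cases
difference = abs(reconstructed_ll - best_ll)
```

With these values n_cases comes out to 5029, and the reconstructed total agrees with the reported log likelihood up to floating-point rounding.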

Using Scikit-Learn within Larch¶

It is also possible to use machine learning methods in a chained model with Larch. This can be implemented through a “prelearning” step, which builds a predictor using some other machine learning method and then adds the result of that prediction as an input to the discrete choice model.
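
To make the data flow concrete, here is a minimal sketch of the two-step idea using only numpy and pandas. It is not how larch.prelearning works internally: an ordinary least-squares fit stands in for the machine learning method, the data is made up, and only the column name prelearned_utility is taken from the model definition later on this page.

```python
import numpy as np
import pandas as pd

# Made-up idca-style data: three cases with two alternatives each.
toy = pd.DataFrame(
    {
        "tottime": [10.0, 25.0, 15.0, 12.0, 30.0, 20.0],
        "totcost": [100.0, 40.0, 80.0, 90.0, 60.0, 50.0],
        "chose": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    },
    index=pd.MultiIndex.from_product(
        [[1, 2, 3], [1, 2]], names=["casenum", "altnum"]
    ),
)

# Step 1: "prelearn" a predictor of the choice indicator.
# Least squares with an intercept is a stand-in for a real learner.
design = np.column_stack(
    [np.ones(len(toy)), toy[["tottime", "totcost"]].to_numpy()]
)
coef, *_ = np.linalg.lstsq(design, toy["chose"].to_numpy(), rcond=None)

# Step 2: attach the prediction as a new column, which the discrete
# choice model can then use like any other explanatory variable.
toy["prelearned_utility"] = design @ coef
```

Because the new column is itself fitted to the observed choices, any parameter estimated on it in the second stage has already seen the outcome.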

Use this power with great care! Applying a prelearner can result in over-fitting, spoil the interpretability of some or all of the model parameters, and create other challenging problems. Achieving an amazingly good log likelihood is not necessarily a sign that you have a good model.

import larch.prelearning

dfs = larch.DataFrames(df.drop(columns=['casenum','altnum']), ch='chose', crack=True)

prelearned = larch.prelearning.XGBoostPrelearner(
    dfs,
    ca_columns=['totcost', 'tottime'],
    co_columns=['numveh', 'hhsize', 'hhinc', 'famtype', 'age'],
    eval_metric='logloss',
)

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 prelearned = larch.prelearning.XGBoostPrelearner(
      2     dfs,
      3     ca_columns=['totcost', 'tottime'],
      4     co_columns=['numveh', 'hhsize', 'hhinc', 'famtype', 'age'],
      5     eval_metric='logloss',
      6 )

File ~/work/larch/larch/larch/larch/prelearning.py:396, in XGBoostPrelearner.__init__(self, dataframes, ca_columns, co_columns, cache_file, fit, output_name, **kwargs)
    386 def __init__(
    387 		self,
    388 		dataframes,
   (...)
    394 		**kwargs,
    395 ):
--> 396 	from xgboost import XGBRegressor, XGBClassifier
    398 	training_Y = dataframes.array_ch_as_ce()
    399 	use_soft = numpy.any((training_Y != 0) & (training_Y != 1.0))

ModuleNotFoundError: No module named 'xgboost'
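
The traceback above shows that xgboost was not installed in the environment that built this page, and the errors in the remaining cells follow from that first failure. A small check like the following, run before constructing the prelearner, surfaces the problem earlier. It uses only the standard library, and the pip command in the message is one common way to install the package, not a Larch-specific instruction.

```python
import importlib.util

# find_spec returns None when a top-level package cannot be found.
has_xgboost = importlib.util.find_spec("xgboost") is not None

if not has_xgboost:
    print(
        "xgboost is not installed; install it (for example with "
        "`pip install xgboost`) before using XGBoostPrelearner."
    )
```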

dfs1 = prelearned.apply(dfs)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [15], in <cell line: 1>()
----> 1 dfs1 = prelearned.apply(dfs)

NameError: name 'prelearned' is not defined

m = larch.Model(dfs1)

m.utility_ca = (
    PX('tottime') 
    + PX('totcost') 
    + PX('prelearned_utility') 
)
m.utility_co[2] = P("ASC_SR2")  + P("hhinc#2") * X("hhinc")
m.utility_co[3] = P("ASC_SR3P") + P("hhinc#3") * X("hhinc")
m.utility_co[4] = P("ASC_TRAN") + P("hhinc#4") * X("hhinc")
m.utility_co[5] = P("ASC_BIKE") + P("hhinc#5") * X("hhinc")
m.utility_co[6] = P("ASC_WALK") + P("hhinc#6") * X("hhinc")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [16], in <cell line: 1>()
----> 1 m = larch.Model(dfs1)
      3 m.utility_ca = (
      4     PX('tottime') 
      5     + PX('totcost') 
      6     + PX('prelearned_utility') 
      7 )
      8 m.utility_co[2] = P("ASC_SR2")  + P("hhinc#2") * X("hhinc")

NameError: name 'dfs1' is not defined

m.load_data()
m.loglike()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 m.load_data()
      2 m.loglike()

File ~/work/larch/larch/larch/larch/model/controller.pyx:597, in larch.model.controller.Model5c.load_data()

ValueError: dataservice is not defined

m.maximize_loglike()

Iteration 000 [Exception]

LL = ...

	value	minimum	maximum
ASC_2	-2.178014	-inf	inf
ASC_3	-3.725078	-inf	inf
ASC_4	-0.670861	-inf	inf
ASC_5	-2.376328	-inf	inf
ASC_6	-0.206775	-inf	inf
HHINC#2	-0.002170	-inf	inf
HHINC#3	0.000358	-inf	inf
HHINC#4	-0.005286	-inf	inf
HHINC#5	-0.012808	-inf	inf
HHINC#6	-0.009686	-inf	inf
totcost	-0.004920	-inf	inf
tottime	-0.051342	-inf	inf

error in maximize_loglike
Traceback (most recent call last):
  File "/home/runner/work/larch/larch/larch/larch/model/optimization.py", line 258, in maximize_loglike
    current_ll, tolerance, iter_bhhh, steps_bhhh, message = model.simple_fit_bhhh(
  File "larch/model/abstract_model.pyx", line 357, in larch.model.abstract_model.AbstractChoiceModel.simple_fit_bhhh
  File "larch/model/abstract_model.pyx", line 199, in larch.model.abstract_model.AbstractChoiceModel._loglike2_bhhh_tuple
  File "larch/model/controller.pyx", line 787, in larch.model.controller.Model5c.loglike2_bhhh
  File "larch/model/controller.pyx", line 852, in larch.model.controller.Model5c.__prepare_for_compute
larch.exceptions.MissingDataError: model.dataframes does not define data_ch

---------------------------------------------------------------------------
MissingDataError                          Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 m.maximize_loglike()

File ~/work/larch/larch/larch/larch/model/abstract_model.pyx:641, in larch.model.abstract_model.AbstractChoiceModel.maximize_loglike()

File ~/work/larch/larch/larch/larch/model/optimization.py:258, in maximize_loglike(model, method, method2, quiet, screen_update_throttle, final_screen_update, check_for_overspecification, return_tags, reuse_tags, iteration_number, iteration_number_tail, options, maxiter, bhhh_start, jumpstart, jumpstart_split, leave_out, keep_only, subsample, return_dashboard, dashboard, prior_result, **kwargs)
    241         current_ll, tolerance, iter_bhhh, steps_bhhh, message = model.fit_bhhh(
    242             # steplen=1.0,
    243             # momentum=5,
   (...)
    255             # max_constraint_sharpness=1e6,
    256         )
    257     else:
--> 258         current_ll, tolerance, iter_bhhh, steps_bhhh, message = model.simple_fit_bhhh(
    259             ctol=stopping_tol,
    260             maxiter=max_iter,
    261             callback=callback,
    262             jumpstart=jumpstart,
    263             jumpstart_split=jumpstart_split,
    264             leave_out=leave_out,
    265             keep_only=keep_only,
    266             subsample=subsample,
    267         )
    268     raw_result = {
    269         'loglike' :current_ll,
    270         'x': model.pvals,
   (...)
    273         'message' :message,
    274     }
    275 except NotImplementedError:

File ~/work/larch/larch/larch/larch/model/abstract_model.pyx:357, in larch.model.abstract_model.AbstractChoiceModel.simple_fit_bhhh()

File ~/work/larch/larch/larch/larch/model/abstract_model.pyx:199, in larch.model.abstract_model.AbstractChoiceModel._loglike2_bhhh_tuple()

File ~/work/larch/larch/larch/larch/model/controller.pyx:787, in larch.model.controller.Model5c.loglike2_bhhh()

File ~/work/larch/larch/larch/larch/model/controller.pyx:852, in larch.model.controller.Model5c.__prepare_for_compute()

MissingDataError: model.dataframes does not define data_ch

v5.7.0
