Machine Learning¶
Larch is (mostly) compatible with the scikit-learn structure for machine learning.
Within this structure, the larch.Model object can be used as both an estimator
and a predictor.
Note this page applies to the legacy interface for Larch. Updates to enable these features for the numba-based version are coming eventually.
Using Larch within Scikit-Learn¶
import larch
import pandas as pd
from larch import PX, P, X
from larch.data_warehouse import example_file
df = pd.read_csv(example_file("MTCwork.csv.gz"))
df.set_index(['casenum','altnum'], inplace=True, drop=False)
To use the scikit-learn interface, we’ll need to define our model based exclusively on idca or idco format data. We do so here, although we don’t need to actually connect the model to the data yet.
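For readers new to the term, idca data has one row per case-alternative pair, while idco data has one row per case. The sketch below uses plain Python and made-up numbers to show the long (idca) layout. Only the column names are borrowed from the example file; the values are hypothetical.

```python
# Hypothetical values; only the column names come from the example data above.
idca_rows = [
    # one row per (casenum, altnum) pair
    {"casenum": 0, "altnum": 1, "tottime": 15.4, "totcost": 70.6, "chose": 1},
    {"casenum": 0, "altnum": 2, "tottime": 20.4, "totcost": 35.3, "chose": 0},
    {"casenum": 1, "altnum": 1, "tottime": 22.0, "totcost": 90.1, "chose": 0},
    {"casenum": 1, "altnum": 2, "tottime": 27.0, "totcost": 45.1, "chose": 1},
]

# Group the long-format rows by case to recover each case's choice set.
by_case = {}
for row in idca_rows:
    by_case.setdefault(row["casenum"], []).append(row)

# Record which alternative was chosen in each case.
chosen_alt = {}
for casenum, rows in by_case.items():
    picks = [r["altnum"] for r in rows if r["chose"] == 1]
    chosen_alt[casenum] = picks[0]
    print(casenum, "alternatives:", [r["altnum"] for r in rows], "chosen:", picks[0])
```

Every attribute, including the choice indicator, lives in an ordinary column here, which is why the choice column has to be named separately when fitting.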
m = larch.Model()
m.utility_ca = (
    PX('tottime')
    + PX('totcost')
    + sum(P(f'ASC_{i}') * X(f'altnum=={i}') for i in [2,3,4,5,6])
    + sum(P(f'HHINC#{i}') * X(f'(altnum=={i})*hhinc') for i in [2,3,4,5,6])
)
Because the larch.Model object is an estimator, it offers a fit
method to estimate the fitted (likelihood-maximizing) parameters. This method
takes a plain pandas.DataFrame as the X input. Because
this is a regular DataFrame, the data does not internally identify which column(s)
contain the observed choice values, so that data must be explicitly identified
in the method call:
m.fit(df, y=df.chose)
req_data does not request avail_ca or avail_co but it is set and being provided
Iteration 010 [Optimization terminated successfully.]
Best LL = -3626.1862555129305
| | value | initvalue | nullvalue | minimum | maximum | holdfast | note | best |
|---|---|---|---|---|---|---|---|---|
| ASC_2 | -2.178014 | 0.0 | 0.0 | -inf | inf | 0 | | -2.178014 |
| ASC_3 | -3.725078 | 0.0 | 0.0 | -inf | inf | 0 | | -3.725078 |
| ASC_4 | -0.670861 | 0.0 | 0.0 | -inf | inf | 0 | | -0.670861 |
| ASC_5 | -2.376328 | 0.0 | 0.0 | -inf | inf | 0 | | -2.376328 |
| ASC_6 | -0.206775 | 0.0 | 0.0 | -inf | inf | 0 | | -0.206775 |
| HHINC#2 | -0.002170 | 0.0 | 0.0 | -inf | inf | 0 | | -0.002170 |
| HHINC#3 | 0.000358 | 0.0 | 0.0 | -inf | inf | 0 | | 0.000358 |
| HHINC#4 | -0.005286 | 0.0 | 0.0 | -inf | inf | 0 | | -0.005286 |
| HHINC#5 | -0.012808 | 0.0 | 0.0 | -inf | inf | 0 | | -0.012808 |
| HHINC#6 | -0.009686 | 0.0 | 0.0 | -inf | inf | 0 | | -0.009686 |
| totcost | -0.004920 | 0.0 | 0.0 | -inf | inf | 0 | | -0.004920 |
| tottime | -0.051342 | 0.0 | 0.0 | -inf | inf | 0 | | -0.051342 |
<larch.Model (MNL)>
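To make the fitted numbers more concrete, here is a rough sketch of how a multinomial logit model turns named parameters and named columns into choice probabilities. It uses only the tottime and totcost coefficients from the output above and leaves out the ASC and HHINC terms, so it is not the full model. The attribute values and the helper functions are hypothetical and are not part of Larch.

```python
import math

# Estimated coefficients copied from the fit output above.
params = {"tottime": -0.051342, "totcost": -0.004920}

# Hypothetical attribute values for three alternatives of a single case.
alternatives = {
    1: {"tottime": 15.0, "totcost": 70.0},
    2: {"tottime": 20.0, "totcost": 35.0},
    3: {"tottime": 25.0, "totcost": 20.0},
}

def utility(attrs, params):
    """Linear-in-parameters utility: each parameter times its same-named column."""
    return sum(params[name] * attrs[name] for name in params)

def mnl_probabilities(alternatives, params):
    """Multinomial logit: a softmax of the utilities across alternatives."""
    utils = {alt: utility(attrs, params) for alt, attrs in alternatives.items()}
    top = max(utils.values())  # subtract the max for numerical stability
    expu = {alt: math.exp(u - top) for alt, u in utils.items()}
    total = sum(expu.values())
    return {alt: e / total for alt, e in expu.items()}

probs = mnl_probabilities(alternatives, params)
for alt, p in probs.items():
    print(alt, round(p, 4))
```

The lookup by name in utility is the reason column labels matter: without them there is no way to tell which column belongs to which parameter.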
Unlike most scikit-learn estimators, the fit method cannot
accept a numpy ndarray, because Larch needs the column names to
match the data to the pre-defined utility function. But we can
use the predict, predict_proba, and score
functions with DataFrame inputs.
m.predict(df)
req_data does not request avail_ca or avail_co but it is set and being provided
altnum
0 1 1.0
2 0.0
3 0.0
4 0.0
5 0.0
...
5028 2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
Length: 22033, dtype: float64
proba = m.predict_proba(df)
proba.head(10)
req_data does not request avail_ca or avail_co but it is set and being provided
altnum
0 1 0.817458
2 0.077710
3 0.017906
4 0.071428
5 0.015497
1 1 0.336928
2 0.074339
3 0.052072
4 0.498117
5 0.038545
dtype: float64
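The two outputs above appear to be related in the usual classifier way: predict flags the most probable alternative in each case. That is an inference from the displayed results (case 0, alternative 1), not a statement about Larch internals. A small sketch using the probabilities printed above:

```python
# Probabilities for the first two cases, copied from the predict_proba output above.
proba = {
    (0, 1): 0.817458, (0, 2): 0.077710, (0, 3): 0.017906,
    (0, 4): 0.071428, (0, 5): 0.015497,
    (1, 1): 0.336928, (1, 2): 0.074339, (1, 3): 0.052072,
    (1, 4): 0.498117, (1, 5): 0.038545,
}

# Find the highest-probability alternative within each case.
best = {}
for (case, alt), p in proba.items():
    if case not in best or p > proba[(case, best[case])]:
        best[case] = alt

# Build a 0/1 indicator over the same (case, alternative) index.
predicted = {key: 1.0 if best[key[0]] == key[1] else 0.0 for key in proba}
print(best)
```

Under this assumption, case 1 would be assigned to alternative 4, even though its probability is just under one half.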
score = m.score(df, y=df.chose)
score
req_data does not request avail_ca or avail_co but it is set and being provided
-0.7210551313408093
score * m.dataframes.n_cases
-3626.18625551293
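The last two cells suggest that score reports the mean log likelihood per case, so multiplying by the number of cases recovers the total. A sketch of that arithmetic follows. The three probabilities are hypothetical, and the case count of 5029 is inferred from the zero-based case index ending at 5028 in the predict output.

```python
import math

# Hypothetical probabilities assigned to the chosen alternative in three cases.
prob_of_chosen = [0.8, 0.5, 0.25]

total_ll = sum(math.log(p) for p in prob_of_chosen)
mean_ll = total_ll / len(prob_of_chosen)

# Scaling the per-case mean back up recovers the total log likelihood.
recovered_total = mean_ll * len(prob_of_chosen)

# The same arithmetic with the numbers reported above.
reported_score = -0.7210551313408093
inferred_n_cases = 5029  # assumption: cases are numbered 0 through 5028
reported_total = reported_score * inferred_n_cases
print(reported_total)
```

A per-case mean is convenient for comparing data sets of different sizes, while the total is what the optimizer reports as the best log likelihood.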
Using Scikit-Learn within Larch¶
It is also possible to use machine learning methods in a chained model with Larch. This can be implemented through a “prelearning” step, which builds a predictor using some other machine learning method and then adds the result of that prediction as an input into the discrete choice model.
Use this power with great care! Applying a prelearner can result in over-fitting, spoil the interpretability of some or all of the model parameters, and create other challenging problems. Achieving an amazingly good log likelihood is not necessarily a sign that you have a good model.
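One common way to check for over-fitting is to compare the score on cases that were held out of estimation. Because each case spans several rows in idca data, the split should be made by case rather than by row. The helper below is a hypothetical sketch in plain Python and is not part of Larch.

```python
import random

def split_cases(case_ids, test_fraction=0.25, seed=0):
    """Split the unique case ids into (train, test) sets. Hypothetical helper."""
    unique = sorted(set(case_ids))
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(unique)
    n_test = max(1, int(round(len(unique) * test_fraction)))
    return set(unique[n_test:]), set(unique[:n_test])

# Case ids as they would appear in long-format rows (one entry per alternative).
casenum_column = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

train_cases, test_cases = split_cases(casenum_column)

# Row positions for each subset; all rows of a case stay together.
train_rows = [i for i, c in enumerate(casenum_column) if c in train_cases]
test_rows = [i for i, c in enumerate(casenum_column) if c in test_cases]
print(len(train_rows), len(test_rows))
```

A large gap between the training score and the held-out score would be a warning sign that the prelearner has memorized the estimation data.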
import larch.prelearning
dfs = larch.DataFrames(df.drop(columns=['casenum','altnum']), ch='chose', crack=True)
prelearned = larch.prelearning.XGBoostPrelearner(
    dfs,
    ca_columns=['totcost', 'tottime'],
    co_columns=['numveh', 'hhsize', 'hhinc', 'famtype', 'age'],
    eval_metric='logloss',
)
ModuleNotFoundError: No module named 'xgboost'
The cell above failed when this page was built: XGBoostPrelearner imports XGBRegressor and XGBClassifier from xgboost when it is constructed, and that package was not installed. The remaining cells in this section depend on its result, so they fail in turn.
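The failure above comes from a missing package rather than from the model itself. A quick check with the standard library can confirm whether xgboost is importable before building the prelearner. The helper name module_available is made up for this sketch.

```python
import importlib.util

def module_available(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None

if module_available("xgboost"):
    print("xgboost found; the prelearner cell can run")
else:
    print("xgboost not found; install it before constructing XGBoostPrelearner")
```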
dfs1 = prelearned.apply(dfs)
NameError: name 'prelearned' is not defined
m = larch.Model(dfs1)
m.utility_ca = (
    PX('tottime')
    + PX('totcost')
    + PX('prelearned_utility')
)
m.utility_co[2] = P("ASC_SR2") + P("hhinc#2") * X("hhinc")
m.utility_co[3] = P("ASC_SR3P") + P("hhinc#3") * X("hhinc")
m.utility_co[4] = P("ASC_TRAN") + P("hhinc#4") * X("hhinc")
m.utility_co[5] = P("ASC_BIKE") + P("hhinc#5") * X("hhinc")
m.utility_co[6] = P("ASC_WALK") + P("hhinc#6") * X("hhinc")
NameError: name 'dfs1' is not defined
m.load_data()
m.loglike()
ValueError: dataservice is not defined
m.maximize_loglike()
Iteration 000 [Exception]
error in maximize_loglike
MissingDataError: model.dataframes does not define data_ch