201: Exampville Mode Choice

Welcome to Exampville, the best simulated town in this here part of the internet!

Exampville is a demonstration provided with Larch that walks through some of the data and tools that a transportation planner might use when building a travel model.

[1]:
import larch, numpy, pandas, os
from larch import P, X
[2]:
larch.info
[2]:

Larch 5.2.11 /Users/jpn/Larch/Larch/larch

In this example notebook, we will walk through the creation of a tour mode choice model.

To begin, we’ll re-load the tours and skims data from the data setup example.

[3]:
hh, pp, tour, skims = larch.example(200, ['hh', 'pp', 'tour', 'skims'])

Preprocessing

The Exampville data output contains a set of files similar to what we might find for a real travel survey: network skims, and tables of households, persons, and tours. We’ll need to merge these tables to create a composite dataset for mode choice model estimation.

We can merge data from other tables using the usual pandas syntax for merging.

[4]:
raw = tour.merge(hh, on='HHID').merge(pp, on=('HHID', 'PERSONID'))

Our zone numbering system starts with zone 1, as is common for many TAZ numbering systems seen in practice. But, for looking up data in the skims matrix, we’ll need to use zero-based numbering that is standard in Python. So we’ll create two new TAZ-index columns to assist this process.

[5]:
raw["HOMETAZi"] = raw["HOMETAZ"] - 1
raw["DTAZi"] = raw["DTAZ"] - 1
[6]:
raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15934 entries, 0 to 15933
Data columns (total 20 columns):
TOURID           15934 non-null int64
HHID             15934 non-null int64
PERSONID         15934 non-null int64
DTAZ             15934 non-null int64
TOURMODE         15934 non-null int64
TOURPURP         15934 non-null int64
X                15934 non-null float64
Y                15934 non-null float64
INCOME           15934 non-null int64
geometry         15934 non-null object
HOMETAZ          15934 non-null int64
HHSIZE           15934 non-null int64
HHIDX            15934 non-null int64
AGE              15934 non-null int64
WORKS            15934 non-null int64
N_WORK_TOURS     15934 non-null int64
N_OTHER_TOURS    15934 non-null int64
N_TOURS          15934 non-null int64
HOMETAZi         15934 non-null int64
DTAZi            15934 non-null int64
dtypes: float64(2), int64(17), object(1)
memory usage: 2.6+ MB

In Exampville, there are only two kinds of trips:

  • work (purpose=1) and
  • non-work (purpose=2).

We want to estimate a mode choice model for work trips, so we’ll begin by excluding all the other trips:

[7]:
raw = raw[raw.TOURPURP == 1]
[8]:
raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4952 entries, 0 to 15931
Data columns (total 20 columns):
TOURID           4952 non-null int64
HHID             4952 non-null int64
PERSONID         4952 non-null int64
DTAZ             4952 non-null int64
TOURMODE         4952 non-null int64
TOURPURP         4952 non-null int64
X                4952 non-null float64
Y                4952 non-null float64
INCOME           4952 non-null int64
geometry         4952 non-null object
HOMETAZ          4952 non-null int64
HHSIZE           4952 non-null int64
HHIDX            4952 non-null int64
AGE              4952 non-null int64
WORKS            4952 non-null int64
N_WORK_TOURS     4952 non-null int64
N_OTHER_TOURS    4952 non-null int64
N_TOURS          4952 non-null int64
HOMETAZi         4952 non-null int64
DTAZi            4952 non-null int64
dtypes: float64(2), int64(17), object(1)
memory usage: 812.4+ KB
[9]:
skims
[9]:
<larch.OMX> ⋯/exampville_skims.omx
 |  shape:(40, 40)
 |  data:
 |    AUTO_COST    (float64)
 |    AUTO_DIST    (float64)
 |    AUTO_TIME    (float64)
 |    BIKE_TIME    (float64)
 |    TRANSIT_FARE (float64)
 |    TRANSIT_IVTT (float64)
 |    TRANSIT_OVTT (float64)
 |    WALK_DIST    (float64)
 |    WALK_TIME    (float64)

For this tour mode choice model, we can pick values out of the skims for the known O-D of the tour, so that we have access to the level-of-service data for each possible mode serving that O-D pair. We’ll use the get_rc_dataframe method of the larch.OMX object, which lets us give the a list of the index for the production and attraction (row and column) value we want to use.

[10]:
raw = raw.join(
    skims.get_rc_dataframe(
        raw.HOMETAZi, raw.DTAZi,
    )
)

We can review the raw frame to see what variables are now included.

[11]:
raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4952 entries, 0 to 15931
Data columns (total 29 columns):
TOURID           4952 non-null int64
HHID             4952 non-null int64
PERSONID         4952 non-null int64
DTAZ             4952 non-null int64
TOURMODE         4952 non-null int64
TOURPURP         4952 non-null int64
X                4952 non-null float64
Y                4952 non-null float64
INCOME           4952 non-null int64
geometry         4952 non-null object
HOMETAZ          4952 non-null int64
HHSIZE           4952 non-null int64
HHIDX            4952 non-null int64
AGE              4952 non-null int64
WORKS            4952 non-null int64
N_WORK_TOURS     4952 non-null int64
N_OTHER_TOURS    4952 non-null int64
N_TOURS          4952 non-null int64
HOMETAZi         4952 non-null int64
DTAZi            4952 non-null int64
AUTO_COST        4952 non-null float64
AUTO_DIST        4952 non-null float64
AUTO_TIME        4952 non-null float64
BIKE_TIME        4952 non-null float64
TRANSIT_FARE     4952 non-null float64
TRANSIT_IVTT     4952 non-null float64
TRANSIT_OVTT     4952 non-null float64
WALK_DIST        4952 non-null float64
WALK_TIME        4952 non-null float64
dtypes: float64(11), int64(17), object(1)
memory usage: 1.3+ MB

Now we can assemble this data, along with the definitions of the alternatives, into a larch.DataFrames object. This object links information about the raw data and the choices into one data structure. It can handle multiple data formats (including idco and idca data simultanously), but here we’ll only be using a single idco format data table.

[12]:
# For clarity, we can define numbers as names for modes
DA = 1
SR = 2
Walk = 3
Bike = 4
Transit = 5
[13]:
dfs = larch.DataFrames(
    co=raw,
    alt_codes=[DA,SR,Walk,Bike,Transit],
    alt_names=['DA','SR','Walk','Bike','Transit'],
    ch_name='TOURMODE',
)

Model Definition

And then we are ready to create our model.

[14]:
m = larch.Model(dataservice=dfs)
m.title = "Exampville Work Tour Mode Choice v1"

We will explicitly define the set of utility functions we want to use. Because the DataFrames we are using to serve data to this model contains exclusively idco format data, we’ll use only the utility_co mapping to define a unique utility function for each alternative.

[15]:
m.utility_co[DA] = (
        + P.InVehTime * X.AUTO_TIME
        + P.Cost * X.AUTO_COST # dollars per mile
)

m.utility_co[SR] = (
        + P.ASC_SR
        + P.InVehTime * X.AUTO_TIME
        + P.Cost * (X.AUTO_COST * 0.5) # dollars per mile, half share
        + P("LogIncome:SR") * X("log(INCOME)")
)

m.utility_co[Walk] = (
        + P.ASC_Walk
        + P.NonMotorTime * X.WALK_TIME
        + P("LogIncome:Walk") * X("log(INCOME)")
)

m.utility_co[Bike] = (
        + P.ASC_Bike
        + P.NonMotorTime * X.BIKE_TIME
        + P("LogIncome:Bike") * X("log(INCOME)")
)

m.utility_co[Transit] = (
        + P.ASC_Transit
        + P.InVehTime * X.TRANSIT_IVTT
        + P.OutVehTime * X.TRANSIT_OVTT
        + P.Cost * X.TRANSIT_FARE
        + P("LogIncome:Transit") * X('log(INCOME)')
)

To write a nested logit mode, we’ll attach some nesting nodes to the model’s graph. Each new_node allows us to define the set of codes for the child nodes (elemental alternatives, or lower level nests) as well as giving the new nest a name and assigning a logsum parameter. The return value of this method is the node code for the newly created nest, which then can potenially be used as a child code when creating a higher level nest. We do this below, adding the ‘Car’ nest into the ‘Motor’ nest.

[16]:
Car = m.graph.new_node(parameter='Mu:Car', children=[DA,SR], name='Car')
NonMotor = m.graph.new_node(parameter='Mu:NonMotor', children=[Walk,Bike], name='NonMotor')
Motor = m.graph.new_node(parameter='Mu:Motor', children=[Car,Transit], name='Motor')

Let’s visually check on the nesting structure.

[17]:
m.graph
[17]:
%3 cluster_elemental 0 0 7 (7) NonMotor Mu:NonMotor 0->7 8 (8) Motor Mu:Motor 0->8 1 (1) DA 2 (2) SR 3 (3) Walk 4 (4) Bike 5 (5) Transit 6 (6) Car Mu:Car 6->1 6->2 7->3 7->4 8->5 8->6

The tour mode choice model’s choice variable is indicated by the code value in ‘TOURMODE’, and this can be defined for the model using choice_co_code.

[18]:
m.choice_co_code = 'TOURMODE'

We can also give a dictionary of availability conditions based on values in the idco data, using the availability_co_vars attribute. Alternatives that are always available can be indicated by setting the criterion to 1.

[19]:
m.availability_co_vars = {
    DA: 'AGE >= 16',
    SR: 1,
    Walk: 'WALK_TIME < 60',
    Bike: 'BIKE_TIME < 60',
    Transit: 'TRANSIT_FARE>0',
}

Then let’s prepare this data for estimation. Even though the data is already in memory, the load_data method is used to pre-process the data, extracting the required values, pre-computing the values of fixed expressions, and assembling the results into contiguous arrays suitable for computing the log likelihood values efficiently.

Model Estimation

[20]:
m.load_data()

We can check on some important statistics of this loaded data even before we estimate the model.

[21]:
m.dataframes.choice_avail_summary()
[21]:
name chosen available
1 DA 3984.0 4952
2 SR 570.0 4952
3 Walk 114.0 2789
4 Bike 47.0 4952
5 Transit 237.0 2651
< Total All Alternatives > 4952.0
[22]:
m.dataframes.data_co.statistics()
[22]:
nminimummaximummedianhistogrammeanstdevzerospositivesnegativesnonzero_minimumnonzero_maximumnonzero_meannonzero_stdev
AUTO_COST49520.1949264.307960.991167
1.197670.7469320495200.1949264.307961.197670.746932
AUTO_COST*(0.5)49520.09746282.153980.495584
0.5988340.3734660495200.09746282.153980.5988340.373466
AUTO_TIME49520.93000830.80237.55075
8.242454.604590495200.93000830.80238.242454.60459
BIKE_TIME49522.7846552.232113.3696
15.86099.079420495202.7846552.232115.86099.07942
TRANSIT_FARE495202.52.5
Histograms are purple if the data is represented as discrete values.
1.338351.246872301265102.52.52.50
TRANSIT_IVTT4952012.16680.959322
Histograms are orange if the zeros are numerous and have been excluded.
2.519453.353362301265100.95932212.16684.706263.27318
TRANSIT_OVTT49520.759059127.03139.2568
39.239824.05740495200.759059127.03139.239824.0574
WALK_TIME495211.1386208.92853.4784
63.443536.317704952011.1386208.92863.443536.3177
log(INCOME)49527.077514.852710.4983
10.47941.167670495207.077514.852710.47941.16767

If we are satisfied with the statistics we see above, we can go ahead and estimate the model.

[23]:
result = m.maximize_loglike(method='slsqp')

Iteration 043 [Converged]

LL = -2384.571967781753

value initvalue nullvalue minimum maximum holdfast note best
ASC_Bike 0.059887 0.0 0.0 -inf inf 0 0.059887
ASC_SR 1.162299 0.0 0.0 -inf inf 0 1.162299
ASC_Transit 4.565357 0.0 0.0 -inf inf 0 4.565357
ASC_Walk 7.365430 0.0 0.0 -inf inf 0 7.365430
Cost -0.376209 0.0 0.0 -inf inf 0 -0.376209
InVehTime -0.090880 0.0 0.0 -inf inf 0 -0.090880
LogIncome:Bike -0.243509 0.0 0.0 -inf inf 0 -0.243509
LogIncome:SR -0.267938 0.0 0.0 -inf inf 0 -0.267938
LogIncome:Transit -0.318997 0.0 0.0 -inf inf 0 -0.318997
LogIncome:Walk -0.468722 0.0 0.0 -inf inf 0 -0.468722
Mu:Car 0.700454 1.0 1.0 0.001 1.0 0 0.700454
Mu:Motor 0.691102 1.0 1.0 0.001 1.0 0 0.691102
Mu:NonMotor 0.749037 1.0 1.0 0.001 1.0 0 0.749037
NonMotorTime -0.235496 0.0 0.0 -inf inf 0 -0.235496
OutVehTime -0.248537 0.0 0.0 -inf inf 0 -0.248537

After we find the best fitting parameters, we can compute some variance-covariance statistics, incuding standard errors of the estimates and t statistics, using calculate_parameter_covariance.

[24]:
m.calculate_parameter_covariance()

Then we can review the results in a variety of report tables.

[25]:
m.parameter_summary()
[25]:
ParameterValueStd Errt StatNull Value
ASC_Bike0.059891.250.050.0
ASC_SR1.1620.7841.480.0
ASC_Transit4.5651.982.310.0
ASC_Walk7.3651.176.290.0
Cost-0.37620.206-1.830.0
InVehTime-0.090880.0388-2.340.0
LogIncomeBike-0.24350.120-2.020.0
SR-0.26790.164-1.640.0
Transit-0.3190.143-2.230.0
Walk-0.46870.100-4.670.0
MuCar0.70050.418-0.721.0
Motor0.69110.269-1.151.0
NonMotor0.7490.129-1.951.0
NonMotorTime-0.23550.0215-10.970.0
OutVehTime-0.24850.0979-2.540.0
[26]:
m.estimation_statistics()
[26]:
StatisticAggregatePer Case
Number of Cases4952
Log Likelihood at Convergence-2384.57-0.48
Log Likelihood at Null Parameters-6958.98-1.41
Rho Squared w.r.t. Null Parameters0.657

Save and Report Model

[27]:
report = larch.Reporter(title=m.title)
[28]:
report << '# Parameter Summary' << m.parameter_summary()
[28]:

Parameter Summary

ParameterValueStd Errt StatNull Value
ASC_Bike0.059891.250.050.0
ASC_SR1.1620.7841.480.0
ASC_Transit4.5651.982.310.0
ASC_Walk7.3651.176.290.0
Cost-0.37620.206-1.830.0
InVehTime-0.090880.0388-2.340.0
LogIncomeBike-0.24350.120-2.020.0
SR-0.26790.164-1.640.0
Transit-0.3190.143-2.230.0
Walk-0.46870.100-4.670.0
MuCar0.70050.418-0.721.0
Motor0.69110.269-1.151.0
NonMotor0.7490.129-1.951.0
NonMotorTime-0.23550.0215-10.970.0
OutVehTime-0.24850.0979-2.540.0
[29]:
report << "# Estimation Statistics" << m.estimation_statistics()
[29]:

Estimation Statistics

StatisticAggregatePer Case
Number of Cases4952
Log Likelihood at Convergence-2384.57-0.48
Log Likelihood at Null Parameters-6958.98-1.41
Rho Squared w.r.t. Null Parameters0.657
[30]:
report << "# Utility Functions" << m.utility_functions()
[30]:

Utility Functions

altformula
1
+
P.InVehTime-0.0909
*
X.AUTO_TIMEThis is Data

+
P.Cost-0.376
*
X.AUTO_COSTThis is Data
2
+
P.ASC_SR1.16

+
P.InVehTime-0.0909
*
X.AUTO_TIMEThis is Data

+
P.Cost-0.376
*
X('AUTO_COST*(0.5)')This is Data

+
P('LogIncome:SR')-0.268
*
X('log(INCOME)')This is Data
3
+
P.ASC_Walk7.37

+
P.NonMotorTime-0.235
*
X.WALK_TIMEThis is Data

+
P('LogIncome:Walk')-0.469
*
X('log(INCOME)')This is Data
4
+
P.ASC_Bike0.0599

+
P.NonMotorTime-0.235
*
X.BIKE_TIMEThis is Data

+
P('LogIncome:Bike')-0.244
*
X('log(INCOME)')This is Data
5
+
P.ASC_Transit4.57

+
P.InVehTime-0.0909
*
X.TRANSIT_IVTTThis is Data

+
P.OutVehTime-0.249
*
X.TRANSIT_OVTTThis is Data

+
P.Cost-0.376
*
X.TRANSIT_FAREThis is Data

+
P('LogIncome:Transit')-0.319
*
X('log(INCOME)')This is Data
[31]:
report.save(
    '/tmp/exampville_mode_choice.html',
    overwrite=True,
    metadata=m,
)
[31]:
'/tmp/exampville_mode_choice.html'