201: Exampville Mode Choice

Welcome to Exampville, the best simulated town in this here part of the internet!

Exampville is a demonstration provided with Larch that walks through some of the data and tools that a transportation planner might use when building a travel model.

[1]:
import larch, numpy, pandas, os
from larch import P, X
[2]:
larch.__version__
[2]:
'5.4.0'

In this example notebook, we will walk through the creation of a tour mode choice model.

To begin, we’ll re-load the tours and skims data from the data setup example.

[3]:
hh, pp, tour, skims = larch.example(200, ['hh', 'pp', 'tour', 'skims'])

Preprocessing

The Exampville data output contains a set of files similar to what we might find for a real travel survey: network skims, and tables of households, persons, and tours. We’ll need to merge these tables to create a composite dataset for mode choice model estimation.

We can merge data from other tables using the usual pandas syntax for merging.

[4]:
raw = tour.merge(hh, on='HHID').merge(pp, on=('HHID', 'PERSONID'))

Our zone numbering system starts with zone 1, as is common for many TAZ numbering systems seen in practice. But, for looking up data in the skims matrix, we’ll need to use zero-based numbering that is standard in Python. So we’ll create two new TAZ-index columns to assist this process.

[5]:
raw["HOMETAZi"] = raw["HOMETAZ"] - 1
raw["DTAZi"] = raw["DTAZ"] - 1
[6]:
raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20739 entries, 0 to 20738
Data columns (total 35 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   TOURID         20739 non-null  int64
 1   HHID           20739 non-null  int64
 2   PERSONID       20739 non-null  int64
 3   DTAZ           20739 non-null  int64
 4   TOURMODE       20739 non-null  int64
 5   TOURPURP       20739 non-null  int64
 6   N_STOPS        20739 non-null  int64
 7   N_TRIPS_x      20739 non-null  int64
 8   N_TRIPS_HBW_x  20739 non-null  int64
 9   N_TRIPS_HBO_x  20739 non-null  int64
 10  N_TRIPS_NHB_x  20739 non-null  int64
 11  X              20739 non-null  float64
 12  Y              20739 non-null  float64
 13  INCOME         20739 non-null  float64
 14  N_VEHICLES     20739 non-null  int64
 15  HHSIZE         20739 non-null  int64
 16  geometry       20739 non-null  object
 17  HOMETAZ        20739 non-null  int64
 18  N_TRIPS_y      20739 non-null  int64
 19  N_TRIPS_HBW_y  20739 non-null  int64
 20  N_TRIPS_HBO_y  20739 non-null  int64
 21  N_TRIPS_NHB_y  20739 non-null  int64
 22  N_WORKERS      20739 non-null  int64
 23  HHIDX          20739 non-null  int64
 24  AGE            20739 non-null  int64
 25  WORKS          20739 non-null  int64
 26  N_WORK_TOURS   20739 non-null  int64
 27  N_OTHER_TOURS  20739 non-null  int64
 28  N_TOURS        20739 non-null  int64
 29  N_TRIPS        20739 non-null  int64
 30  N_TRIPS_HBW    20739 non-null  int64
 31  N_TRIPS_HBO    20739 non-null  int64
 32  N_TRIPS_NHB    20739 non-null  int64
 33  HOMETAZi       20739 non-null  int64
 34  DTAZi          20739 non-null  int64
dtypes: float64(3), int64(31), object(1)
memory usage: 5.7+ MB

In Exampville, there are only two kinds of trips:

  • work (purpose=1) and
  • non-work (purpose=2).

We want to estimate a mode choice model for work trips, so we’ll begin by excluding all the other trips:

[7]:
raw = raw[raw.TOURPURP == 1]
[8]:
raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7564 entries, 0 to 20736
Data columns (total 35 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   TOURID         7564 non-null   int64
 1   HHID           7564 non-null   int64
 2   PERSONID       7564 non-null   int64
 3   DTAZ           7564 non-null   int64
 4   TOURMODE       7564 non-null   int64
 5   TOURPURP       7564 non-null   int64
 6   N_STOPS        7564 non-null   int64
 7   N_TRIPS_x      7564 non-null   int64
 8   N_TRIPS_HBW_x  7564 non-null   int64
 9   N_TRIPS_HBO_x  7564 non-null   int64
 10  N_TRIPS_NHB_x  7564 non-null   int64
 11  X              7564 non-null   float64
 12  Y              7564 non-null   float64
 13  INCOME         7564 non-null   float64
 14  N_VEHICLES     7564 non-null   int64
 15  HHSIZE         7564 non-null   int64
 16  geometry       7564 non-null   object
 17  HOMETAZ        7564 non-null   int64
 18  N_TRIPS_y      7564 non-null   int64
 19  N_TRIPS_HBW_y  7564 non-null   int64
 20  N_TRIPS_HBO_y  7564 non-null   int64
 21  N_TRIPS_NHB_y  7564 non-null   int64
 22  N_WORKERS      7564 non-null   int64
 23  HHIDX          7564 non-null   int64
 24  AGE            7564 non-null   int64
 25  WORKS          7564 non-null   int64
 26  N_WORK_TOURS   7564 non-null   int64
 27  N_OTHER_TOURS  7564 non-null   int64
 28  N_TOURS        7564 non-null   int64
 29  N_TRIPS        7564 non-null   int64
 30  N_TRIPS_HBW    7564 non-null   int64
 31  N_TRIPS_HBO    7564 non-null   int64
 32  N_TRIPS_NHB    7564 non-null   int64
 33  HOMETAZi       7564 non-null   int64
 34  DTAZi          7564 non-null   int64
dtypes: float64(3), int64(31), object(1)
memory usage: 2.1+ MB
[9]:
skims
[9]:
<larch.OMX> ⋯/exampville_skims.omx
 |  shape:(40, 40)
 |  data:
 |    AUTO_COST    (float64)
 |    AUTO_DIST    (float64)
 |    AUTO_TIME    (float64)
 |    BIKE_TIME    (float64)
 |    TRANSIT_FARE (float64)
 |    TRANSIT_IVTT (float64)
 |    TRANSIT_OVTT (float64)
 |    WALK_DIST    (float64)
 |    WALK_TIME    (float64)
 |  lookup:
 |    TAZ_AREA_TYPE (40 |S3)
 |    TAZ_ID        (40 int64)

For this tour mode choice model, we can pick values out of the skims for the known O-D of the tour, so that we have access to the level-of-service data for each possible mode serving that O-D pair. We’ll use the get_rc_dataframe method of the larch.OMX object, which lets us give the a list of the index for the production and attraction (row and column) value we want to use.

[10]:
raw = raw.join(
    skims.get_rc_dataframe(
        raw.HOMETAZi, raw.DTAZi,
    )
)

We can review the raw frame to see what variables are now included.

[11]:
raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7564 entries, 0 to 20736
Data columns (total 44 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   TOURID         7564 non-null   int64
 1   HHID           7564 non-null   int64
 2   PERSONID       7564 non-null   int64
 3   DTAZ           7564 non-null   int64
 4   TOURMODE       7564 non-null   int64
 5   TOURPURP       7564 non-null   int64
 6   N_STOPS        7564 non-null   int64
 7   N_TRIPS_x      7564 non-null   int64
 8   N_TRIPS_HBW_x  7564 non-null   int64
 9   N_TRIPS_HBO_x  7564 non-null   int64
 10  N_TRIPS_NHB_x  7564 non-null   int64
 11  X              7564 non-null   float64
 12  Y              7564 non-null   float64
 13  INCOME         7564 non-null   float64
 14  N_VEHICLES     7564 non-null   int64
 15  HHSIZE         7564 non-null   int64
 16  geometry       7564 non-null   object
 17  HOMETAZ        7564 non-null   int64
 18  N_TRIPS_y      7564 non-null   int64
 19  N_TRIPS_HBW_y  7564 non-null   int64
 20  N_TRIPS_HBO_y  7564 non-null   int64
 21  N_TRIPS_NHB_y  7564 non-null   int64
 22  N_WORKERS      7564 non-null   int64
 23  HHIDX          7564 non-null   int64
 24  AGE            7564 non-null   int64
 25  WORKS          7564 non-null   int64
 26  N_WORK_TOURS   7564 non-null   int64
 27  N_OTHER_TOURS  7564 non-null   int64
 28  N_TOURS        7564 non-null   int64
 29  N_TRIPS        7564 non-null   int64
 30  N_TRIPS_HBW    7564 non-null   int64
 31  N_TRIPS_HBO    7564 non-null   int64
 32  N_TRIPS_NHB    7564 non-null   int64
 33  HOMETAZi       7564 non-null   int64
 34  DTAZi          7564 non-null   int64
 35  AUTO_COST      7564 non-null   float64
 36  AUTO_DIST      7564 non-null   float64
 37  AUTO_TIME      7564 non-null   float64
 38  BIKE_TIME      7564 non-null   float64
 39  TRANSIT_FARE   7564 non-null   float64
 40  TRANSIT_IVTT   7564 non-null   float64
 41  TRANSIT_OVTT   7564 non-null   float64
 42  WALK_DIST      7564 non-null   float64
 43  WALK_TIME      7564 non-null   float64
dtypes: float64(12), int64(31), object(1)
memory usage: 2.9+ MB

Now we can assemble this data, along with the definitions of the alternatives, into a larch.DataFrames object. This object links information about the raw data and the choices into one data structure. It can handle multiple data formats (including idco and idca data simultanously), but here we’ll only be using a single idco format data table.

[12]:
# For clarity, we can define numbers as names for modes
DA = 1
SR = 2
Walk = 3
Bike = 4
Transit = 5
[13]:
dfs = larch.DataFrames(
    co=raw,
    alt_codes=[DA,SR,Walk,Bike,Transit],
    alt_names=['DA','SR','Walk','Bike','Transit'],
    ch_name='TOURMODE',
)

Model Definition

And then we are ready to create our model.

[14]:
m = larch.Model(dataservice=dfs)
m.title = "Exampville Work Tour Mode Choice v1"

We will explicitly define the set of utility functions we want to use. Because the DataFrames we are using to serve data to this model contains exclusively idco format data, we’ll use only the utility_co mapping to define a unique utility function for each alternative.

[15]:
m.utility_co[DA] = (
        + P.InVehTime * X.AUTO_TIME
        + P.Cost * X.AUTO_COST # dollars per mile
)

m.utility_co[SR] = (
        + P.ASC_SR
        + P.InVehTime * X.AUTO_TIME
        + P.Cost * (X.AUTO_COST * 0.5) # dollars per mile, half share
        + P("LogIncome:SR") * X("log(INCOME)")
)

m.utility_co[Walk] = (
        + P.ASC_Walk
        + P.NonMotorTime * X.WALK_TIME
        + P("LogIncome:Walk") * X("log(INCOME)")
)

m.utility_co[Bike] = (
        + P.ASC_Bike
        + P.NonMotorTime * X.BIKE_TIME
        + P("LogIncome:Bike") * X("log(INCOME)")
)

m.utility_co[Transit] = (
        + P.ASC_Transit
        + P.InVehTime * X.TRANSIT_IVTT
        + P.OutVehTime * X.TRANSIT_OVTT
        + P.Cost * X.TRANSIT_FARE
        + P("LogIncome:Transit") * X('log(INCOME)')
)

To write a nested logit mode, we’ll attach some nesting nodes to the model’s graph. Each new_node allows us to define the set of codes for the child nodes (elemental alternatives, or lower level nests) as well as giving the new nest a name and assigning a logsum parameter. The return value of this method is the node code for the newly created nest, which then can potenially be used as a child code when creating a higher level nest. We do this below, adding the ‘Car’ nest into the ‘Motor’ nest.

[16]:
Car = m.graph.new_node(parameter='Mu:Car', children=[DA,SR], name='Car')
NonMotor = m.graph.new_node(parameter='Mu:NonMotor', children=[Walk,Bike], name='NonMotor')
Motor = m.graph.new_node(parameter='Mu:Motor', children=[Car,Transit], name='Motor')

Let’s visually check on the nesting structure.

[17]:
m.graph
[17]:
%3 cluster_elemental 0 0 7 (7) NonMotor Mu:NonMotor 0->7 8 (8) Motor Mu:Motor 0->8 1 (1) DA 2 (2) SR 3 (3) Walk 4 (4) Bike 5 (5) Transit 6 (6) Car Mu:Car 6->1 6->2 7->3 7->4 8->5 8->6

The tour mode choice model’s choice variable is indicated by the code value in ‘TOURMODE’, and this can be defined for the model using choice_co_code.

[18]:
m.choice_co_code = 'TOURMODE'

We can also give a dictionary of availability conditions based on values in the idco data, using the availability_co_vars attribute. Alternatives that are always available can be indicated by setting the criterion to 1.

[19]:
m.availability_co_vars = {
    DA: 'AGE >= 16',
    SR: 1,
    Walk: 'WALK_TIME < 60',
    Bike: 'BIKE_TIME < 60',
    Transit: 'TRANSIT_FARE>0',
}

Then let’s prepare this data for estimation. Even though the data is already in memory, the load_data method is used to pre-process the data, extracting the required values, pre-computing the values of fixed expressions, and assembling the results into contiguous arrays suitable for computing the log likelihood values efficiently.

Model Estimation

[20]:
m.load_data()

We can check on some important statistics of this loaded data even before we estimate the model.

[21]:
m.dataframes.choice_avail_summary()
[21]:
name chosen available
1 DA 6052.0 7564
2 SR 810.0 7564
3 Walk 196.0 4179
4 Bike 72.0 7564
5 Transit 434.0 4199
< Total All Alternatives > 7564.0
[22]:
m.dataframes.data_co.statistics()
[22]:
nminimummaximummedianhistogrammeanstdevzerospositivesnegativesnonzero_minimumnonzero_maximumnonzero_meannonzero_stdev
AUTO_COST75640.1949264.307961.00945
1.206010.7548440756400.1949264.307961.206010.754844
AUTO_TIME75640.93000829.44157.61571
8.222874.581340756400.93000829.44158.222874.58134
BIKE_TIME75642.7846552.232113.5864
15.98279.297320756402.7846552.232115.98279.29732
TRANSIT_FARE756402.52.5
Histograms are purple if the data is represented as discrete values.
1.387821.242383365419902.52.52.50
TRANSIT_IVTT7564012.16681.44769
Histograms are orange if the zeros are numerous and have been excluded.
2.622773.380883365419900.95932212.16684.724613.26497
TRANSIT_OVTT75640.759059128.65237.0499
37.661423.21940756400.759059128.65237.661423.2194
WALK_TIME756411.1386208.92854.3454
63.930837.189307564011.1386208.92863.930837.1893
log(INCOME)75647.5903514.218110.6719
10.84141.013010756407.5903514.218110.84141.01301

If we are satisfied with the statistics we see above, we can go ahead and estimate the model.

[23]:
result = m.maximize_loglike(method='slsqp')

Iteration 046 [Converged]

LL = -3493.0397302098218

value initvalue nullvalue minimum maximum holdfast note best
ASC_Bike -0.258430 0.0 0.0 -inf inf 0 -0.258430
ASC_SR 1.422554 0.0 0.0 -inf inf 0 1.422554
ASC_Transit 6.753915 0.0 0.0 -inf inf 0 6.753915
ASC_Walk 8.621143 0.0 0.0 -inf inf 0 8.621143
Cost -0.175666 0.0 0.0 -inf inf 0 -0.175666
InVehTime -0.123722 0.0 0.0 -inf inf 0 -0.123722
LogIncome:Bike -0.196942 0.0 0.0 -inf inf 0 -0.196942
LogIncome:SR -0.193766 0.0 0.0 -inf inf 0 -0.193766
LogIncome:Transit -0.557110 0.0 0.0 -inf inf 0 -0.557110
LogIncome:Walk -0.522755 0.0 0.0 -inf inf 0 -0.522755
Mu:Car 0.259244 1.0 1.0 0.001 1.0 0 0.259244
Mu:Motor 0.801586 1.0 1.0 0.001 1.0 0 0.801586
Mu:NonMotor 0.853765 1.0 1.0 0.001 1.0 0 0.853765
NonMotorTime -0.265583 0.0 0.0 -inf inf 0 -0.265583
OutVehTime -0.254790 0.0 0.0 -inf inf 0 -0.254790
/Users/jeffnewman/opt/anaconda3/envs/garage37/lib/python3.7/site-packages/numpy/core/_methods.py:38: RuntimeWarning: invalid value encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
/Users/jeffnewman/opt/anaconda3/envs/garage37/lib/python3.7/site-packages/numpy/core/_methods.py:38: RuntimeWarning: invalid value encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)

After we find the best fitting parameters, we can compute some variance-covariance statistics, incuding standard errors of the estimates and t statistics, using calculate_parameter_covariance.

[24]:
m.calculate_parameter_covariance()

Then we can review the results in a variety of report tables.

[25]:
m.parameter_summary()
[25]:
Value Std Err t Stat Signif Null Value
ASC_Bike -0.258  1.34 -0.19 0.00
ASC_SR  1.42  1.00  1.42 0.00
ASC_Transit  6.75  2.06  3.27 ** 0.00
ASC_Walk  8.62  1.14  7.57 *** 0.00
Cost -0.176  0.120 -1.47 0.00
InVehTime -0.124  0.0292 -4.24 *** 0.00
LogIncome:Bike -0.197  0.124 -1.59 0.00
LogIncome:SR -0.194  0.135 -1.43 0.00
LogIncome:Transit -0.557  0.169 -3.29 *** 0.00
LogIncome:Walk -0.523  0.100 -5.21 *** 0.00
Mu:Car  0.259  0.181 -4.10 *** 1.00
Mu:Motor  0.802  0.201 -0.99 1.00
Mu:NonMotor  0.854  0.112 -1.30 1.00
NonMotorTime -0.266  0.0163 -16.29 *** 0.00
OutVehTime -0.255  0.0646 -3.95 *** 0.00
[26]:
m.estimation_statistics()
[26]:
StatisticAggregatePer Case
Number of Cases7564
Log Likelihood at Convergence-3493.04-0.46
Log Likelihood at Null Parameters-10644.66-1.41
Rho Squared w.r.t. Null Parameters0.672

Save and Report Model

[27]:
report = larch.Reporter(title=m.title)
[28]:
report << '# Parameter Summary' << m.parameter_summary()
[28]:

Parameter Summary

Value Std Err t Stat Signif Null Value
ASC_Bike -0.258  1.34 -0.19 0.00
ASC_SR  1.42  1.00  1.42 0.00
ASC_Transit  6.75  2.06  3.27 ** 0.00
ASC_Walk  8.62  1.14  7.57 *** 0.00
Cost -0.176  0.120 -1.47 0.00
InVehTime -0.124  0.0292 -4.24 *** 0.00
LogIncome:Bike -0.197  0.124 -1.59 0.00
LogIncome:SR -0.194  0.135 -1.43 0.00
LogIncome:Transit -0.557  0.169 -3.29 *** 0.00
LogIncome:Walk -0.523  0.100 -5.21 *** 0.00
Mu:Car  0.259  0.181 -4.10 *** 1.00
Mu:Motor  0.802  0.201 -0.99 1.00
Mu:NonMotor  0.854  0.112 -1.30 1.00
NonMotorTime -0.266  0.0163 -16.29 *** 0.00
OutVehTime -0.255  0.0646 -3.95 *** 0.00
[29]:
report << "# Estimation Statistics" << m.estimation_statistics()
[29]:

Estimation Statistics

StatisticAggregatePer Case
Number of Cases7564
Log Likelihood at Convergence-3493.04-0.46
Log Likelihood at Null Parameters-10644.66-1.41
Rho Squared w.r.t. Null Parameters0.672
[30]:
report << "# Utility Functions" << m.utility_functions()
[30]:

Utility Functions

altformula
1
+ P.InVehTime * X.AUTO_TIME
+ P.Cost * X.AUTO_COST
2
+ P.ASC_SR
+ P.InVehTime * X.AUTO_TIME
+ P.Cost * 0.5 * X.AUTO_COST
+ P('LogIncome:SR') * X('log(INCOME)')
3
+ P.ASC_Walk
+ P.NonMotorTime * X.WALK_TIME
+ P('LogIncome:Walk') * X('log(INCOME)')
4
+ P.ASC_Bike
+ P.NonMotorTime * X.BIKE_TIME
+ P('LogIncome:Bike') * X('log(INCOME)')
5
+ P.ASC_Transit
+ P.InVehTime * X.TRANSIT_IVTT
+ P.OutVehTime * X.TRANSIT_OVTT
+ P.Cost * X.TRANSIT_FARE
+ P('LogIncome:Transit') * X('log(INCOME)')
[31]:
report.save(
    '/tmp/exampville_mode_choice.html',
    overwrite=True,
    metadata=m,
)
[31]:
'/tmp/exampville_mode_choice.html'