MTC Work Mode Choice Data
import os, gzip
import numpy as np, pandas as pd, xarray as xr
import larch.numba as lx
/home/runner/work/larch/larch/larch/larch/numba/model.py:23: UserWarning:
### larch.numba is experimental, and not feature-complete ###
the first time you import on a new system, this package will
compile optimized binaries for your machine, which may take
a little while, please be patient
warnings.warn( ### EXPERIMENTAL ### )
The MTC sample dataset is the same data used in the Self Instructing Manual [Koppelman and Bhat, 2006] for discrete choice modeling:
The San Francisco Bay Area work mode choice data set comprises 5029 home-to-work commute trips in the San Francisco Bay Area. The data is drawn from the San Francisco Bay Area Household Travel Survey conducted by the Metropolitan Transportation Commission (MTC) in the spring and fall of 1990. This survey included a one day travel diary for each household member older than five years and detailed individual and household socio-demographic information.
In this example we will import the MTC example dataset, starting from a csv text file in idca format. Suppose that data file is gzipped, named “MTCwork.csv.gz”, and located in the current directory (use os.getcwd to see what the current directory is). For this example, we’ll use the example_file method to find the file that comes with Larch.
We can take a peek at the contents of the file, examining the first 10 lines:
with gzip.open(lx.example_file("MTCwork.csv.gz"), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(10)))
casenum,altnum,chose,ivtt,ovtt,tottime,totcost,hhid,perid,numalts,dist,wkzone,hmzone,rspopden,rsempden,wkpopden,wkempden,vehavdum,femdum,age,drlicdum,noncadum,numveh,hhsize,hhinc,famtype,hhowndum,numemphh,numadlt,nmlt5,nm5to11,nm12to16,wkccbd,wknccbd,corredis,vehbywrk,vocc,wgt
1,1,1,13.38,2,15.38,70.63,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
1,2,0,18.38,2,20.38,35.32,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
1,3,0,20.38,2,22.38,20.18,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
1,4,0,25.9,15.2,41.1,115.64,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
1,5,0,40.5,2,42.5,0,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
2,1,0,29.92,10,39.92,390.81,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
2,2,0,34.92,10,44.92,195.4,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
2,3,0,21.92,10,31.92,97.97,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
2,4,1,22.96,14.2,37.16,185,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
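The preview above relies on the file having at least 10 lines: in Python 3.7 and later, calling next() inside a generator expression on an exhausted file raises a RuntimeError. As a minimal sketch of a more forgiving preview, the following uses itertools.islice, which simply stops early on a short file. It uses only the standard library and writes its own tiny gzipped file, so it runs without Larch; the file name sample.csv.gz and the helper preview_lines are hypothetical, and the sample keeps only the first five columns of the real data.

```python
import gzip
import os
import tempfile
from itertools import islice

# Header and two data rows copied from the preview above,
# truncated to the first five columns for brevity.
sample = (
    "casenum,altnum,chose,ivtt,ovtt\n"
    "1,1,1,13.38,2\n"
    "1,2,0,18.38,2\n"
)


def preview_lines(path, n=10):
    """Return up to n lines of a gzipped text file, without trailing newlines."""
    with gzip.open(path, "rt") as f:
        # islice stops quietly if the file has fewer than n lines.
        return [line.rstrip("\n") for line in islice(f, n)]


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.csv.gz")  # hypothetical file name
    with gzip.open(path, "wt") as f:
        f.write(sample)
    preview = preview_lines(path, n=10)

for line in preview:
    print(line)
```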
The first line of the file contains column headers. After that, each line represents an alternative available to a decision maker. In our sample data, the first 5 lines of data share a casenum of 1, indicating that they are 5 different alternatives available to the first decision maker. The identity of each alternative is given by the number in the altnum column. The observed choice of the decision maker is indicated in the chose column, with a 1 in the appropriate row.
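The layout described above can be checked mechanically. The following is a minimal sketch, using only the Python standard library and the nine data rows from the preview (with most columns dropped for brevity), that groups rows by casenum and confirms that each case has exactly one chosen alternative. The variable names are illustrative and are not part of Larch.

```python
import csv
import io
from collections import defaultdict

# The nine data rows from the preview, keeping only four columns.
sample = """casenum,altnum,chose,ivtt
1,1,1,13.38
1,2,0,18.38
1,3,0,20.38
1,4,0,25.9
1,5,0,40.5
2,1,0,29.92
2,2,0,34.92
2,3,0,21.92
2,4,1,22.96
"""

alts_by_case = defaultdict(list)    # casenum -> altnum codes listed for that case
chosen_by_case = defaultdict(list)  # casenum -> altnum codes flagged as chosen

for row in csv.DictReader(io.StringIO(sample)):
    case, alt = int(row["casenum"]), int(row["altnum"])
    alts_by_case[case].append(alt)
    if row["chose"] == "1":
        chosen_by_case[case].append(alt)

# A well-formed idca file has exactly one chosen alternative per case.
bad_cases = [c for c in alts_by_case if len(chosen_by_case[c]) != 1]
chosen = {c: alts[0] for c, alts in chosen_by_case.items() if len(alts) == 1}

print(dict(alts_by_case))
print(chosen)
```

An empty bad_cases list indicates that every case in the sample has a single observed choice.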
We can load this data easily using pandas. We’ll also set the index of the resulting DataFrame to be the case and alt identifiers.
df = pd.read_csv(lx.example_file("MTCwork.csv.gz"), index_col=['casenum','altnum'])
df.head(15)
casenum | altnum | chose | ivtt | ovtt | tottime | totcost | hhid | perid | numalts | dist | wkzone | ... | numadlt | nmlt5 | nm5to11 | nm12to16 | wkccbd | wknccbd | corredis | vehbywrk | vocc | wgt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 13.38 | 2.0 | 15.38 | 70.63 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1 |
1 | 2 | 0 | 18.38 | 2.0 | 20.38 | 35.32 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1 |
1 | 3 | 0 | 20.38 | 2.0 | 22.38 | 20.18 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1 |
1 | 4 | 0 | 25.90 | 15.2 | 41.10 | 115.64 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1 |
1 | 5 | 0 | 40.50 | 2.0 | 42.50 | 0.00 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1 |
2 | 1 | 0 | 29.92 | 10.0 | 39.92 | 390.81 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1 |
2 | 2 | 0 | 34.92 | 10.0 | 44.92 | 195.40 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1 |
2 | 3 | 0 | 21.92 | 10.0 | 31.92 | 97.97 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1 |
2 | 4 | 1 | 22.96 | 14.2 | 37.16 | 185.00 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1 |
2 | 5 | 0 | 58.95 | 10.0 | 68.95 | 0.00 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1 |
3 | 1 | 1 | 8.60 | 6.0 | 14.60 | 37.76 | 5 | 1 | 2 | 4.10 | 696 | ... | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.33 | 1 | 1 |
3 | 2 | 0 | 13.60 | 6.0 | 19.60 | 18.88 | 5 | 1 | 2 | 4.10 | 696 | ... | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.33 | 1 | 1 |
3 | 3 | 0 | 15.60 | 6.0 | 21.60 | 10.79 | 5 | 1 | 2 | 4.10 | 696 | ... | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.33 | 1 | 1 |
3 | 4 | 0 | 16.87 | 21.4 | 38.27 | 105.00 | 5 | 1 | 2 | 4.10 | 696 | ... | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.33 | 1 | 1 |
4 | 1 | 0 | 30.60 | 8.5 | 39.10 | 417.32 | 6 | 1 | 2 | 14.58 | 665 | ... | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1.00 | 0 | 1 |
15 rows × 36 columns
To prepare this data for use with the latest version of Larch, we’ll want to convert this DataFrame into a larch.Dataset. For idca format like this, we can use the from_idca constructor to do so easily.
ds = lx.Dataset.construct.from_idca(df)
ds
<xarray.Dataset>
Dimensions:   (casenum: 5029, altnum: 6)
Coordinates:
  * casenum   (casenum) int64 1 2 3 4 5 6 7 ... 5024 5025 5026 5027 5028 5029
  * altnum    (altnum) int64 1 2 3 4 5 6
Data variables: (12/37)
    chose     (casenum, altnum) int64 1 0 0 0 0 ... 0 0 0 0 0
    ivtt      (casenum, altnum) float64 13.38 18.38 20.38 25.9 ... 1.59 6.55 0.0
    ovtt      (casenum, altnum) float64 2.0 2.0 2.0 15.2 ... 4.5 16.0 4.5 0.0
    tottime   (casenum, altnum) float64 15.38 20.38 22.38 ... 17.59 11.05 19.1
    totcost   (casenum, altnum) float64 70.63 35.32 20.18 115.6 ... 75.0 0.0 0.0
    hhid      (casenum) int64 2 3 5 6 8 8 12 ... 9429 9430 9433 9434 9436 9438
    ...        ...
    wknccbd   (casenum) int64 0 0 1 0 1 1 1 0 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
    corredis  (casenum) int64 0 1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
    vehbywrk  (casenum) float64 4.0 1.0 0.33 1.0 0.0 0.0 ... 2.0 2.0 2.0 3.0 3.0
    vocc      (casenum) int64 1 0 1 0 2 0 1 1 1 1 0 1 ... 1 1 2 1 1 0 1 2 1 1 1
    wgt       (casenum) int64 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 1
    _avail_   (casenum, altnum) int8 1 1 1 1 1 0 1 1 1 1 ... 1 1 0 1 1 1 1 1 1 1
Attributes:
    _caseid_:  casenum
    _altid_:   altnum
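The output above has one cell per (casenum, altnum) pair, along with an _avail_ variable, even though not every case lists every alternative in the csv file. As a rough illustration of that idea only (this is not Larch's implementation), the sketch below turns sparse idca rows into dense rectangular arrays plus an availability mask, using plain Python lists. The ivtt values are copied from cases 1 and 3 in the table above; the fill value of 0.0 for missing cells is an assumption made for this sketch.

```python
# (casenum, altnum) -> ivtt, copied from the table above.
# Case 3 has no row for alternative 5.
idca_ivtt = {
    (1, 1): 13.38, (1, 2): 18.38, (1, 3): 20.38, (1, 4): 25.90, (1, 5): 40.50,
    (3, 1): 8.60, (3, 2): 13.60, (3, 3): 15.60, (3, 4): 16.87,
}

# The dimensions are the sorted unique case and alternative identifiers.
cases = sorted({case for case, _ in idca_ivtt})
alts = sorted({alt for _, alt in idca_ivtt})

# One cell per (case, alternative) pair. Missing rows get a placeholder
# value (assumed here to be 0.0) and an availability flag of 0.
ivtt = [[idca_ivtt.get((c, a), 0.0) for a in alts] for c in cases]
avail = [[int((c, a) in idca_ivtt) for a in alts] for c in cases]

print(cases, alts)
print(ivtt)
print(avail)
```

The availability mask is what lets a rectangular array distinguish a genuinely missing alternative from one whose attribute value happens to be zero.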
Larch can automatically analyze the data to find variables that do not vary across alternatives within cases, and transform those into idco format variables. If you would prefer that Larch not do this, you can set the crack argument to False. This is particularly important for larger datasets (the data sample included is only a tiny extract of the data that might be available for this kind of model), as breaking the data into separate idca and idco parts is a relatively expensive operation, and it is not actually required for most model structures.
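To make the idea of variables that "do not vary across alternatives within cases" concrete, here is a minimal standard-library sketch that scans a few rows and separates columns that are constant within every case from those that are not. It only illustrates the concept and makes no claim about how Larch performs this step internally. The rows are copied from the first two cases of the sample data, keeping a handful of columns, and the names idco_columns and idca_columns are illustrative.

```python
import csv
import io
from collections import defaultdict

# Three alternatives for each of the first two cases, with a few columns kept.
sample = """casenum,altnum,ivtt,totcost,hhid,dist
1,1,13.38,70.63,2,7.69
1,2,18.38,35.32,2,7.69
1,3,20.38,20.18,2,7.69
2,1,29.92,390.81,3,11.62
2,2,34.92,195.4,3,11.62
2,3,21.92,97.97,3,11.62
"""

rows = list(csv.DictReader(io.StringIO(sample)))
data_columns = [c for c in rows[0] if c not in ("casenum", "altnum")]

# For each column, collect the distinct values observed within each case.
values = {col: defaultdict(set) for col in data_columns}
for row in rows:
    for col in data_columns:
        values[col][row["casenum"]].add(row[col])

# A column is a candidate for idco format when every case shows one value.
idco_columns = [
    c for c in data_columns if all(len(v) == 1 for v in values[c].values())
]
idca_columns = [c for c in data_columns if c not in idco_columns]

print(idco_columns)
print(idca_columns)
```

Even in this small form the scan touches every row and column, which gives a sense of why the operation becomes relatively expensive on large datasets.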
The set of all possible alternative codes is deduced automatically from all the values in the altnum column. However, the alternative codes are not very descriptive when they are set automatically, as the csv data file does not have enough information to tell what each alternative code number means. We can use the set_altnames method to attach more descriptive names.
ds = ds.dc.set_altnames({
    1:'DA', 2:'SR2', 3:'SR3+', 4:'Transit', 5:'Bike', 6:'Walk',
})
ds
<xarray.Dataset>
Dimensions:   (casenum: 5029, altnum: 6)
Coordinates:
  * casenum   (casenum) int64 1 2 3 4 5 6 7 ... 5024 5025 5026 5027 5028 5029
  * altnum    (altnum) int64 1 2 3 4 5 6
    altnames  (altnum) <U7 'DA' 'SR2' 'SR3+' 'Transit' 'Bike' 'Walk'
Data variables: (12/37)
    chose     (casenum, altnum) int64 1 0 0 0 0 ... 0 0 0 0 0
    ivtt      (casenum, altnum) float64 13.38 18.38 20.38 25.9 ... 1.59 6.55 0.0
    ovtt      (casenum, altnum) float64 2.0 2.0 2.0 15.2 ... 4.5 16.0 4.5 0.0
    tottime   (casenum, altnum) float64 15.38 20.38 22.38 ... 17.59 11.05 19.1
    totcost   (casenum, altnum) float64 70.63 35.32 20.18 115.6 ... 75.0 0.0 0.0
    hhid      (casenum) int64 2 3 5 6 8 8 12 ... 9429 9430 9433 9434 9436 9438
    ...        ...
    wknccbd   (casenum) int64 0 0 1 0 1 1 1 0 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
    corredis  (casenum) int64 0 1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
    vehbywrk  (casenum) float64 4.0 1.0 0.33 1.0 0.0 0.0 ... 2.0 2.0 2.0 3.0 3.0
    vocc      (casenum) int64 1 0 1 0 2 0 1 1 1 1 0 1 ... 1 1 2 1 1 0 1 2 1 1 1
    wgt       (casenum) int64 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1 1
    _avail_   (casenum, altnum) int8 1 1 1 1 1 0 1 1 1 1 ... 1 1 0 1 1 1 1 1 1 1
Attributes:
    _caseid_:  casenum
    _altid_:   altnum
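Once names are attached, summaries keyed by alternative become easier to read. The short sketch below, which uses only the standard library and does not call Larch, applies the same code-to-name mapping to the observed choices of the first three cases (read from the chose column in the table shown earlier) and tallies them by name. It also checks that every code in the data has a name before labeling. The variable names are illustrative.

```python
from collections import Counter

# The same mapping passed to set_altnames above.
altnames = {1: "DA", 2: "SR2", 3: "SR3+", 4: "Transit", 5: "Bike", 6: "Walk"}

# Chosen altnum for cases 1 to 3, read from the chose column shown earlier.
chosen_by_case = {1: 1, 2: 4, 3: 1}

# Any code that appears in the data but has no name would show up here.
unnamed = sorted(set(chosen_by_case.values()) - set(altnames))

# Tally the observed choices by descriptive name instead of numeric code.
counts = Counter(altnames[alt] for alt in chosen_by_case.values())

print(unnamed)
print(counts)
```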