MTC Work Mode Choice Data
import larch, pandas, os, gzip
larch.__version__
'5.7.0'
The MTC sample dataset is the same data used in the Self Instructing Manual for discrete choice modeling:
The San Francisco Bay Area work mode choice data set comprises 5029 home-to-work commute trips in the San Francisco Bay Area. The data is drawn from the San Francisco Bay Area Household Travel Survey conducted by the Metropolitan Transportation Commission (MTC) in the spring and fall of 1990. This survey included a one day travel diary for each household member older than five years and detailed individual and household socio-demographic information.
from larch.data_warehouse import example_file
with gzip.open(example_file("MTCwork.csv.gz"), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(10)))
casenum,altnum,chose,ivtt,ovtt,tottime,totcost,hhid,perid,numalts,dist,wkzone,hmzone,rspopden,rsempden,wkpopden,wkempden,vehavdum,femdum,age,drlicdum,noncadum,numveh,hhsize,hhinc,famtype,hhowndum,numemphh,numadlt,nmlt5,nm5to11,nm12to16,wkccbd,wknccbd,corredis,vehbywrk,vocc,wgt
1,1,1,13.38,2,15.38,70.63,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
1,2,0,18.38,2,20.38,35.32,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
1,3,0,20.38,2,22.38,20.18,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
1,4,0,25.9,15.2,41.1,115.64,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
1,5,0,40.5,2,42.5,0,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
2,1,0,29.92,10,39.92,390.81,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
2,2,0,34.92,10,44.92,195.4,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
2,3,0,21.92,10,31.92,97.97,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
2,4,1,22.96,14.2,37.16,185,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
The first line of the file contains column headers. After that, each line represents an alternative available to a decision maker. In our sample data, the first 5 lines of data share a casenum of 1, indicating that they are 5 different alternatives available to the first decision maker. The identity of each alternative is given by the number in the column altnum. The observed choice of the decision maker is indicated in the column chose, with a 1 in the row of the chosen alternative.
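The row layout described above can be checked with a short standard-library sketch. It uses a handful of columns copied from the nine data rows in the preview; the variable names `sample`, `chosen`, and `n_alts` are illustrative and not part of Larch. Case 2 shows only four rows here because the preview stops after ten lines of the file.

```python
import csv
import io

# A few columns from the nine data rows shown in the preview above.
sample = """casenum,altnum,chose,ivtt,totcost
1,1,1,13.38,70.63
1,2,0,18.38,35.32
1,3,0,20.38,20.18
1,4,0,25.9,115.64
1,5,0,40.5,0
2,1,0,29.92,390.81
2,2,0,34.92,195.4
2,3,0,21.92,97.97
2,4,1,22.96,185
"""

chosen = {}  # casenum -> altnum of the chosen alternative
n_alts = {}  # casenum -> number of alternative rows seen for that case

for row in csv.DictReader(io.StringIO(sample)):
    case = int(row["casenum"])
    n_alts[case] = n_alts.get(case, 0) + 1
    if row["chose"] == "1":
        # A second chosen row for the same case would indicate a data problem.
        assert case not in chosen, f"case {case} has multiple chosen rows"
        chosen[case] = int(row["altnum"])

print(chosen)  # {1: 1, 2: 4}
print(n_alts)  # {1: 5, 2: 4}
```

Each case should end up with exactly one entry in `chosen`; a case missing from that dictionary would have no observed choice.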
We can load this data easily using pandas. We’ll also set the index of the resulting DataFrame to be the case and alt identifiers.
df = pandas.read_csv(example_file("MTCwork.csv.gz"), index_col=['casenum','altnum'])
df.head(15)
casenum | altnum | chose | ivtt | ovtt | tottime | totcost | hhid | perid | numalts | dist | wkzone | ... | numadlt | nmlt5 | nm5to11 | nm12to16 | wkccbd | wknccbd | corredis | vehbywrk | vocc | wgt
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 1 | 1 | 13.38 | 2.0 | 15.38 | 70.63 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1
1 | 2 | 0 | 18.38 | 2.0 | 20.38 | 35.32 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1
1 | 3 | 0 | 20.38 | 2.0 | 22.38 | 20.18 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1
1 | 4 | 0 | 25.90 | 15.2 | 41.10 | 115.64 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1
1 | 5 | 0 | 40.50 | 2.0 | 42.50 | 0.00 | 2 | 1 | 2 | 7.69 | 664 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4.00 | 1 | 1
2 | 1 | 0 | 29.92 | 10.0 | 39.92 | 390.81 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1
2 | 2 | 0 | 34.92 | 10.0 | 44.92 | 195.40 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1
2 | 3 | 0 | 21.92 | 10.0 | 31.92 | 97.97 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1
2 | 4 | 1 | 22.96 | 14.2 | 37.16 | 185.00 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1
2 | 5 | 0 | 58.95 | 10.0 | 68.95 | 0.00 | 3 | 1 | 2 | 11.62 | 738 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.00 | 0 | 1
3 | 1 | 1 | 8.60 | 6.0 | 14.60 | 37.76 | 5 | 1 | 2 | 4.10 | 696 | ... | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.33 | 1 | 1
3 | 2 | 0 | 13.60 | 6.0 | 19.60 | 18.88 | 5 | 1 | 2 | 4.10 | 696 | ... | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.33 | 1 | 1
3 | 3 | 0 | 15.60 | 6.0 | 21.60 | 10.79 | 5 | 1 | 2 | 4.10 | 696 | ... | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.33 | 1 | 1
3 | 4 | 0 | 16.87 | 21.4 | 38.27 | 105.00 | 5 | 1 | 2 | 4.10 | 696 | ... | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.33 | 1 | 1
4 | 1 | 0 | 30.60 | 8.5 | 39.10 | 417.32 | 6 | 1 | 2 | 14.58 | 665 | ... | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1.00 | 0 | 1
15 rows × 36 columns
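With the case and alternative identifiers as a two-level index, rows can be selected by case, by alternative, or by both. The following is a minimal sketch on a small inline sample copied from the preview rows rather than the full file; `df_small` and the other variable names are illustrative.

```python
import io
import pandas as pd

# Three columns from the preview rows, indexed the same way as df above.
sample = """casenum,altnum,chose,ivtt
1,1,1,13.38
1,2,0,18.38
1,3,0,20.38
1,4,0,25.9
1,5,0,40.5
2,1,0,29.92
2,2,0,34.92
2,3,0,21.92
2,4,1,22.96
"""
df_small = pd.read_csv(io.StringIO(sample), index_col=["casenum", "altnum"])

# All alternatives listed for case 1 (the result is indexed by altnum).
case_1 = df_small.loc[1]

# A single cell: was alternative 4 chosen in case 2?
chose_2_4 = df_small.loc[(2, 4), "chose"]

# Alternative 1 across all cases (the result is indexed by casenum).
alt_1 = df_small.xs(1, level="altnum")

print(case_1)
print(chose_2_4)
print(alt_1)
```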
df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 22033 entries, (1, 1) to (5029, 6)
Data columns (total 36 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 chose 22033 non-null int64
1 ivtt 22033 non-null float64
2 ovtt 22033 non-null float64
3 tottime 22033 non-null float64
4 totcost 22033 non-null float64
5 hhid 22033 non-null int64
6 perid 22033 non-null int64
7 numalts 22033 non-null int64
8 dist 22033 non-null float64
9 wkzone 22033 non-null int64
10 hmzone 22033 non-null int64
11 rspopden 22033 non-null float64
12 rsempden 22033 non-null float64
13 wkpopden 22033 non-null float64
14 wkempden 22033 non-null float64
15 vehavdum 22033 non-null int64
16 femdum 22033 non-null int64
17 age 22033 non-null int64
18 drlicdum 22033 non-null int64
19 noncadum 22033 non-null int64
20 numveh 22033 non-null int64
21 hhsize 22033 non-null int64
22 hhinc 22033 non-null float64
23 famtype 22033 non-null int64
24 hhowndum 22033 non-null int64
25 numemphh 22033 non-null int64
26 numadlt 22033 non-null int64
27 nmlt5 22033 non-null int64
28 nm5to11 22033 non-null int64
29 nm12to16 22033 non-null int64
30 wkccbd 22033 non-null int64
31 wknccbd 22033 non-null int64
32 corredis 22033 non-null int64
33 vehbywrk 22033 non-null float64
34 vocc 22033 non-null int64
35 wgt 22033 non-null int64
dtypes: float64(11), int64(25)
memory usage: 6.3 MB
d = larch.DataFrames(df, ch='chose', crack=True)
d.info()
larch.DataFrames: (not computation-ready)
n_cases: 5029
n_alts: 6
data_ce: 5 variables, 22033 rows
data_co: 31 variables
data_av: <populated>
data_ch: chose
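The `d.info()` output suggests that the 36 columns were split into 5 variables that vary across alternatives (`data_ce`) and 31 that are constant within each case (`data_co`). The sketch below illustrates that kind of split in plain pandas on a small inline sample. It is an illustration of the idea under that assumption, not Larch's own implementation, and `df_small`, `ce_columns`, `co_columns`, and `data_co_like` are hypothetical names.

```python
import io
import pandas as pd

# A subset of columns from the preview rows shown earlier.
sample = """casenum,altnum,chose,ivtt,ovtt,totcost,hhid,dist
1,1,1,13.38,2,70.63,2,7.69
1,2,0,18.38,2,35.32,2,7.69
1,3,0,20.38,2,20.18,2,7.69
1,4,0,25.9,15.2,115.64,2,7.69
1,5,0,40.5,2,0,2,7.69
2,1,0,29.92,10,390.81,3,11.62
2,2,0,34.92,10,195.4,3,11.62
2,3,0,21.92,10,97.97,3,11.62
2,4,1,22.96,14.2,185,3,11.62
"""
df_small = pd.read_csv(io.StringIO(sample), index_col=["casenum", "altnum"])

# Count the distinct values of every column within each case.
distinct_per_case = df_small.groupby(level="casenum").nunique()

# A column varies across alternatives if any case holds more than one value.
varies = (distinct_per_case > 1).any(axis=0)

ce_columns = varies[varies].index.tolist()   # vary across alternatives
co_columns = varies[~varies].index.tolist()  # constant within each case

# Case-only variables need just one row per case.
data_co_like = df_small[co_columns].groupby(level="casenum").first()

print(ce_columns)  # ['chose', 'ivtt', 'ovtt', 'totcost']
print(co_columns)  # ['hhid', 'dist']
print(data_co_like)
```

Storing the case-only variables once per case, rather than once per row, avoids repeating the same household and person attributes for every alternative.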
d.alternative_codes()
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64', name='altnum')
d.alternative_names()
The set of all possible alternative codes is deduced automatically from the values in the altnum column. However, codes that are set automatically are not very descriptive, as the CSV data file does not have enough information to tell what each alternative code number means. We can supply a name for each code explicitly.
d.set_alternative_names({
    1: 'DA',
    2: 'SR2',
    3: 'SR3+',
    4: 'Transit',
    5: 'Bike',
    6: 'Walk',
})
d.alternative_names()
['DA', 'SR2', 'SR3+', 'Transit', 'Bike', 'Walk']
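The same code-to-name mapping can be applied directly in pandas, which may be convenient for quick tabulations outside of Larch. This is a minimal sketch on a small inline sample copied from the preview rows; `df_small`, `names`, and the result variables are illustrative names. Code 6 (Walk) does not occur in these nine rows, so it is absent from the results.

```python
import io
import pandas as pd

sample = """casenum,altnum,chose
1,1,1
1,2,0
1,3,0
1,4,0
1,5,0
2,1,0
2,2,0
2,3,0
2,4,1
"""
df_small = pd.read_csv(io.StringIO(sample), index_col=["casenum", "altnum"])

# Alternative codes deduced from the values in the altnum index level.
codes = sorted(int(c) for c in df_small.index.get_level_values("altnum").unique())

# The same mapping passed to set_alternative_names above.
names = {1: 'DA', 2: 'SR2', 3: 'SR3+', 4: 'Transit', 5: 'Bike', 6: 'Walk'}

# Number of rows per alternative, and number of times each was chosen.
rows_per_alt = df_small.groupby(level="altnum").size().rename(index=names)
chosen_per_alt = df_small["chose"].groupby(level="altnum").sum().rename(index=names)

print(codes)  # [1, 2, 3, 4, 5]
print(rows_per_alt)
print(chosen_per_alt)
```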