MTC Work Mode Choice Data

MTC Work Mode Choice Data

import larch, pandas, os, gzip
larch.__version__
'5.7.0'

The MTC sample dataset is the same data used in the Self Instructing Manual for discrete choice modeling:

The San Francisco Bay Area work mode choice data set comprises 5029 home-to-work commute trips in the San Francisco Bay Area. The data is drawn from the San Francisco Bay Area Household Travel Survey conducted by the Metropolitan Transportation Commission (MTC) in the spring and fall of 1990. This survey included a one day travel diary for each household member older than five years and detailed individual and household socio-demographic information.

from larch.data_warehouse import example_file
with gzip.open(example_file("MTCwork.csv.gz"), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(10)))
casenum,altnum,chose,ivtt,ovtt,tottime,totcost,hhid,perid,numalts,dist,wkzone,hmzone,rspopden,rsempden,wkpopden,wkempden,vehavdum,femdum,age,drlicdum,noncadum,numveh,hhsize,hhinc,famtype,hhowndum,numemphh,numadlt,nmlt5,nm5to11,nm12to16,wkccbd,wknccbd,corredis,vehbywrk,vocc,wgt
 1,1,1,13.38,2,15.38,70.63,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,2,0,18.38,2,20.38,35.32,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,3,0,20.38,2,22.38,20.18,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,4,0,25.9,15.2,41.1,115.64,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,5,0,40.5,2,42.5,0,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 2,1,0,29.92,10,39.92,390.81,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,2,0,34.92,10,44.92,195.4,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,3,0,21.92,10,31.92,97.97,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,4,1,22.96,14.2,37.16,185,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1

The first line of the file contains column headers. After that, each line represents an alternative available to a decision maker. In our sample data, we see the first 5 lines of data share a caseid of 1, indicating that they are 5 different alternatives available to the first decision maker. The identity of the alternatives is given by the number in the column altid. The observed choice of the decision maker is indicated in the column chose with a 1 in the appropriate row.

We can load this data easily using pandas. We’ll also set the index of the resulting DataFrame to be the case and alt identifiers.

df = pandas.read_csv(example_file("MTCwork.csv.gz"), index_col=['casenum','altnum'])
df.head(15)
chose ivtt ovtt tottime totcost hhid perid numalts dist wkzone ... numadlt nmlt5 nm5to11 nm12to16 wkccbd wknccbd corredis vehbywrk vocc wgt
casenum altnum
1 1 1 13.38 2.0 15.38 70.63 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
2 0 18.38 2.0 20.38 35.32 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
3 0 20.38 2.0 22.38 20.18 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
4 0 25.90 15.2 41.10 115.64 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
5 0 40.50 2.0 42.50 0.00 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
2 1 0 29.92 10.0 39.92 390.81 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
2 0 34.92 10.0 44.92 195.40 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
3 0 21.92 10.0 31.92 97.97 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
4 1 22.96 14.2 37.16 185.00 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
5 0 58.95 10.0 68.95 0.00 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
3 1 1 8.60 6.0 14.60 37.76 5 1 2 4.10 696 ... 3 2 0 0 0 1 0 0.33 1 1
2 0 13.60 6.0 19.60 18.88 5 1 2 4.10 696 ... 3 2 0 0 0 1 0 0.33 1 1
3 0 15.60 6.0 21.60 10.79 5 1 2 4.10 696 ... 3 2 0 0 0 1 0 0.33 1 1
4 0 16.87 21.4 38.27 105.00 5 1 2 4.10 696 ... 3 2 0 0 0 1 0 0.33 1 1
4 1 0 30.60 8.5 39.10 417.32 6 1 2 14.58 665 ... 2 1 0 0 1 0 0 1.00 0 1

15 rows × 36 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 22033 entries, (1, 1) to (5029, 6)
Data columns (total 36 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   chose     22033 non-null  int64  
 1   ivtt      22033 non-null  float64
 2   ovtt      22033 non-null  float64
 3   tottime   22033 non-null  float64
 4   totcost   22033 non-null  float64
 5   hhid      22033 non-null  int64  
 6   perid     22033 non-null  int64  
 7   numalts   22033 non-null  int64  
 8   dist      22033 non-null  float64
 9   wkzone    22033 non-null  int64  
 10  hmzone    22033 non-null  int64  
 11  rspopden  22033 non-null  float64
 12  rsempden  22033 non-null  float64
 13  wkpopden  22033 non-null  float64
 14  wkempden  22033 non-null  float64
 15  vehavdum  22033 non-null  int64  
 16  femdum    22033 non-null  int64  
 17  age       22033 non-null  int64  
 18  drlicdum  22033 non-null  int64  
 19  noncadum  22033 non-null  int64  
 20  numveh    22033 non-null  int64  
 21  hhsize    22033 non-null  int64  
 22  hhinc     22033 non-null  float64
 23  famtype   22033 non-null  int64  
 24  hhowndum  22033 non-null  int64  
 25  numemphh  22033 non-null  int64  
 26  numadlt   22033 non-null  int64  
 27  nmlt5     22033 non-null  int64  
 28  nm5to11   22033 non-null  int64  
 29  nm12to16  22033 non-null  int64  
 30  wkccbd    22033 non-null  int64  
 31  wknccbd   22033 non-null  int64  
 32  corredis  22033 non-null  int64  
 33  vehbywrk  22033 non-null  float64
 34  vocc      22033 non-null  int64  
 35  wgt       22033 non-null  int64  
dtypes: float64(11), int64(25)
memory usage: 6.3 MB
d = larch.DataFrames(df, ch='chose', crack=True)
d.info()
larch.DataFrames:  (not computation-ready)
  n_cases: 5029
  n_alts: 6
  data_ce: 5 variables, 22033 rows
  data_co: 31 variables
  data_av: <populated>
  data_ch: chose
d.alternative_codes()
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64', name='altnum')
d.alternative_names()

The set of all possible alternative codes is deduced automatically from all the values in the altid column. However, the alterative codes are not very descriptive when they are set automatically, as the csv data file does not have enough information to tell what each alternative code number means.

d.set_alternative_names({
    1: 'DA',
    2: 'SR2',
    3: 'SR3+',
    4: 'Transit',
    5: 'Bike',
    6: 'Walk',
})
d.alternative_names()
['DA', 'SR2', 'SR3+', 'Transit', 'Bike', 'Walk']