MTC Work Mode Choice Data

The MTC sample dataset is the same data used in the Self Instructing Manual for discrete choice modeling:

The San Francisco Bay Area work mode choice data set comprises 5029 home-to-work commute trips in the San Francisco Bay Area. The data is drawn from the San Francisco Bay Area Household Travel Survey conducted by the Metropolitan Transportation Commission (MTC) in the spring and fall of 1990. This survey included a one-day travel diary for each household member older than five years, along with detailed individual and household socio-demographic information.
[1]:
import larch
import pandas
import os
import gzip

In this example we will import the MTC example dataset, starting from a csv text file in idca format. Suppose that data file is gzipped, named “MTCwork.csv.gz”, and located in the current directory (use os.getcwd() to see the current directory). For this example, we’ll use the example_file function to find the copy of this file that comes with Larch.

We can take a peek at the contents of the file, examining the first 10 lines:

[2]:
from larch.data_warehouse import example_file
[3]:
with gzip.open(example_file("MTCwork.csv.gz"), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(10)))
casenum,altnum,chose,ivtt,ovtt,tottime,totcost,hhid,perid,numalts,dist,wkzone,hmzone,rspopden,rsempden,wkpopden,wkempden,vehavdum,femdum,age,drlicdum,noncadum,numveh,hhsize,hhinc,famtype,hhowndum,numemphh,numadlt,nmlt5,nm5to11,nm12to16,wkccbd,wknccbd,corredis,vehbywrk,vocc,wgt
 1,1,1,13.38,2,15.38,70.63,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,2,0,18.38,2,20.38,35.32,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,3,0,20.38,2,22.38,20.18,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,4,0,25.9,15.2,41.1,115.64,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,5,0,40.5,2,42.5,0,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 2,1,0,29.92,10,39.92,390.81,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,2,0,34.92,10,44.92,195.4,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,3,0,21.92,10,31.92,97.97,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,4,1,22.96,14.2,37.16,185,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1

The first line of the file contains column headers. After that, each line represents one alternative available to a decision maker. In our sample data, we see the first 5 lines of data share a casenum of 1, indicating that they are 5 different alternatives available to the first decision maker. The identity of each alternative is given by the number in the altnum column. The observed choice of the decision maker is indicated in the chose column, with a 1 in the appropriate row.
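A useful sanity check on data in this format is that each case has exactly one row with chose equal to 1. A minimal sketch of that check, using a tiny hypothetical frame with the same layout (not the actual MTC data):

```python
import pandas as pd

# A tiny hypothetical idca-style frame mirroring the MTC layout.
toy = pd.DataFrame({
    'casenum': [1, 1, 1, 2, 2],
    'altnum':  [1, 2, 3, 1, 4],
    'chose':   [1, 0, 0, 0, 1],
})

# In idca data, 'chose' should sum to exactly 1 within each case.
chosen_per_case = toy.groupby('casenum')['chose'].sum()
assert (chosen_per_case == 1).all()
```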

We can load this data easily using pandas. We’ll also set the index of the resulting DataFrame to be the case and alt identifiers.

[4]:
df = pandas.read_csv(example_file("MTCwork.csv.gz"))
df.set_index(['casenum','altnum'], inplace=True)
[5]:
df.head(12)
[5]:
chose ivtt ovtt tottime totcost hhid perid numalts dist wkzone ... numadlt nmlt5 nm5to11 nm12to16 wkccbd wknccbd corredis vehbywrk vocc wgt
casenum altnum
1 1 1 13.38 2.0 15.38 70.63 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
2 0 18.38 2.0 20.38 35.32 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
3 0 20.38 2.0 22.38 20.18 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
4 0 25.90 15.2 41.10 115.64 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
5 0 40.50 2.0 42.50 0.00 2 1 2 7.69 664 ... 1 0 0 0 0 0 0 4.00 1 1
2 1 0 29.92 10.0 39.92 390.81 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
2 0 34.92 10.0 44.92 195.40 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
3 0 21.92 10.0 31.92 97.97 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
4 1 22.96 14.2 37.16 185.00 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
5 0 58.95 10.0 68.95 0.00 3 1 2 11.62 738 ... 1 0 0 0 1 0 1 1.00 0 1
3 1 1 8.60 6.0 14.60 37.76 5 1 2 4.10 696 ... 3 2 0 0 0 1 0 0.33 1 1
2 0 13.60 6.0 19.60 18.88 5 1 2 4.10 696 ... 3 2 0 0 0 1 0 0.33 1 1

12 rows × 36 columns
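With casenum and altnum in the index, all of the alternatives for a single case can be pulled out by selecting on the first index level. A brief sketch on a hypothetical frame with the same index structure:

```python
import pandas as pd

# Hypothetical values; the index structure matches the MTC example.
toy = pd.DataFrame({
    'casenum': [1, 1, 2, 2, 2],
    'altnum':  [1, 2, 1, 3, 4],
    'ivtt':    [13.4, 18.4, 29.9, 21.9, 23.0],
}).set_index(['casenum', 'altnum'])

# Selecting on the first level returns every alternative for that case.
case_two = toy.loc[2]
print(len(case_two))  # number of alternatives available to case 2
```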

[6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 22033 entries, (1, 1) to (5029, 6)
Data columns (total 36 columns):
chose       22033 non-null int64
ivtt        22033 non-null float64
ovtt        22033 non-null float64
tottime     22033 non-null float64
totcost     22033 non-null float64
hhid        22033 non-null int64
perid       22033 non-null int64
numalts     22033 non-null int64
dist        22033 non-null float64
wkzone      22033 non-null int64
hmzone      22033 non-null int64
rspopden    22033 non-null float64
rsempden    22033 non-null float64
wkpopden    22033 non-null float64
wkempden    22033 non-null float64
vehavdum    22033 non-null int64
femdum      22033 non-null int64
age         22033 non-null int64
drlicdum    22033 non-null int64
noncadum    22033 non-null int64
numveh      22033 non-null int64
hhsize      22033 non-null int64
hhinc       22033 non-null float64
famtype     22033 non-null int64
hhowndum    22033 non-null int64
numemphh    22033 non-null int64
numadlt     22033 non-null int64
nmlt5       22033 non-null int64
nm5to11     22033 non-null int64
nm12to16    22033 non-null int64
wkccbd      22033 non-null int64
wknccbd     22033 non-null int64
corredis    22033 non-null int64
vehbywrk    22033 non-null float64
vocc        22033 non-null int64
wgt         22033 non-null int64
dtypes: float64(11), int64(25)
memory usage: 6.2 MB
[7]:
d = larch.DataFrames.from_idce(df, choice='chose', crack=True)
converting data_co to <class 'numpy.float64'>
[8]:
d.info()
larch.DataFrames:
  n_cases: 5029
  n_alts: 6
  data_ce:
    - ivtt
    - ovtt
    - tottime
    - totcost
  data_co:
    - hhid
    - perid
    - numalts
    - dist
    - wkzone
    - hmzone
    - rspopden
    - rsempden
    - wkpopden
    - wkempden
    - vehavdum
    - femdum
    - age
    - drlicdum
    - noncadum
    - numveh
    - hhsize
    - hhinc
    - famtype
    - hhowndum
    - numemphh
    - numadlt
    - nmlt5
    - nm5to11
    - nm12to16
    - wkccbd
    - wknccbd
    - corredis
    - vehbywrk
    - vocc
    - wgt
  data_av: <populated>
  data_ch: chose

By setting crack to True, Larch automatically analyzed the data to find variables that do not vary within cases, and transformed those into idco format variables. If you would prefer that Larch not do this, you can omit this argument or set it to False. This matters for larger datasets (the data sample included here is only a tiny extract of the data that might be available for this kind of model), as breaking the data into separate idca and idco parts is a relatively expensive operation, and it is not actually required for most model structures.
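The test that crack performs can be mimicked in plain pandas: a column belongs in idco format when it takes only one unique value within every case. A rough sketch, again on a small hypothetical frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'casenum': [1, 1, 2, 2],
    'altnum':  [1, 2, 1, 2],
    'ivtt':    [13.4, 18.4, 29.9, 34.9],  # varies within a case -> idca
    'hhinc':   [42.5, 42.5, 17.5, 17.5],  # constant within a case -> idco
})

# Count distinct values of each column within each case; a column is
# case-invariant (idco) when that count is 1 for every case.
nunique = toy.groupby('casenum').nunique()
idco_cols = [c for c in ('ivtt', 'hhinc') if (nunique[c] == 1).all()]
```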

[9]:
d.alternative_codes()
[9]:
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64', name='altnum')
[10]:
d.alternative_names()

The set of all possible alternative codes is deduced automatically from the values in the altnum column. However, the alternative codes are not very descriptive when they are set automatically, as the csv data file does not contain enough information to tell what each alternative code number means.
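The deduction itself amounts to collecting the sorted distinct values of the alternative identifier, which is easy to replicate in pandas (sketched on hypothetical identifiers):

```python
import pandas as pd

# Hypothetical alternative identifiers drawn from an idca-format column.
altnum = pd.Series([1, 2, 3, 1, 2, 4, 6, 5])

# The set of possible alternative codes is the sorted distinct values.
alt_codes = sorted(set(altnum.tolist()))
```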

[11]:
d.set_alternative_names({
    1: 'DA',
    2: 'SR2',
    3: 'SR3+',
    4: 'Transit',
    5: 'Bike',
    6: 'Walk',
})
[12]:
d.alternative_names()
[12]:
['DA', 'SR2', 'SR3+', 'Transit', 'Bike', 'Walk']