MTC Work Mode Choice Data

import larch, pandas, os, gzip
larch.__version__

'5.7.0'

The MTC sample dataset is the same data used in the Self Instructing Manual for discrete choice modeling:

The San Francisco Bay Area work mode choice data set comprises 5029 home-to-work commute trips in the San Francisco Bay Area. The data is drawn from the San Francisco Bay Area Household Travel Survey conducted by the Metropolitan Transportation Commission (MTC) in the spring and fall of 1990. This survey included a one day travel diary for each household member older than five years and detailed individual and household socio-demographic information.

from larch.data_warehouse import example_file

with gzip.open(example_file("MTCwork.csv.gz"), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(10)))

casenum,altnum,chose,ivtt,ovtt,tottime,totcost,hhid,perid,numalts,dist,wkzone,hmzone,rspopden,rsempden,wkpopden,wkempden,vehavdum,femdum,age,drlicdum,noncadum,numveh,hhsize,hhinc,famtype,hhowndum,numemphh,numadlt,nmlt5,nm5to11,nm12to16,wkccbd,wknccbd,corredis,vehbywrk,vocc,wgt
 1,1,1,13.38,2,15.38,70.63,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,2,0,18.38,2,20.38,35.32,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,3,0,20.38,2,22.38,20.18,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,4,0,25.9,15.2,41.1,115.64,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 1,5,0,40.5,2,42.5,0,2,1,2,7.69,664,726,15.52,9.96,37.26,3.48,1,0,35,1,0,4,1,42.5,7,0,1,1,0,0,0,0,0,0,4,1,1
 2,1,0,29.92,10,39.92,390.81,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,2,0,34.92,10,44.92,195.4,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,3,0,21.92,10,31.92,97.97,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1
 2,4,1,22.96,14.2,37.16,185,3,1,2,11.62,738,9,35.81,53.33,32.91,764.19,1,0,40,1,0,1,1,17.5,7,0,1,1,0,0,0,1,0,1,1,0,1

The first line of the file contains column headers. After that, each line represents an alternative available to a decision maker. In our sample data, the first 5 lines of data share a casenum of 1, indicating that they are 5 different alternatives available to the first decision maker. The identity of each alternative is given by the number in the altnum column. The observed choice of the decision maker is indicated in the chose column, with a 1 in the appropriate row.
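The layout described above can be checked with plain pandas. The following is a minimal sketch, not part of the Larch API: it rebuilds the first few preview rows inline (only the casenum, altnum, and chose columns), then counts the alternatives listed for each case and confirms that each case flags exactly one chosen row. The variable names are illustrative. Case 2 shows only 4 alternatives here because the preview was cut off after 10 lines.

```python
import io
import pandas as pd

# First rows of the preview above, trimmed to the three columns needed here.
sample = io.StringIO(
    "casenum,altnum,chose\n"
    "1,1,1\n1,2,0\n1,3,0\n1,4,0\n1,5,0\n"
    "2,1,0\n2,2,0\n2,3,0\n2,4,1\n"
)
preview = pd.read_csv(sample)

# Number of alternative rows listed for each case.
alts_per_case = preview.groupby("casenum")["altnum"].count()

# Number of rows flagged as chosen for each case; this should be exactly 1.
chosen_per_case = preview.groupby("casenum")["chose"].sum()

# The code of the chosen alternative for each case.
chosen_alt = preview.loc[preview["chose"] == 1].set_index("casenum")["altnum"]

print(alts_per_case.to_dict())
print(chosen_per_case.to_dict())
print(chosen_alt.to_dict())
```

The same checks can be run against the full file once it is loaded, which is a quick way to catch cases with no chosen alternative or with more than one.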

We can load this data easily using pandas. We’ll also set the index of the resulting DataFrame to be the case and alt identifiers.

df = pandas.read_csv(example_file("MTCwork.csv.gz"), index_col=['casenum','altnum'])

df.head(15)

		chose	ivtt	ovtt	tottime	totcost	hhid	perid	numalts	dist	wkzone	...	numadlt	nmlt5	nm5to11	nm12to16	wkccbd	wknccbd	corredis	vehbywrk	vocc	wgt
casenum	altnum
1	1	1	13.38	2.0	15.38	70.63	2	1	2	7.69	664	...	1	0	0	0	0	0	0	4.00	1	1
	2	0	18.38	2.0	20.38	35.32	2	1	2	7.69	664	...	1	0	0	0	0	0	0	4.00	1	1
	3	0	20.38	2.0	22.38	20.18	2	1	2	7.69	664	...	1	0	0	0	0	0	0	4.00	1	1
	4	0	25.90	15.2	41.10	115.64	2	1	2	7.69	664	...	1	0	0	0	0	0	0	4.00	1	1
	5	0	40.50	2.0	42.50	0.00	2	1	2	7.69	664	...	1	0	0	0	0	0	0	4.00	1	1
2	1	0	29.92	10.0	39.92	390.81	3	1	2	11.62	738	...	1	0	0	0	1	0	1	1.00	0	1
	2	0	34.92	10.0	44.92	195.40	3	1	2	11.62	738	...	1	0	0	0	1	0	1	1.00	0	1
	3	0	21.92	10.0	31.92	97.97	3	1	2	11.62	738	...	1	0	0	0	1	0	1	1.00	0	1
	4	1	22.96	14.2	37.16	185.00	3	1	2	11.62	738	...	1	0	0	0	1	0	1	1.00	0	1
	5	0	58.95	10.0	68.95	0.00	3	1	2	11.62	738	...	1	0	0	0	1	0	1	1.00	0	1
3	1	1	8.60	6.0	14.60	37.76	5	1	2	4.10	696	...	3	2	0	0	0	1	0	0.33	1	1
	2	0	13.60	6.0	19.60	18.88	5	1	2	4.10	696	...	3	2	0	0	0	1	0	0.33	1	1
	3	0	15.60	6.0	21.60	10.79	5	1	2	4.10	696	...	3	2	0	0	0	1	0	0.33	1	1
	4	0	16.87	21.4	38.27	105.00	5	1	2	4.10	696	...	3	2	0	0	0	1	0	0.33	1	1
4	1	0	30.60	8.5	39.10	417.32	6	1	2	14.58	665	...	2	1	0	0	1	0	0	1.00	0	1

15 rows × 36 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 22033 entries, (1, 1) to (5029, 6)
Data columns (total 36 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   chose     22033 non-null  int64
 1   ivtt      22033 non-null  float64
 2   ovtt      22033 non-null  float64
 3   tottime   22033 non-null  float64
 4   totcost   22033 non-null  float64
 5   hhid      22033 non-null  int64
 6   perid     22033 non-null  int64
 7   numalts   22033 non-null  int64
 8   dist      22033 non-null  float64
 9   wkzone    22033 non-null  int64
 10  hmzone    22033 non-null  int64
 11  rspopden  22033 non-null  float64
 12  rsempden  22033 non-null  float64
 13  wkpopden  22033 non-null  float64
 14  wkempden  22033 non-null  float64
 15  vehavdum  22033 non-null  int64
 16  femdum    22033 non-null  int64
 17  age       22033 non-null  int64
 18  drlicdum  22033 non-null  int64
 19  noncadum  22033 non-null  int64
 20  numveh    22033 non-null  int64
 21  hhsize    22033 non-null  int64
 22  hhinc     22033 non-null  float64
 23  famtype   22033 non-null  int64
 24  hhowndum  22033 non-null  int64
 25  numemphh  22033 non-null  int64
 26  numadlt   22033 non-null  int64
 27  nmlt5     22033 non-null  int64
 28  nm5to11   22033 non-null  int64
 29  nm12to16  22033 non-null  int64
 30  wkccbd    22033 non-null  int64
 31  wknccbd   22033 non-null  int64
 32  corredis  22033 non-null  int64
 33  vehbywrk  22033 non-null  float64
 34  vocc      22033 non-null  int64
 35  wgt       22033 non-null  int64
dtypes: float64(11), int64(25)
memory usage: 6.3 MB

d = larch.DataFrames(df, ch='chose', crack=True)
d.info()

larch.DataFrames:  (not computation-ready)
  n_cases: 5029
  n_alts: 6
  data_ce: 5 variables, 22033 rows
  data_co: 31 variables
  data_av: <populated>
  data_ch: chose
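The summary above reports 5 variables in data_ce and 31 in data_co, which is consistent with crack=True separating columns that vary across alternatives from columns that are constant within each case. The following pandas-only sketch illustrates that idea on a toy table built from values in the preview. It is an assumption about the general technique, not Larch's actual implementation, and every name in it is illustrative.

```python
import pandas as pd

# Tiny table in the same case-alternative layout, using values from the preview.
toy = pd.DataFrame(
    {
        "casenum": [1, 1, 1, 2, 2, 2],
        "altnum": [1, 2, 3, 1, 2, 3],
        "ivtt": [13.38, 18.38, 20.38, 29.92, 34.92, 21.92],
        "totcost": [70.63, 35.32, 20.18, 390.81, 195.40, 97.97],
        "dist": [7.69, 7.69, 7.69, 11.62, 11.62, 11.62],
        "age": [35, 35, 35, 40, 40, 40],
    }
).set_index(["casenum", "altnum"])

# A column is case-only if it takes a single distinct value within every case.
distinct_per_case = toy.groupby(level="casenum").nunique()
is_case_only = (distinct_per_case <= 1).all()

case_only_cols = list(is_case_only[is_case_only].index)
alt_varying_cols = list(is_case_only[~is_case_only].index)

# One row per case for the case-only variables; all rows for the rest.
toy_co = toy[case_only_cols].groupby(level="casenum").first()
toy_ce = toy[alt_varying_cols]

print(case_only_cols, toy_co.shape)
print(alt_varying_cols, toy_ce.shape)
```

Storing case-only variables once per case avoids repeating them on every alternative row, which is the kind of saving the 31-variable data_co table suggests.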

d.alternative_codes()

Int64Index([1, 2, 3, 4, 5, 6], dtype='int64', name='altnum')

d.alternative_names()

The set of all possible alternative codes is deduced automatically from the values in the altnum column. However, codes set automatically are not very descriptive, as the CSV data file does not carry enough information to say what each alternative code number means.

d.set_alternative_names({
    1: 'DA',
    2: 'SR2',
    3: 'SR3+',
    4: 'Transit',
    5: 'Bike',
    6: 'Walk',
})

d.alternative_names()

['DA', 'SR2', 'SR3+', 'Transit', 'Bike', 'Walk']
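The deduction of alternative codes and their pairing with names can also be sketched in plain pandas. This example is independent of Larch and uses a small hypothetical index rather than the real data. It assumes the names listed above correspond to codes 1 through 6 in order.

```python
import pandas as pd

# Hypothetical (casenum, altnum) pairs; not every case lists every alternative.
toy_index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 5), (2, 1), (2, 4), (2, 6), (3, 3)],
    names=["casenum", "altnum"],
)

# The set of codes is the sorted collection of distinct altnum values.
altnums = toy_index.get_level_values("altnum")
codes = sorted(int(c) for c in altnums.unique())

# Assumed pairing of codes to names, following the order shown above.
names = {1: "DA", 2: "SR2", 3: "SR3+", 4: "Transit", 5: "Bike", 6: "Walk"}

# Translate each row's code into its descriptive name.
labels = list(altnums.map(names))

print(codes)
print(labels)
```

Keying the names by code makes the pairing explicit, so it does not depend on the order in which the names are written.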
