# MTC Work Mode Choice Data

In [None]:
import os, gzip
import numpy as np, pandas as pd, xarray as xr
import larch.numba as lx

The MTC sample dataset is the same data used in the Self Instructing Manual {cite:p}`koppelman2006self` for discrete choice modeling:

> The San Francisco Bay Area work mode choice data set comprises 5029 home-to-work commute trips in the
> San Francisco Bay Area. The data is drawn from the San Francisco Bay Area Household Travel Survey
> conducted by the Metropolitan Transportation Commission (MTC) in the spring and fall of 1990. This
> survey included a one day travel diary for each household member older than five years and detailed
> individual and household socio-demographic information.

In this example we will import the MTC example dataset, starting from a CSV
text file in [`idca`](idca) format.  Suppose the data file is gzipped, named "MTCwork.csv.gz", and
located in the current directory (call `os.getcwd()` to see which directory that is).
For this example, we'll use the `example_file` function to locate the copy of
the file that comes with Larch.

We can take a peek at the contents of the file, examining the first 10 lines:

In [None]:
with gzip.open(lx.example_file("MTCwork.csv.gz"), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(10)))

The first line of the file contains column headers. After that, each line represents
an alternative available to a decision maker. In our sample data, the first 5
lines of data share a `casenum` of 1, indicating that they are 5 different alternatives
available to the first decision maker.  The identity of each alternative is given by the
number in the `altnum` column. The observed choice of the decision maker is
indicated in the `chose` column, with a 1 in the row of the chosen alternative.
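
To make this layout concrete, here is a minimal sketch of an idca-style table built by hand with pandas. The values are invented for illustration and are not taken from the MTC file; only the column names `casenum`, `altnum`, `chose`, and `ivtt` mirror it.

```python
import pandas as pd

# Hypothetical toy data: two decision makers, one row per available alternative.
toy = pd.DataFrame(
    {
        "casenum": [1, 1, 1, 2, 2],
        "altnum": [1, 2, 3, 1, 3],  # case 2 has no row for alternative 2
        "chose": [1, 0, 0, 0, 1],
        "ivtt": [13.4, 18.4, 20.4, 29.9, 36.0],
    }
).set_index(["casenum", "altnum"])

# Each case should have exactly one chosen alternative.
choices_per_case = toy.groupby(level="casenum")["chose"].sum()
assert (choices_per_case == 1).all()

# Count how many alternatives are listed for each case.
alts_per_case = toy.groupby(level="casenum").size()
print(alts_per_case.to_dict())
```

Because each row stands for an available alternative, a case may have fewer rows than there are alternatives overall, as the second case does here.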

We can load this data easily using pandas.  We'll also set the index of the resulting DataFrame to
be the case and alternative identifiers, `casenum` and `altnum`.



In [None]:
df = pd.read_csv(lx.example_file("MTCwork.csv.gz"), index_col=['casenum','altnum'])

In [None]:
df.head(15)

To prepare this data for use with the latest version of Larch, we'll want
to convert this DataFrame into a `larch.Dataset`.  For [`idca`](idca) format like this,
we can use the `from_idca` constructor to do so easily.

In [None]:
ds = lx.Dataset.construct.from_idca(df)
ds
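
As a rough mental model of the result, variables in [`idca`](idca) format can be pictured as two-dimensional arrays indexed by case and alternative. The sketch below uses only pandas on invented toy data; it is not what `from_idca` does internally, just an illustration of the case-by-alternative layout.

```python
import pandas as pd

# Hypothetical toy idca table; values are invented for illustration.
toy = pd.DataFrame(
    {
        "casenum": [1, 1, 1, 2, 2],
        "altnum": [1, 2, 3, 1, 3],
        "ivtt": [13.4, 18.4, 20.4, 29.9, 36.0],
    }
).set_index(["casenum", "altnum"])

# Pivot the long table into one row per case and one column per alternative.
ivtt_wide = toy["ivtt"].unstack("altnum")
print(ivtt_wide)

# Missing rows in the long table show up as NaN cells in the wide layout.
available = ivtt_wide.notna()
print(available.sum(axis=1).to_dict())
```

In this wide layout, a cell is empty wherever the long table had no row for that case and alternative.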

Larch automatically analyzes the data to find
variables that do not vary across alternatives within
cases, and transforms those into [`idco`](idco) format variables.  If you would prefer that
Larch not do this, you can set the `crack` argument to `False`.  This is particularly
relevant for larger datasets (the sample included here is only a tiny extract of the data
that might be available for this kind of model), as splitting the data into
separate [`idca`](idca) and [`idco`](idco) parts is
a relatively expensive operation, and it is not actually required for most model structures.
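
The idea behind this split can be illustrated with plain pandas. The sketch below is not Larch's implementation; it only shows, on invented toy data, how one might detect columns that are constant within each case and could therefore be stored once per case. The helper name `constant_within_case` is hypothetical.

```python
import pandas as pd

# Hypothetical toy idca table; values are invented for illustration.
toy = pd.DataFrame(
    {
        "casenum": [1, 1, 1, 2, 2],
        "altnum": [1, 2, 3, 1, 3],
        "femdum": [0, 0, 0, 1, 1],  # identical for every alternative in a case
        "ivtt": [13.4, 18.4, 20.4, 29.9, 36.0],  # varies across alternatives
    }
).set_index(["casenum", "altnum"])


def constant_within_case(frame, case_level="casenum"):
    """Return names of columns with a single distinct value in every case."""
    distinct = frame.groupby(level=case_level).nunique()
    return [col for col in frame.columns if (distinct[col] <= 1).all()]


idco_like = constant_within_case(toy)
print(idco_like)

# Case-level variables can then be reduced to one row per case.
per_case = toy[idco_like].groupby(level="casenum").first()
print(per_case)
```

The per-case scan over every column hints at why this step can become costly on large datasets.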

In [None]:
# TEST
assert ds['femdum'].dims == ('casenum',)
assert ds['femdum'].dtype.kind == 'i'
assert ds['ivtt'].dims == ('casenum','altnum')
assert ds['ivtt'].dtype.kind == 'f'
assert ds.dims == {'casenum': 5029, 'altnum': 6}
assert ds.dc.CASEID == 'casenum'
assert ds.dc.ALTID == 'altnum'

The set of all possible alternative codes is deduced automatically from the values
in the `altnum` column.  However, the alternative codes are not very descriptive when
they are set automatically, as the CSV data file does not contain enough information to
tell what each alternative code number means.  We can use the `set_altnames` method
to attach more descriptive names.

In [None]:
ds = ds.dc.set_altnames({
    1:'DA', 2:'SR2', 3:'SR3+', 4:'Transit', 5:'Bike', 6:'Walk',
})
ds

In [None]:
# TEST
assert all(ds.coords['altnames'] == ['DA', 'SR2', 'SR3+', 'Transit', 'Bike', 'Walk'])