Data for Discrete Choice
Contents
Data for Discrete Choice¶
import numpy as np
import pandas as pd
import xarray as xr
import sharrow as sh
import larch.numba as lx
Fundamental Data Formats¶
When working with discrete choice models in Larch, we will generally receive data to input into the system in one of two basic formats: the case-only (“idco”) format or the case-alternative (“idca”) format.
This are sometimes referred to as IDCase (each record contains all the information for mode choice over alternatives for a single trip) or IDCase-IDAlt (each record contains all the information for a single alternative available to each decision maker so there is one record for each alternative for each choice).
idco Format¶
In the idco case-only format, each record provides all the relevant information about an individual choice, including the variables related to the decision maker or the choice itself, as well as alternative-related variables for all available alternatives, and a variable indicating which alternative was chosen. This style of data has a variety of names in the choice modeling literature, including “IDCase”, “case-only”, and “wide”.
data_co = pd.read_csv("example-data/tiny_idco.csv", index_col="caseid")
Income | CarTime | CarCost | BusTime | BusCost | WalkTime | Chosen | |
---|---|---|---|---|---|---|---|
caseid | |||||||
1 | 30000 | 30 | 150 | 40 | 100 | 20 | Car |
2 | 30000 | 25 | 125 | 35 | 100 | 0 | Bus |
3 | 40000 | 40 | 125 | 50 | 75 | 30 | Walk |
4 | 50000 | 15 | 225 | 20 | 150 | 10 | Walk |
idca Format¶
In the idca case-alternative format, each record can include information on the variables related to the decision maker or the choice itself, the attributes of that particular alternative, and a choice variable that indicates whether the alternative was or was not chosen. This style of data has a variety of names in the choice modeling literature, including “IDCase-IDAlt”, “case-alternative”, and “tall”.
data_ca = pd.read_csv("example-data/tiny_idca.csv")
caseid | altid | Income | Time | Cost | Chosen | |
---|---|---|---|---|---|---|
0 | 1 | Car | 30000 | 30 | 150 | 1 |
1 | 1 | Bus | 30000 | 40 | 100 | 0 |
2 | 1 | Walk | 30000 | 20 | 0 | 0 |
3 | 2 | Car | 30000 | 25 | 125 | 0 |
4 | 2 | Bus | 30000 | 35 | 100 | 1 |
5 | 3 | Car | 40000 | 40 | 125 | 0 |
6 | 3 | Bus | 40000 | 50 | 75 | 0 |
7 | 3 | Walk | 40000 | 30 | 0 | 1 |
8 | 4 | Car | 50000 | 15 | 225 | 0 |
9 | 4 | Bus | 50000 | 20 | 150 | 0 |
10 | 4 | Walk | 50000 | 10 | 0 | 1 |
sparse vs dense¶
The idca
format actually has two technical variations, a sparse version and a
dense version. The table shown above is a sparse version, where any alterative that
is not available is simply missing from the data table. Thus, in caseid 2 above,
there are only 2 rows, not 3. By dropping these rows, this data storage is potentially
more efficient than the dense version. But, in cases where the number of missing alternatives
is managably small (less than half of all the data, certainly) it can be much more computationally
efficient to simply store and work with the dense array.
In Larch, these two distinct sub-types of idca data are labeled so
that the dense version labeled as idca
and the sparse version
labeled as idce
.
Data Conversion¶
Converting between idca
format data and idco
format in Python can be super easy if the alternative
id’s are stored appropriately in a two-level MultiIndex. In that case, we can simply stack
or unstack
the DataFrame, and change formats. This is typically more readily available when switching from idca
to idco
formats, as the alterative id’s typically appear in a column of the DataFrame that can be used for indexing.
data_ca.set_index(['caseid', 'altid']).unstack()
Income | Time | Cost | Chosen | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
altid | Bus | Car | Walk | Bus | Car | Walk | Bus | Car | Walk | Bus | Car | Walk |
caseid | ||||||||||||
1 | 30000.0 | 30000.0 | 30000.0 | 40.0 | 30.0 | 20.0 | 100.0 | 150.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 30000.0 | 30000.0 | NaN | 35.0 | 25.0 | NaN | 100.0 | 125.0 | NaN | 1.0 | 0.0 | NaN |
3 | 40000.0 | 40000.0 | 40000.0 | 50.0 | 40.0 | 30.0 | 75.0 | 125.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 50000.0 | 50000.0 | 50000.0 | 20.0 | 15.0 | 10.0 | 150.0 | 225.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Getting our original idco
data into idca
format is not so clean, as there’s no analagous
set_columns
method in pandas, and even if there were, the alternative codes are not typically
neatly arranged in a row of data. We can force it to work, but it’s not pretty.
forced_ca = data_co.T.set_index(
pd.MultiIndex.from_tuples([
['Car', 'Income'],
['Car','Time'],
['Car','Cost'],
['Bus','Time'],
['Bus','Cost'],
['Walk','Time'],
['Car', 'Chosen'],
], names=('alt','var'))
).T.stack(0)
forced_ca[['Chosen', 'Income']] = forced_ca[['Chosen', 'Income']].groupby("caseid").transform(
lambda x: x.fillna(x.value_counts().index[0])
)
forced_ca['Chosen'] = (
forced_ca['Chosen'] == forced_ca.index.get_level_values('alt')
).astype(float)
forced_ca
var | Chosen | Cost | Income | Time | |
---|---|---|---|---|---|
caseid | alt | ||||
1 | Bus | 0.0 | 100 | 30000 | 40 |
Car | 1.0 | 150 | 30000 | 30 | |
Walk | 0.0 | NaN | 30000 | 20 | |
2 | Bus | 1.0 | 100 | 30000 | 35 |
Car | 0.0 | 125 | 30000 | 25 | |
Walk | 0.0 | NaN | 30000 | 0 | |
3 | Bus | 0.0 | 75 | 40000 | 50 |
Car | 0.0 | 125 | 40000 | 40 | |
Walk | 1.0 | NaN | 40000 | 30 | |
4 | Bus | 0.0 | 150 | 50000 | 20 |
Car | 0.0 | 225 | 50000 | 15 | |
Walk | 1.0 | NaN | 50000 | 10 |
Practical Data Formating in Larch¶
The data formats described above are relevant when storing data in a tabular (two-dimensional) format. This is quite common and generally expected, especially for data exchange between most software tools, but Larch doesn’t require you to choose one or the other.
Instead, Larch uses a Dataset
structure based
on xarray
, to store and use a collection of relevant variables, and
each variable can be stored in either |idco| or |idca| format, as
appropriate.
dataset = lx.merge(
[
data_co[['Income', 'Chosen']].to_xarray(),
data_ca.set_index(['caseid', 'altid'])[['Time', 'Cost']].to_xarray(),
],
caseid='caseid',
alts='altid',
)
dataset
<xarray.Dataset> Dimensions: (caseid: 4, altid: 3) Coordinates: * caseid (caseid) int64 1 2 3 4 * altid (altid) object 'Bus' 'Car' 'Walk' Data variables: Income (caseid) int64 30000 30000 40000 50000 Chosen (caseid) object 'Car' 'Bus' 'Walk' 'Walk' Time (caseid, altid) float64 40.0 30.0 20.0 35.0 ... 30.0 20.0 15.0 10.0 Cost (caseid, altid) float64 100.0 150.0 0.0 100.0 ... 150.0 225.0 0.0 Attributes: _caseid_: caseid _altid_: altid
As we saw above, it’s quite easy to move from idca to idco format,
and Larch can apply those transformations automatically when loading idca
data. In the example below, note that the Income
variable has automatically
been collapsed to idco, while the other variables remain as idca.
lx.Dataset.construct.from_idca(
data_ca.set_index(['caseid', 'altid']),
)
<xarray.Dataset> Dimensions: (caseid: 4, altid: 3) Coordinates: * caseid (caseid) int64 1 2 3 4 * altid (altid) int64 1 2 3 altnames (altid) object 'Bus' 'Car' 'Walk' Data variables: Income (caseid) int64 30000 30000 40000 50000 Time (caseid, altid) int64 40 30 20 35 25 ... 40 30 20 15 10 Cost (caseid, altid) int64 100 150 0 100 125 ... 125 0 150 225 0 Chosen (caseid, altid) int64 0 1 0 1 0 -9223372036854775808 0 0 1 0 0 1 _avail_ (caseid, altid) int8 1 1 1 1 1 0 1 1 1 1 1 1 Attributes: _caseid_: caseid _altid_: altid
Loading data in sparse format is as easy as swapping out
from_idca
for
from_idce
. The resulting
dataset will have a similar collection of variables, but
each idca variable is stored in a one-dimensional array,
using a variety of the compressed sparse row data format.
lx.Dataset.construct.from_idce(
data_ca.set_index(['caseid', 'altid']),
)
<xarray.Dataset> Dimensions: (caseid: 4, _casealt_: 11, altid: 3, _caseptr_: 5) Coordinates: * caseid (caseid) int64 1 2 3 4 * altid (altid) object 'Bus' 'Car' 'Walk' alt_idx (_casealt_) int8 1 0 2 1 0 1 0 2 1 0 2 * _caseptr_ (_caseptr_) int64 0 3 5 8 11 Dimensions without coordinates: _casealt_ Data variables: Income (caseid) int64 30000 30000 40000 50000 Time (_casealt_) int64 30 40 20 25 35 40 50 30 15 20 10 Cost (_casealt_) int64 150 100 0 125 100 125 75 0 225 150 0 Chosen (_casealt_) int64 1 0 0 0 1 0 0 1 0 0 1 Attributes: _exclude_dims_: ('caseid', 'altid', '_caseptr_') _caseid_: caseid _altid_: altid _casealt_: _casealt_ _alt_idx_: alt_idx _caseptr_: _caseptr_
Data Encoding¶
For the most part, data used in the utility functions of discrete choice models enters into the utility function as part of a linear-in-parameters function. That is, we have some “data” that expresses an attribute of some part of the transportation system as a number, we multiply that by some numerical parameter that will be estimated, and we sum up the total over all the data-times-parameter operations. This kind of structure is known as “linear algebra” and it’s something computers can do super fast, as long as all the data and all the parameters are queued up in memory in the right formats. So, typically it is optimal to pre-compute the “data” part of the process into one large contiguous array of floating point values, regardless if the values otherwise seem to be binary or integers. Most tools, such as Larch, will do much of this work for you, so you don’t need to worry about it too much.
There are two notable exceptions to this guideline:
choices: the data that represents the observed choices, which are inherently categorical
availablity: data that represents the availability of each choice, which is inherently boolean
Categorical Encoding¶
When we are looking at discrete choices, it is natural to employ a categorical data type for at least the “choice” data itself, if not for other columns as well. Pandas can convert columns to categorical data simply by assigning the type “category”.
choices = data_co['Chosen'].astype("category")
choices
caseid
1 Car
2 Bus
3 Walk
4 Walk
Name: Chosen, dtype: category
Categories (3, object): ['Bus', 'Car', 'Walk']
Once we have categorical data, if we like we can work with the underlying code values instead of the original raw data.
choices.cat.codes
caseid
1 1
2 0
3 2
4 2
dtype: int8
The cat.categories
attribute contains the array of values matching each of the code.
choices.cat.categories
Index(['Bus', 'Car', 'Walk'], dtype='object')
When using astype("category")
there’s no control over the ordering of the categories. If we want
to control the apparent order (e.g. we already have codes defined elsewhere such that Car is 1, Bus is 2, and walk is 3) then we can explicitly set the category value positions using pd.CategoricalDtype
instead of "category"
.
Note that the cat.codes
numbers used internally by categoricals start with zero as standard in Python,
so if you want codes to start with 1 you need to include a dummy placeholder for zero.
choices1 = data_co['Chosen'].astype(pd.CategoricalDtype(['_','Car','Bus','Walk']))
choices1
caseid
1 Car
2 Bus
3 Walk
4 Walk
Name: Chosen, dtype: category
Categories (4, object): ['_', 'Car', 'Bus', 'Walk']
choices1.cat.codes
caseid
1 1
2 2
3 3
4 3
dtype: int8
To be clear, by asserting the placement ordering of alternative like this, we are not simultaneously asserting that the alternatives are ordinal. Put another way, we are forcing Car to be coded as 1 and Bus to be coded as 2, but we are not saying that Car is less than Bus. Pandas categoricals can allow this, by adding ordered=True
to the CategoricalDtype.
pd.CategoricalDtype(['NoCars','1Car','2Cars','3+Cars'], ordered=True)
CategoricalDtype(categories=['NoCars', '1Car', '2Cars', '3+Cars'], ordered=True)
One Hot Encoding¶
One-hot encoding, also known as dummy variables, is the creation of a seperate binary-valued column for every categorical value. We can convert a categorical data column into a set of one-hot encoded columns using the get_dummies
function.
pd.get_dummies(choices)
Bus | Car | Walk | |
---|---|---|---|
caseid | |||
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
It’s not required to have first converted the data to a categorical data type.
pd.get_dummies(data_co['Chosen'])
Bus | Car | Walk | |
---|---|---|---|
caseid | |||
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
Encoding with xarray
¶
The xarray
library doesn’t use formal “categorical” datatypes, but we can still use
the get_dummies
function to explode choice and availability data as needed.
dataset['Chosen_ca'] = lx.DataArray(
pd.get_dummies(data_co['Chosen']).rename_axis(columns="altid")
)
dataset
<xarray.Dataset> Dimensions: (caseid: 4, altid: 3) Coordinates: * caseid (caseid) int64 1 2 3 4 * altid (altid) object 'Bus' 'Car' 'Walk' Data variables: Income (caseid) int64 30000 30000 40000 50000 Chosen (caseid) object 'Car' 'Bus' 'Walk' 'Walk' Time (caseid, altid) float64 40.0 30.0 20.0 35.0 ... 20.0 15.0 10.0 Cost (caseid, altid) float64 100.0 150.0 0.0 100.0 ... 150.0 225.0 0.0 Chosen_ca (caseid, altid) uint8 0 1 0 1 0 0 0 0 1 0 0 1 Attributes: _caseid_: caseid _altid_: altid