DataFrames

A DataFrames object is essentially a collection of related pandas.DataFrame objects, which hold data features in idco and idca formats.

class larch.DataFrames(co=None, *args, ca=None, ce=None, av=None, ch=None, wt=None, data_co=None, data_ca=None, data_ce=None, data_av=None, data_ch=None, data_wt=None, alt_names=None, alt_codes=None, crack=False, av_name=None, ch_name=None, wt_name=None, av_as_ce=None, ch_as_ce=None, sys_alts=None, computational=False, caseindex_name=u'_caseid_', altindex_name=u'_altid_', autoscale_weights=False)

A structured class to hold multi-format discrete choice data.

Parameters:

co (pandas.DataFrame) – A dataframe containing idco format data, with one row per case. The index contains the caseids.

ca (pandas.DataFrame) – A dataframe containing idca format data, with one row per alternative within each case. The index should be a two-level multi-index, with the first level containing the caseids and the second level containing the altids.

av (pandas.DataFrame or pandas.Series or True, optional) – Alternative availability data. This can be given as a pandas.DataFrame in idco format, with one row per case and one column per alternative, where the index contains the caseids and the columns contain the altids. Or, it can be given as a pandas.Series in idca format, with one row per alternative within each case, and an index that is a two-level multi-index, with the first level containing the caseids and the second level containing the altids. Or, set to True to make all alternatives available for all cases. If not given, then data_av will not be defined unless it can be inferred from missing rows in ca.

ch (pandas.DataFrame or pandas.Series or str, optional) – Choice data. This can be given as a pandas.DataFrame in idco format, with one row per case and one column per alternative, where the index contains the caseids and the columns contain the altids. Or, it can be given as a pandas.Series in idca format, with one row per alternative within each case, and an index that is a two-level multi-index, with the first level containing the caseids and the second level containing the altids. Or, if given as a str, the named column is looked up in the ca dataframe and, if it appears there, used as the choice. Otherwise, if the named column is found in the co dataframe, the codes in that column are used to identify the choices. If not given, data_ch is not set.

wt (pandas.DataFrame or pandas.Series or str, optional) – Case weights. This can only be given in idco format, either as a pandas.DataFrame with a single column, or as a pandas.Series. Or, if given as a str, the named column is looked up in the co or ca dataframe and, if it appears there, used as the weight. If not given, data_wt is not set.

alt_names (Sequence[str]) – A sequence of alternative names as str.

alt_codes (Sequence[int]) – A sequence of alternative codes.

crack (bool, default False) – Whether to pre-process ca data to identify variables that do not vary within cases, and move them to a new co dataframe. This can result in more computationally efficient model estimation, but the cracking process can be slow for large data sets.

av_name (str, optional) – A name to use for the availability variable. If not given, it is inferred from the av argument if possible.

ch_name (str, optional) – A name to use for the choice variable. If not given, it is inferred from the ch argument if possible.

wt_name (str, optional) – A name to use for the weight variable. If not given, it is inferred from the wt argument if possible.

autoscale_weights (bool, default False) – Call autoscale_weights on the DataFrames after initialization. Note that this will not only scale an explicitly given wt, but will also extract implied weights from the ch.

alternative_codes(self)

The alternative codes.

alternative_names(self)

The alternative names.

set_alternative_names(self, names: Union[Mapping, Sequence])

Set the alternative names.

Parameters: names (Mapping or Sequence) – If a mapping, with keys as the codes that appear in alternative_codes, and values that are the names, these will be used. Any missing codes will be labeled with the string representation of the code. If given as a sequence, the names must be in the same order as the codes that appear in alternative_codes.
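
The mapping and sequence rules described above can be illustrated with a small stand-alone sketch. The function resolve_alt_names is hypothetical and is not part of Larch; it only mirrors the documented behavior, and the length check for sequences is an added assumption rather than something the documentation states.

```python
from collections.abc import Mapping


def resolve_alt_names(codes, names):
    """Hypothetical helper (not part of larch) mirroring the documented rules."""
    if isinstance(names, Mapping):
        # Missing codes fall back to the string representation of the code.
        return [str(names.get(code, code)) for code in codes]
    names = list(names)
    # Assumption: a sequence must supply exactly one name per code, in code order.
    if len(names) != len(codes):
        raise ValueError("need exactly one name per alternative code")
    return [str(name) for name in names]


codes = [1, 2, 3]
from_mapping = resolve_alt_names(codes, {1: "Train", 3: "Car"})
from_sequence = resolve_alt_names(codes, ["Train", "SM", "Car"])
```

With a mapping, code 2 has no entry and so is labeled with the string "2"; with a sequence, the names are matched to codes by position.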

Attributes

data_co

A pandas.DataFrame in idco format.

This DataFrame should have a simple pandas.Index as the index, where the index values are the caseids.

data_ca

A pandas.DataFrame in idca format.

This DataFrame should have a two-level MultiIndex as the index, where the first level is the caseids and the second level is the alternative codes.

n_alts

The number of alternatives.

n_cases

The number of cases.

caseindex

The indexes of the cases.
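
As a rough illustration of the index layouts that data_co and data_ca expect, the sketch below builds two small frames using only pandas. The column names and values (income, age, time, cost) are invented for illustration, and the counts computed at the end only show how the number of cases and alternatives relate to the indexes; they are not how Larch computes n_cases or n_alts.

```python
import pandas as pd

# idco layout: one row per case, with a simple index of case ids.
co = pd.DataFrame(
    {"income": [40, 75], "age": [31, 52]},
    index=pd.Index([1, 2], name="caseid"),
)

# idca layout: one row per (case, alternative) pair, with a two-level MultiIndex.
ca = pd.DataFrame(
    {"time": [13.4, 18.4, 29.9, 34.9], "cost": [70.6, 35.3, 390.8, 195.4]},
    index=pd.MultiIndex.from_product([[1, 2], [1, 2]], names=["caseid", "altid"]),
)

# Counts implied by the two index layouts.
n_cases = co.index.nunique()
n_alts = ca.index.get_level_values("altid").nunique()
```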

Examples

[1]:

import pandas as pd
import larch
from larch.data_warehouse import example_file


There are two standard example datasets included with Larch. The MTC example demonstrates working with data that is (originally) in idca format, while the swissmetro example demonstrates working with data that is in idco format.

idca

To start, we’ll load the MTC example data using pandas to create a normal DataFrame, setting a two-level MultiIndex from the case and alternative identifier columns.

[2]:

mtc_raw = pd.read_csv(example_file("MTCwork.csv.gz"), index_col=['casenum', 'altnum'])
mtc_raw.head(15)

[2]:

casenum  altnum  chose   ivtt  ovtt  tottime  totcost  hhid  perid  numalts   dist  wkzone  ...  numadlt  nmlt5  nm5to11  nm12to16  wkccbd  wknccbd  corredis  vehbywrk  vocc  wgt
      1       1      1  13.38   2.0    15.38    70.63     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
      1       2      0  18.38   2.0    20.38    35.32     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
      1       3      0  20.38   2.0    22.38    20.18     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
      1       4      0  25.90  15.2    41.10   115.64     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
      1       5      0  40.50   2.0    42.50     0.00     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
      2       1      0  29.92  10.0    39.92   390.81     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
      2       2      0  34.92  10.0    44.92   195.40     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
      2       3      0  21.92  10.0    31.92    97.97     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
      2       4      1  22.96  14.2    37.16   185.00     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
      2       5      0  58.95  10.0    68.95     0.00     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
      3       1      1   8.60   6.0    14.60    37.76     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
      3       2      0  13.60   6.0    19.60    18.88     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
      3       3      0  15.60   6.0    21.60    10.79     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
      3       4      0  16.87  21.4    38.27   105.00     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
      4       1      0  30.60   8.5    39.10   417.32     6      1        2  14.58     665  ...        2      1        0         0       1        0         0      1.00     0    1

15 rows × 36 columns

To prepare this data for use with Larch, we’ll load it into a larch.DataFrames object.

[3]:

mtc = larch.DataFrames(mtc_raw)
mtc.info()

larch.DataFrames:  (not computation-ready)
n_cases: 5029
n_alts: 6
data_ce: 36 variables, 22033 rows
data_co: <not populated>
data_av: <populated>


Because this data has a row for each available alternative and omits rows for unavailable alternatives, Larch has stored it in the sparse data_ce attribute. It has also used this information to populate the data_av attribute. The “not computation-ready” label indicates that the stored data does not all use the standard computational dtype (float64), so this dataframe isn’t ready to use for model estimation yet. Larch can fix that itself later, so there’s no need to worry.
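
To make the idea of inferring availability from missing rows concrete, here is a minimal pandas-only sketch. It is not Larch's implementation; it simply marks every (case, alternative) pair present in a small sparse frame as available and fills the absent pairs with False. The toy values are borrowed loosely from the table above.

```python
import pandas as pd

# Sparse idca (idce) data: case 2 has no row for alternative 3.
ce = pd.DataFrame(
    {"ivtt": [13.38, 18.38, 20.38, 8.60, 13.60]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)],
        names=["casenum", "altnum"],
    ),
)

# Every row that exists is an available alternative.
present = pd.Series(True, index=ce.index)

# Pivot to one row per case and one column per alternative;
# pairs with no row in ce become False (unavailable).
av = present.unstack("altnum", fill_value=False).astype(bool)
```

The result is an idco-style availability table, with one row per case and one column per alternative code.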

You might notice that data_co says “not populated”. That is because we are starting with data in idce (sparse idca) format. If we want to pre-process it to crack the data into separate idco and idce parts, we can use the crack argument. This will find all the data columns that have no within-case variance, and move them to the data_co attribute.

[4]:

mtc = larch.DataFrames(mtc_raw, crack=True)
mtc.info()

larch.DataFrames:  (not computation-ready)
n_cases: 5029
n_alts: 6
data_ce: 5 variables, 22033 rows
data_co: 31 variables
data_av: <populated>
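
The cracking step used above can be approximated in plain pandas to show what “no within-case variance” means. This is only a sketch on a toy frame, not Larch's algorithm: it counts distinct values per case for each column and treats columns that never exceed one distinct value as case-level data.

```python
import pandas as pd

ca = pd.DataFrame(
    {
        "ivtt": [13.38, 18.38, 29.92, 34.92],  # varies across alternatives
        "dist": [7.69, 7.69, 11.62, 11.62],    # constant within each case
    },
    index=pd.MultiIndex.from_product([[1, 2], [1, 2]], names=["casenum", "altnum"]),
)

by_case = ca.groupby(level="casenum")

# For each column: does any case have more than one distinct value?
varies = by_case.nunique().gt(1).any()
constant_cols = list(varies[~varies].index)
varying_cols = list(varies[varies].index)

# Constant columns collapse to one row per case (idco-like);
# the remaining columns keep the (case, alternative) index.
co_part = by_case.first()[constant_cols]
ce_part = ca[varying_cols]
```

Scanning every column within every case is what makes this kind of pre-processing slow on large data sets, which is consistent with the caution given for the crack argument.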

If we want, we can also identify which data is the “choice” at this stage, or leave that to be defined later by the Model object. To do so here, we name the data column that includes the choices.

[5]:

mtc = larch.DataFrames(mtc_raw, crack=True, ch='chose')
mtc.info()

larch.DataFrames:  (not computation-ready)
n_cases: 5029
n_alts: 6
data_ce: 5 variables, 22033 rows
data_co: 31 variables
data_av: <populated>
data_ch: chose


idco

To contrast, we’ll load the swissmetro example data, which is in idco format. Again we’ll use pandas to start by loading a normal DataFrame.

[6]:

raw = pd.read_csv(example_file('swissmetro.csv.gz')).query("PURPOSE in (1,3) and CHOICE != 0")
raw.head()

[6]:

   GROUP  SURVEY  SP  ID  PURPOSE  FIRST  TICKET  WHO  LUGGAGE  AGE  ...  TRAIN_TT  TRAIN_CO  TRAIN_HE  SM_TT  SM_CO  SM_HE  SM_SEATS  CAR_TT  CAR_CO  CHOICE
0      2       0   1   1        1      0       1    1        0    3  ...       112        48       120     63     52     20         0     117      65       2
1      2       0   1   1        1      0       1    1        0    3  ...       103        48        30     60     49     10         0     117      84       2
2      2       0   1   1        1      0       1    1        0    3  ...       130        48        60     67     58     30         0     117      52       2
3      2       0   1   1        1      0       1    1        0    3  ...       103        40        30     63     52     20         0      72      52       2
4      2       0   1   1        1      0       1    1        0    3  ...       130        36        60     63     42     20         0      90      84       2

5 rows × 28 columns

We can create a basic DataFrames object simply by giving this raw data to the constructor.

[7]:

sm = larch.DataFrames(raw)
sm.info()

larch.DataFrames:  (not computation-ready)
n_cases: 6768
n_alts: 0
data_ca: <not populated>
data_co: 28 variables


When we loaded the idca example data above, Larch automatically detected the set of alternatives based on the data. With idco data, we cannot infer the alternatives without some additional context. One way to do that is to give alternative id codes explicitly.

[8]:

sm = larch.DataFrames(raw, alt_codes=[1,2,3])
sm.info()

larch.DataFrames:  (not computation-ready)
n_cases: 6768
n_alts: 3
data_ca: <not populated>
data_co: 28 variables


Larch can also infer the alternative codes if we identify the column containing the choices. (Note this only works if every alternative is chosen at least once in the data; otherwise, the inferred alternative codes will be incomplete.)

[9]:

sm = larch.DataFrames(raw, ch="CHOICE")
sm.info()

larch.DataFrames:  (not computation-ready)
n_cases: 6768
n_alts: 3
data_ca: <not populated>
data_co: 28 variables
data_ch: CHOICE
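
The caveat about inferring alternative codes from a choice column can be seen with a small pandas-only sketch. The toy data below is invented and the code is not Larch's implementation: alternative 3 is never chosen, so reading the codes off the choice column misses it, while an explicit code list keeps it.

```python
import pandas as pd

# Toy idco data in which alternative 3 is never chosen.
raw = pd.DataFrame({"CHOICE": [2, 2, 1, 2, 1]})

# Codes inferred from the observed choices alone are incomplete here.
inferred_codes = sorted(raw["CHOICE"].unique().tolist())

# With explicit codes, the choice column expands to one indicator
# column per alternative, including the never-chosen one.
alt_codes = [1, 2, 3]
ch = pd.DataFrame(
    {code: (raw["CHOICE"] == code).astype(float) for code in alt_codes},
    index=raw.index,
)
```

When a data set might not include every alternative among the observed choices, passing alt_codes explicitly, as in the earlier example, avoids this problem.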