300: Itinerary Choice Data
import gzip, os, pandas as pd
import larch.numba as lx
/home/runner/work/larch/larch/larch/larch/numba/model.py:23: UserWarning:
### larch.numba is experimental, and not feature-complete ###
the first time you import on a new system, this package will
compile optimized binaries for your machine, which may take
a little while, please be patient
warnings.warn( ### EXPERIMENTAL ### )
The example itinerary choice described here is based on data derived from a ticketing database provided by the Airlines Reporting Corporation. The data represent ten origin-destination pairs for travel in U.S. continental markets in May of 2013. Itinerary characteristics have been masked, e.g., carriers are labeled generically as “carrier X” and departure times have been aggregated into categories. A fare is provided but is not completely accurate (a random error has been added to each fare). These modifications were made to satisfy nondisclosure agreements, so that the data can be published freely for teaching and demonstration purposes. The data are generally representative of real itinerary choice data used in practice, and the results obtained from them are intuitive from a behavioral perspective, but they are not exact and should not be used for behavioral studies.
In this example we will import the air itinerary choice example dataset, starting from a csv
text file in idca format. Suppose that data file is gzipped, named “arc.csv.gz”, and
located in the current directory (use os.getcwd() to see the current directory).
with gzip.open(lx.example_file("arc"), 'rt') as previewfile:
print(*(next(previewfile) for x in range(70)))
id_case,id_alt,choice,traveler,origin,destination,nb_cnxs,elapsed_time,fare_hy,fare_ly,equipment,carrier,timeperiod
1,1,0,1,444,222,1,300,470.55658,0,1,3,7
1,2,0,1,444,222,1,345,475.92258,0,2,3,5
1,3,0,1,444,222,1,335,443.48166,0,1,3,2
1,4,0,1,444,222,1,435,433.56735,0,2,3,2
1,5,0,1,444,222,1,710,449.83664,0,2,3,2
1,6,0,1,444,222,1,380,470.45175,0,1,3,5
1,7,0,1,444,222,1,345,440.70526,0,2,3,6
1,8,0,1,444,222,1,320,474.57831,0,2,3,4
1,9,0,1,444,222,1,335,474.97363,0,2,3,3
1,10,0,1,444,222,1,335,481.98392,0,1,3,3
1,11,0,1,444,222,1,320,440.41138,0,1,3,7
1,12,0,1,444,222,1,360,455.11444,0,2,3,1
1,13,0,1,444,222,1,380,470.67239,0,1,3,4
1,14,14,1,444,222,0,215,317.4277,0,2,3,1
1,15,19,1,444,222,0,215,283.96292,0,2,3,1
1,16,19,1,444,222,0,215,285.04138,0,2,3,2
1,17,19,1,444,222,0,215,283.59644,0,2,3,2
1,18,1,1,444,222,0,220,276.40555,0,2,3,3
1,19,8,1,444,222,0,220,285.51282,0,2,3,3
1,20,10,1,444,222,0,215,313.89828,0,2,3,3
1,21,7,1,444,222,0,220,280.06757,0,2,3,4
1,22,1,1,444,222,0,220,294.53979,0,2,3,4
1,23,5,1,444,222,0,220,285.1618,0,2,3,5
1,24,1,1,444,222,0,220,287.32828,0,2,3,5
1,25,22,1,444,222,0,225,274.38611,0,2,3,6
1,26,16,1,444,222,0,225,286.12646,0,2,3,7
1,27,11,1,444,222,0,225,300.91037,0,2,3,6
1,28,5,1,444,222,0,220,301.78799,0,2,3,7
1,29,5,1,444,222,0,220,311.88431,0,2,3,7
1,30,3,1,444,222,0,215,285.65631,0,2,3,8
1,31,4,1,444,222,0,215,283.51544,0,2,3,8
1,32,0,1,444,222,1,512,467.40497,0,1,1,3
1,33,0,1,444,222,1,411,474.33835,0,1,1,2
1,34,0,1,444,222,1,508,433.01563,0,1,1,5
1,35,0,1,444,222,1,387,457.29861,0,1,1,3
1,36,0,1,444,222,1,389,461.02136,0,1,1,4
1,37,0,1,444,222,1,392,465.53665,0,1,1,5
1,38,0,1,444,222,1,389,472.26083,0,1,1,4
1,39,0,1,444,222,1,379,438.02396,0,1,1,4
1,40,0,1,444,222,1,343,474.71518,0,1,1,1
1,41,0,1,444,222,1,389,437.87329,0,1,1,4
1,42,0,1,444,222,1,405,448.78522,0,1,1,6
1,43,0,1,444,222,1,392,473.38318,0,1,1,2
1,44,0,1,444,222,1,434,444.40308,0,1,1,1
1,45,3,1,444,222,0,214,248.23685,0,2,2,6
1,46,0,1,444,222,0,223,255.85193,0,2,2,3
1,47,3,1,444,222,0,214,253.83798,0,2,2,6
1,48,0,1,444,222,0,223,239.98866,0,2,2,3
1,49,0,1,444,222,0,219,282.74249,0,1,2,7
1,50,3,1,444,222,0,221,265.04773,0,1,2,6
1,51,1,1,444,222,0,219,281.88403,0,1,2,7
1,52,0,1,444,222,0,214,252.09259,0,1,2,4
1,53,3,1,444,222,0,214,264.69473,0,1,2,4
1,54,0,1,444,222,0,215,255.55827,0,1,2,7
1,55,0,1,444,222,1,396,423.67627,0,1,2,8
1,56,0,1,444,222,0,215,278.64148,0,1,2,8
1,57,3,1,444,222,0,215,234.55371,0,1,2,1
1,58,0,1,444,222,1,578,268.89609,0,2,4,1
1,59,0,1,444,222,1,477,285.80167,0,2,4,1
1,60,0,1,444,222,1,599,259.35504,0,2,4,4
1,61,1,1,444,222,1,758,262.39859,0,2,4,4
1,62,0,1,444,222,1,476,267.64124,0,2,4,5
1,63,0,1,444,222,1,477,273.67731,0,2,4,7
1,64,0,1,444,222,1,459,283.35803,0,2,4,6
1,65,0,1,444,222,1,586,291.98431,0,2,4,3
1,66,0,1,444,222,1,618,298.26163,0,2,4,6
1,67,0,1,444,222,1,502,259.70834,0,2,4,2
2,1,3,2,444,222,1,300,0,422.4599,1,3,7
2,2,1,2,444,222,1,345,0,415.59332,2,3,5
The first line of the file contains column headers. After that, each line represents
an alternative available to one or more decision makers. In our sample data, the first 67
lines of data share an id_case of 1, indicating that they are 67 different itineraries
available to the first decision maker type. An identifier for each alternative is given in the
column id_alt, although this number is simply a sequential counter within each case
in the raw data and conveys no other information about the itinerary or its attributes.
The observed choices of the decision maker[s] are indicated in the column choice.
That column counts the number of travelers who faced this choice set and chose the itinerary
described by this row in the file.
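The row-per-alternative layout described above can be checked on a small excerpt. The following is a minimal sketch, assuming only pandas; the five data rows are copied from the preview above, and the variable names (sample_csv, alts_per_case, travelers_per_case) are chosen here for illustration.

```python
import io
import pandas as pd

# Five rows copied from the file preview (three from case 1, two from case 2).
sample_csv = """id_case,id_alt,choice,traveler,origin,destination,nb_cnxs,elapsed_time,fare_hy,fare_ly,equipment,carrier,timeperiod
1,13,0,1,444,222,1,380,470.67239,0,1,3,4
1,14,14,1,444,222,0,215,317.4277,0,2,3,1
1,15,19,1,444,222,0,215,283.96292,0,2,3,1
2,1,3,2,444,222,1,300,0,422.4599,1,3,7
2,2,1,2,444,222,1,345,0,415.59332,2,3,5
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Each row is one alternative; rows sharing an id_case form one choice set.
alts_per_case = sample.groupby("id_case")["id_alt"].count()

# `choice` counts travelers, so summing it within a case gives the number of
# travelers observed choosing among these rows (for this excerpt only).
travelers_per_case = sample.groupby("id_case")["choice"].sum()

print(alts_per_case)
print(travelers_per_case)
```

Because choice is a count rather than a 0/1 indicator, a single case can have many rows with nonzero values.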
We can load this data easily using pandas. We’ll also set the index of the resulting DataFrame to be the case and alt identifiers.
itin = pd.read_csv(lx.example_file("arc"), index_col=['id_case','id_alt'])
itin.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6023 entries, (1, 1) to (105, 51)
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 choice 6023 non-null int64
1 traveler 6023 non-null int64
2 origin 6023 non-null int64
3 destination 6023 non-null int64
4 nb_cnxs 6023 non-null int64
5 elapsed_time 6023 non-null int64
6 fare_hy 6023 non-null float64
7 fare_ly 6023 non-null float64
8 equipment 6023 non-null int64
9 carrier 6023 non-null int64
10 timeperiod 6023 non-null int64
dtypes: float64(2), int64(9)
memory usage: 545.3 KB
d = lx.Dataset.construct.from_idca(itin, crack=True)
d
<xarray.Dataset>
Dimensions:       (id_case: 105, id_alt: 127)
Coordinates:
  * id_case       (id_case) int64 1 2 3 4 5 6 7 8 ... 99 100 101 102 103 104 105
  * id_alt        (id_alt) int64 1 2 3 4 5 6 7 8 ... 121 122 123 124 125 126 127
Data variables:
    choice        (id_case, id_alt) int64 0 0 ... -9223372036854775808
    traveler      (id_case) int64 1 2 1 2 1 2 1 2 1 2 1 ... 1 2 1 2 2 1 2 2 1 2
    origin        (id_case) int64 444 444 444 444 444 ... 777 777 777 777 777
    destination   (id_case) int64 222 222 222 222 222 ... 111 111 111 111 111
    nb_cnxs       (id_case, id_alt) int64 1 1 ... -9223372036854775808
    elapsed_time  (id_case, id_alt) int64 300 345 ... -9223372036854775808
    fare_hy       (id_case, id_alt) float64 470.6 475.9 443.5 ... nan nan nan
    fare_ly       (id_case, id_alt) float64 0.0 0.0 0.0 0.0 ... nan nan nan nan
    equipment     (id_case, id_alt) int64 1 2 ... -9223372036854775808
    carrier       (id_case, id_alt) int64 3 3 ... -9223372036854775808
    timeperiod    (id_case, id_alt) int64 7 5 ... -9223372036854775808
    _avail_       (id_case, id_alt) int8 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0 0
Attributes:
    _caseid_:  id_case
    _altid_:   id_alt
By setting crack to True, Larch automatically analyzed the data to find variables that do not vary within
cases, and transformed those into idco format variables. If you would prefer that
Larch not do this, you can omit this argument or set it to False. This is particularly
important for larger datasets (the data sample included is only a tiny extract of the data
that might be available for this kind of model), as breaking the data into separate idca and idco parts is
a relatively expensive operation, and it is not actually required for most model structures.
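To make the idea of "variables that do not vary within cases" concrete, here is a minimal sketch in plain pandas. It is not Larch's implementation, only an illustration of the concept; the toy table, its rounded fare values, and the names toy, constant_cols, idco and idca are assumptions made for this example.

```python
import pandas as pd

# Toy idca table (hypothetical values): `traveler` is constant within each
# case, while `fare` differs across the alternatives of a case.
toy = pd.DataFrame({
    "id_case":  [1, 1, 1, 2, 2],
    "id_alt":   [1, 2, 3, 1, 2],
    "traveler": [1, 1, 1, 2, 2],
    "fare":     [470.6, 475.9, 443.5, 422.5, 415.6],
}).set_index(["id_case", "id_alt"])

# Count the distinct values each column takes within each case.
n_distinct = toy.groupby(level="id_case").nunique()

# A column is a candidate idco variable if it never varies within any case.
constant_cols = [c for c in toy.columns if (n_distinct[c] <= 1).all()]

# Split: one row per case for the constant columns, the rest stays idca.
idco = toy[constant_cols].groupby(level="id_case").first()
idca = toy.drop(columns=constant_cols)

print(constant_cols)
print(idco)
print(idca)
```

The within-case scan touches every column of every case, which gives some intuition for why the text describes this step as relatively expensive on large datasets.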
Also, you may note that in creating the Dataset object, the set of all
possible alternatives was deduced automatically from all the values
in the id_alt column. While the sample case we peeked at in the beginning
of the raw data file has 67 alternatives, other observations in the file have alternatives numbered
up to 127.
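The effect of cases having different numbers of alternatives can be sketched with plain pandas. This is an illustration under stated assumptions, not how Larch builds the Dataset: the toy table and the names toy, all_alts, dense_fare and avail are hypothetical, and the availability array here is derived simply from which (case, alternative) rows exist.

```python
import pandas as pd

# Toy idca table (hypothetical values): case 1 has three alternatives,
# case 2 has only two.
toy = pd.DataFrame({
    "id_case": [1, 1, 1, 2, 2],
    "id_alt":  [1, 2, 3, 1, 2],
    "fare":    [470.6, 475.9, 443.5, 422.5, 415.6],
}).set_index(["id_case", "id_alt"])

# The universal set of alternatives is the union of all observed id_alt values.
all_alts = toy.index.get_level_values("id_alt").unique().sort_values()

# Pivot to a dense cases-by-alternatives array; cells with no row become NaN.
dense_fare = toy["fare"].unstack("id_alt")

# Availability: 1 where a row exists for that (case, alternative), else 0.
avail = pd.Series(1, index=toy.index).unstack("id_alt", fill_value=0).astype("int8")

print(list(all_alts))
print(dense_fare)
print(avail)
```

In this toy example, alternative 3 does not exist for case 2, so its fare cell is NaN and its availability is 0.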