# 300: Itinerary Choice Data¶

[1]:

import larch, pandas, os, gzip
larch.__version__

[1]:

'5.4.0'


The example itinerary choice described here is based on data derived from a ticketing database provided by the Airlines Reporting Corporation. The data represent ten origin destination pairs for travel in U.S. continental markets in May of 2013. Itinerary characteristics have been masked, e.g., carriers are labeled generically as “carrier X” and departure times have been aggregated into categories. A fare is provided but is not completely accurate (a random error has been added to each fare). These modifications were made to satisfy nondisclosure agreements, so that the data can be published freely for teaching and demostration purposes. It is generally representative of real itinerary choice data used in practice, and the results obtained from this data are intuitive from a behavioral perspective, but it is not quite accurate and should not be used for behavioral studies.

In this example we will import the air itinerary choice example dataset, starting from a csv text file in idca format. Suppose that data file is gzipped, named “arc.csv.gz” and is located in the current directory (use os.getcwd() to see what is the current directory).

[2]:

from larch.data_warehouse import example_file

[3]:

with gzip.open(example_file("arc"), 'rt') as previewfile:
print(*(next(previewfile) for x in range(70)))

id_case,id_alt,choice,traveler,origin,destination,nb_cnxs,elapsed_time,fare_hy,fare_ly,equipment,carrier,timeperiod
1,1,0,1,444,222,1,300,470.55658,0,1,3,7
1,2,0,1,444,222,1,345,475.92258,0,2,3,5
1,3,0,1,444,222,1,335,443.48166,0,1,3,2
1,4,0,1,444,222,1,435,433.56735,0,2,3,2
1,5,0,1,444,222,1,710,449.83664,0,2,3,2
1,6,0,1,444,222,1,380,470.45175,0,1,3,5
1,7,0,1,444,222,1,345,440.70526,0,2,3,6
1,8,0,1,444,222,1,320,474.57831,0,2,3,4
1,9,0,1,444,222,1,335,474.97363,0,2,3,3
1,10,0,1,444,222,1,335,481.98392,0,1,3,3
1,11,0,1,444,222,1,320,440.41138,0,1,3,7
1,12,0,1,444,222,1,360,455.11444,0,2,3,1
1,13,0,1,444,222,1,380,470.67239,0,1,3,4
1,14,14,1,444,222,0,215,317.4277,0,2,3,1
1,15,19,1,444,222,0,215,283.96292,0,2,3,1
1,16,19,1,444,222,0,215,285.04138,0,2,3,2
1,17,19,1,444,222,0,215,283.59644,0,2,3,2
1,18,1,1,444,222,0,220,276.40555,0,2,3,3
1,19,8,1,444,222,0,220,285.51282,0,2,3,3
1,20,10,1,444,222,0,215,313.89828,0,2,3,3
1,21,7,1,444,222,0,220,280.06757,0,2,3,4
1,22,1,1,444,222,0,220,294.53979,0,2,3,4
1,23,5,1,444,222,0,220,285.1618,0,2,3,5
1,24,1,1,444,222,0,220,287.32828,0,2,3,5
1,25,22,1,444,222,0,225,274.38611,0,2,3,6
1,26,16,1,444,222,0,225,286.12646,0,2,3,7
1,27,11,1,444,222,0,225,300.91037,0,2,3,6
1,28,5,1,444,222,0,220,301.78799,0,2,3,7
1,29,5,1,444,222,0,220,311.88431,0,2,3,7
1,30,3,1,444,222,0,215,285.65631,0,2,3,8
1,31,4,1,444,222,0,215,283.51544,0,2,3,8
1,32,0,1,444,222,1,512,467.40497,0,1,1,3
1,33,0,1,444,222,1,411,474.33835,0,1,1,2
1,34,0,1,444,222,1,508,433.01563,0,1,1,5
1,35,0,1,444,222,1,387,457.29861,0,1,1,3
1,36,0,1,444,222,1,389,461.02136,0,1,1,4
1,37,0,1,444,222,1,392,465.53665,0,1,1,5
1,38,0,1,444,222,1,389,472.26083,0,1,1,4
1,39,0,1,444,222,1,379,438.02396,0,1,1,4
1,40,0,1,444,222,1,343,474.71518,0,1,1,1
1,41,0,1,444,222,1,389,437.87329,0,1,1,4
1,42,0,1,444,222,1,405,448.78522,0,1,1,6
1,43,0,1,444,222,1,392,473.38318,0,1,1,2
1,44,0,1,444,222,1,434,444.40308,0,1,1,1
1,45,3,1,444,222,0,214,248.23685,0,2,2,6
1,46,0,1,444,222,0,223,255.85193,0,2,2,3
1,47,3,1,444,222,0,214,253.83798,0,2,2,6
1,48,0,1,444,222,0,223,239.98866,0,2,2,3
1,49,0,1,444,222,0,219,282.74249,0,1,2,7
1,50,3,1,444,222,0,221,265.04773,0,1,2,6
1,51,1,1,444,222,0,219,281.88403,0,1,2,7
1,52,0,1,444,222,0,214,252.09259,0,1,2,4
1,53,3,1,444,222,0,214,264.69473,0,1,2,4
1,54,0,1,444,222,0,215,255.55827,0,1,2,7
1,55,0,1,444,222,1,396,423.67627,0,1,2,8
1,56,0,1,444,222,0,215,278.64148,0,1,2,8
1,57,3,1,444,222,0,215,234.55371,0,1,2,1
1,58,0,1,444,222,1,578,268.89609,0,2,4,1
1,59,0,1,444,222,1,477,285.80167,0,2,4,1
1,60,0,1,444,222,1,599,259.35504,0,2,4,4
1,61,1,1,444,222,1,758,262.39859,0,2,4,4
1,62,0,1,444,222,1,476,267.64124,0,2,4,5
1,63,0,1,444,222,1,477,273.67731,0,2,4,7
1,64,0,1,444,222,1,459,283.35803,0,2,4,6
1,65,0,1,444,222,1,586,291.98431,0,2,4,3
1,66,0,1,444,222,1,618,298.26163,0,2,4,6
1,67,0,1,444,222,1,502,259.70834,0,2,4,2
2,1,3,2,444,222,1,300,0,422.4599,1,3,7
2,2,1,2,444,222,1,345,0,415.59332,2,3,5



The first line of the file contains column headers. After that, each line represents an alternative available to one or more decision makers. In our sample data, we see the first 67 lines of data share a id_case of 1, indicating that they are 67 different itineraries available to the first decision maker type. An identidier of the alternatives is given by the number in the column id_alt, although this number is simply a sequential counter within each case in the raw data, and conveys no other information about the itinerary or its attributes. The observed choices of the decision maker[s] are indicated in the column choice. That column counts the number of travelers who face this choice set and chose the itinerary described by this row in the file.

We can load this data easily using pandas. We’ll also set the index of the resulting DataFrame to be the case and alt identifiers.

[4]:

itin = pandas.read_csv(example_file("arc"), index_col=['id_case','id_alt'])
itin.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6023 entries, (1, 1) to (105, 51)
Data columns (total 11 columns):
#   Column        Non-Null Count  Dtype
---  ------        --------------  -----
0   choice        6023 non-null   int64
1   traveler      6023 non-null   int64
2   origin        6023 non-null   int64
3   destination   6023 non-null   int64
4   nb_cnxs       6023 non-null   int64
5   elapsed_time  6023 non-null   int64
6   fare_hy       6023 non-null   float64
7   fare_ly       6023 non-null   float64
8   equipment     6023 non-null   int64
9   carrier       6023 non-null   int64
10  timeperiod    6023 non-null   int64
dtypes: float64(2), int64(9)
memory usage: 537.2 KB

[5]:

itin.head()

[5]:

choice traveler origin destination nb_cnxs elapsed_time fare_hy fare_ly equipment carrier timeperiod
id_case id_alt
1 1 0 1 444 222 1 300 470.55658 0.0 1 3 7
2 0 1 444 222 1 345 475.92258 0.0 2 3 5
3 0 1 444 222 1 335 443.48166 0.0 1 3 2
4 0 1 444 222 1 435 433.56735 0.0 2 3 2
5 0 1 444 222 1 710 449.83664 0.0 2 3 2
[6]:

d = larch.DataFrames(itin, ch='choice', crack=True, autoscale_weights=True)
d.info(1)

rescaled array of weights by a factor of 2239.980952380952

larch.DataFrames:  (not computation-ready)
n_cases: 105
n_alts: 127
data_ce: 6023 rows
- choice       (6023 non-null int64)
- nb_cnxs      (6023 non-null int64)
- elapsed_time (6023 non-null int64)
- fare_hy      (6023 non-null float64)
- fare_ly      (6023 non-null float64)
- equipment    (6023 non-null int64)
- carrier      (6023 non-null int64)
- timeperiod   (6023 non-null int64)
data_co:
- traveler    (105 non-null int64)
- origin      (105 non-null int64)
- destination (105 non-null int64)
data_av: <populated>
data_ch: choice
data_wt: computed_weight (/ 2239.980952380952)


By setting crack to True, Larch automatically analyzed the data to find variables that do not vary within cases, and transformed those into idco format variables. If you would prefer that Larch not do this you can omit this argument or set it to False. This is particularly important for larger datasets (the data sample included is only a tiny extract of the data that might be available for this kind of model), as breaking the data into seperate idca and idco parts is a relatively expensive operation, and it is not actually required for most model structures.

Also, you may note that in creating the DataFrames object, the set of all possible alternatives was deduced automatically from all the values in the altid column. You will note that, while the sample case we have peeked at in the beginning of the raw data file has 67 alternatives, there are other observations in the file with alternatives numbering up to 127.