
Python pandas summarize round trip in dataframe

I have a dataframe (~30,000 rows) with the count of trips by station code.

|station from|station to|count|
|:-----------|:---------|:----|
|20001       |20040     |55   |
|20040       |20001     |67   |
|20007       |20080     |100  |
|20080       |20007     |50   |

How can I get a dataframe that has the number of return trips as an extra column, with the redundant return-trip rows removed, like this:

|station from|station to|count|count_back|
|:-----------|:---------|:----|:---------|
|20001       |20040     |55   |67        |
|20007       |20080     |100  |50        |

My solution is:

  1. make a duplicate of the dataframe
  2. make a compound key, changing the departure and destination stations in the duplicate dataframe
  3. do merge
  4. delete unnecessary columns and rows.

But that seems to be very inefficient.
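The four steps above can be sketched as a self-merge. This is only an illustration of the approach described in the question, not the asker's actual code; the names `back`, `merged` and `result` are made up here. It assumes every station pair appears in both directions, because the final filter keeps only rows whose "from" code is smaller, so a one-way trip recorded with the larger code first would be dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "station from": [20001, 20040, 20007, 20080],
    "station to":   [20040, 20001, 20080, 20007],
    "count":        [55, 67, 100, 50],
})

# Steps 1-2: copy the frame with the two station columns swapped,
# so each row of `back` describes the reverse direction.
back = df.rename(columns={
    "station from": "station to",
    "station to": "station from",
    "count": "count_back",
})

# Step 3: merge on the station pair.
merged = df.merge(back, on=["station from", "station to"], how="left")

# Step 4: keep one row per pair (the one whose "from" code is smaller).
result = merged[merged["station from"] < merged["station to"]].reset_index(drop=True)
print(result)
```

On 30,000 rows a merge like this is usually cheap, so the approach may be less inefficient than it looks.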

Let's try sorting the stations and pivoting:

import numpy as np

# the two station columns
cols = ['station from', 'station to']

# label each row as outbound ('count') or return ('count_back')
df['col'] = np.where(df['station from'] < df['station to'], 'count', 'count_back')

# sort each station pair so both directions share the same key
df[cols] = np.sort(df[cols], axis=1)

# pivot so each pair becomes a single row
print(df.pivot(index=cols, columns='col', values='count')
   .reset_index()
)

Output:

col  station from  station to  count  count_back
0           20001       20040     55          67
1           20007       20080    100          50
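One caveat: `DataFrame.pivot` raises an error when the same (pair, direction) combination occurs more than once. If the data may contain such duplicates, a possible variant is `pivot_table` with `aggfunc="sum"`. The sketch below uses made-up data in which the pair (20001, 20040) is listed twice and (20007, 20080) has no return trip; `fill_value=0` is an assumption about how a missing direction should be reported.

```python
import numpy as np
import pandas as pd

# Hypothetical data: (20001, 20040) appears twice, (20007, 20080) is one-way.
df = pd.DataFrame({
    "station from": [20001, 20001, 20040, 20007],
    "station to":   [20040, 20040, 20001, 20080],
    "count":        [55, 5, 67, 100],
})

cols = ["station from", "station to"]

# Label the direction before the stations are reordered.
df["col"] = np.where(df["station from"] < df["station to"], "count", "count_back")

# Sort each pair so both directions share the same key.
df[cols] = np.sort(df[cols].to_numpy(), axis=1)

# Sum duplicates instead of failing; report missing directions as 0.
out = (
    df.pivot_table(index=cols, columns="col", values="count",
                   aggfunc="sum", fill_value=0)
      .reset_index()
)
print(out)
```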

Here is a simple solution that also handles trips without a return trip.

import pandas as pd
import numpy as np
df = pd.DataFrame({"station from":[20001,20040,20007,20080, 2, 3],
                   "station to":[20040,20001,20080,20007, 1, 4],
                   "count":[55,67,100,50, 20, 40]})
df


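# index by the station pair so the reverse pair can be looked up directly
df = df.set_index(["station from", "station to"])
# count of the reverse trip; missing when there is no reverse trip
df["count_back"] = df.apply(lambda row: df["count"].get((row.name[::-1])), axis=1)
# mark a reverse row for deletion only when its forward counterpart exists
mask_rows_to_delete = df.apply(lambda row: row.name[0] > row.name[1] and row.name[::-1] in df.index, axis=1)
df = df[~mask_rows_to_delete].reset_index()
df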


This works even in the face of duplicated entries, and is quite fast (<250ms per million rows):

def roundtrip(df):
    a, b, c, d = 'station from', 'station to', 'count', 'count_back'
    idx = df[a] > df[b]
    df = df.assign(**{d: 0})
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    return df.groupby([a, b]).sum()

On your example data (and yes, you can .reset_index() if you prefer):

>>> roundtrip(df)
                         count  count_back
station from station to                   
20001        20040          55          67
20007        20080         100          50
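To illustrate the claim about duplicated entries, here is a small check on made-up data in which (20001, 20040) is listed twice and (3, 4) has no return trip. The function body is copied from above so the snippet runs on its own; the data and the name `out` are hypothetical.

```python
import pandas as pd

def roundtrip(df):
    # Same function as above, repeated so the example is self-contained.
    a, b, c, d = 'station from', 'station to', 'count', 'count_back'
    idx = df[a] > df[b]
    df = df.assign(**{d: 0})
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    return df.groupby([a, b]).sum()

# Hypothetical data: one duplicated pair and one one-way trip.
df = pd.DataFrame({
    'station from': [20001, 20001, 20040, 3],
    'station to':   [20040, 20040, 20001, 4],
    'count':        [55, 5, 67, 40],
})

out = roundtrip(df).reset_index()
print(out)
```

Duplicated rows are summed by the `groupby`, and a pair without a return trip gets `count_back` equal to 0 rather than a missing value.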

Timing test:

n = 1_000_000
df = pd.DataFrame({
    'station from': np.random.randint(1000, 2000, n),
    'station to': np.random.randint(1000, 2000, n),
    'count': np.random.randint(0, 200, n),
})

%timeit roundtrip(df)
217 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(On 100K rows, it is 32.4 ms ± 333 µs per loop)

