Python pandas: summarize round trips in a dataframe

I have a dataframe (~30,000 rows) with trip counts by station code.

|station from|station to|count|
|:-----------|:---------|:----|
|20001       |20040     |55   |
|20040       |20001     |67   |
|20007       |20080     |100  |
|20080       |20007     |50   |

How can I get a dataframe in which each pair of stations appears once, with the return-trip count in its own column and the extra return-trip rows removed, like this:

|station from|station to|count|count_back|
|:-----------|:---------|:----|:---------|
|20001       |20040     |55   |67        |
|20007       |20080     |100  |50        |

My solution is:

  1. Make a duplicate of the dataframe.
  2. Build a compound key, swapping the departure and destination stations in the duplicate.
  3. Merge the two dataframes.
  4. Delete the unnecessary columns and rows.

But that seems very inefficient.
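For reference, the four steps above could look roughly like the sketch below. It is only one reading of the description, not necessarily the exact code behind it; the names `back`, `merged` and `result` are made up for illustration, and the final filter assumes every pair has both directions (a one-way trip whose departure code is the larger one would be dropped).

```python
import pandas as pd

df = pd.DataFrame({
    "station from": [20001, 20040, 20007, 20080],
    "station to":   [20040, 20001, 20080, 20007],
    "count":        [55, 67, 100, 50],
})

# Steps 1-2: copy the frame with the two station columns swapped, so each
# reversed row lines up with its forward counterpart.
back = df.rename(columns={
    "station from": "station to",
    "station to": "station from",
    "count": "count_back",
})

# Step 3: merge on the (from, to) pair.
merged = df.merge(back, on=["station from", "station to"], how="left")

# Step 4: keep one row per pair (the one with the smaller code first).
result = merged[merged["station from"] < merged["station to"]].reset_index(drop=True)
print(result)
```

The merge itself is vectorized, so the cost is mostly the extra copy and the join on two columns.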

Let's try sorting the stations within each row and then pivoting:

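import numpy as np

# the two station columns (df is the dataframe from the question)
cols = ['station from', 'station to']

# label each row as the forward ('count') or the return ('count_back') direction
df['col'] = np.where(df['station from'] < df['station to'], 'count', 'count_back')

# rearrange the stations so the smaller code always comes first
df[cols] = np.sort(df[cols], axis=1)

# pivot the two directions into separate columns
print(df.pivot(index=cols, columns='col', values='count')
   .reset_index()
)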

Output:

col  station from  station to  count  count_back
0           20001       20040     55          67
1           20007       20080    100          50
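One thing to keep in mind: `pivot` raises a `ValueError` if the same station pair and direction occur more than once, and it leaves `NaN` where a trip has no return leg. If that matters for your data, a possible variation is `pivot_table` with `aggfunc="sum"` and `fill_value=0`. The extra rows below (a repeated 20007 to 20080 trip and a one-way 3 to 4 trip) are invented purely to show the behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "station from": [20001, 20040, 20007, 20080, 20007, 3],
    "station to":   [20040, 20001, 20080, 20007, 20080, 4],
    "count":        [55, 67, 100, 50, 5, 40],
})
cols = ["station from", "station to"]

# Same labelling and sorting as above.
df["col"] = np.where(df["station from"] < df["station to"], "count", "count_back")
df[cols] = np.sort(df[cols], axis=1)

# pivot_table aggregates repeated pairs instead of raising,
# and fill_value=0 replaces the NaN of a missing direction.
summary = (
    df.pivot_table(index=cols, columns="col", values="count",
                   aggfunc="sum", fill_value=0)
      .reset_index()
)
print(summary)
```

Whether a missing return leg should show as 0 or stay `NaN` depends on what you do with the result; drop `fill_value` to keep `NaN`.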

Here is a simple solution that also handles trips with no return leg.

import pandas as pd
import numpy as np
df = pd.DataFrame({"station from":[20001,20040,20007,20080, 2, 3],
                   "station to":[20040,20001,20080,20007, 1, 4],
                   "count":[55,67,100,50, 20, 40]})
df


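# index by the station pair so the reversed pair can be looked up directly
df = df.set_index(["station from", "station to"])
# count of the reversed pair, or None (shown as NaN) if there is no return trip
df["count_back"] = df.apply(lambda row: df["count"].get(row.name[::-1]), axis=1)
# of each round-trip pair, drop the row whose departure code is the larger one
mask_rows_to_delete = df.apply(lambda row: row.name[0] > row.name[1] and row.name[::-1] in df.index, axis=1)
df = df[~mask_rows_to_delete].reset_index()
df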


This works even when the same pair of stations appears more than once, and it is quite fast (under 250 ms per million rows):

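def roundtrip(df):
    a, b, c, d = 'station from', 'station to', 'count', 'count_back'
    # rows that run from the larger station code to the smaller one
    idx = df[a] > df[b]
    # new column for return counts, initially zero (assign returns a copy)
    df = df.assign(**{d: 0})
    # for those rows, swap the stations and move count into count_back
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    # one row per station pair; duplicated entries are summed
    return df.groupby([a, b]).sum()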

On your example data (and yes, you can call .reset_index() on the result if you prefer):

>>> roundtrip(df)
                         count  count_back
station from station to                   
20001        20040          55          67
20007        20080         100          50
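To see what happens with duplicated entries and one-way trips, here is a small check. The function is repeated so the snippet runs on its own, and the `trips` data (a repeated 20001 to 20040 row plus the one-way trips 2 to 1 and 3 to 4) is made up for illustration.

```python
import pandas as pd

def roundtrip(df):
    a, b, c, d = "station from", "station to", "count", "count_back"
    idx = df[a] > df[b]
    df = df.assign(**{d: 0})
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    return df.groupby([a, b]).sum()

trips = pd.DataFrame({
    "station from": [20001, 20040, 20001, 2, 3],
    "station to":   [20040, 20001, 20040, 1, 4],
    "count":        [55, 67, 5, 20, 40],
})

out = roundtrip(trips).reset_index()
print(out)
```

Note that every pair is stored with the smaller station code first, so the one-way trip 2 to 1 comes out as the pair (1, 2) with `count` 0 and `count_back` 20. If the original direction of a one-way trip matters, this approach would need an extra step.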

Timing test:

n = 1_000_000
df = pd.DataFrame({
    'station from': np.random.randint(1000, 2000, n),
    'station to': np.random.randint(1000, 2000, n),
    'count': np.random.randint(0, 200, n),
})

%timeit roundtrip(df)
217 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(On 100K rows, it takes 32.4 ms ± 333 µs per loop.)
