![](/img/trans.png)
[英]The fastest way to update (partial sum of elements with complex conditions) the pandas dataframe
[英]Fastest way to perform complex search on pandas dataframe
我试图找出在 Pandas 数据帧上执行搜索和排序的最快方法。 以下是我试图完成的数据帧之前和之后。
前:
flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555
搜索/排序后:
flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555
在这个例子中,我基本上是想过滤掉存在于最终目的地之间的“航班”。 这应该通过使用某种删除重复项方法来完成,但让我感到困惑的是如何处理所有列。 二分搜索是实现这一目标的最佳方法吗? 提示表示赞赏,努力解决这个问题。
可能的边缘情况:
如果数据被切换并且我们的端连接在同一列中怎么办?
flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999
搜索/排序后:
flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999
这种情况在逻辑上不应该发生。 毕竟你怎么能去 DEF-ABC 和 DEF-XYZ? 你不能,但“端点”仍然是 ABC-XYZ
这是网络问题,所以我们使用networkx
,注意,这里你可以有两个以上的站,这意味着你可以有像NY-DC-WA-NC
import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
# create the nx object from pandas dataframe
l=list(nx.connected_components(G))
# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]
# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others
d={k: v for d in L for k, v in d.items()}
# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output
Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555
安装网络networkx
pip install networkx
conda install -c anaconda networkx
这是一个 NumPy 解决方案,在与性能相关的情况下可能会很方便:
def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
keep_cols = np.broadcast_to([3,5],(len(col),2))
ixs = np.concatenate([col[:,None], keep_cols], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)
建议的数据帧产生预期的输出:
print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 DEF XYZ 456 893 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555
这种方法在重复所在的行方面不假设任何特定的顺序,这同样适用于列(以涵盖问题中描述的边缘情况)。 例如,如果我们使用以下数据框:
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.