Fastest way to perform a complex search on a pandas DataFrame

Question

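I'm trying to work out the fastest way to perform a search and sort on a pandas DataFrame. Below are the before and after DataFrames for what I'm trying to accomplish.

Before: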

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       DEF       123     456     8000    8000
   DEF       XYZ       456     893     9999    9999
   AAA       BBB       473     917     5555    5555
   BBB       CCC       917     341     5555    5555

After the search/sort:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       XYZ       123     893     8000    9999
   AAA       CCC       473     341     5555    5555

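In this example, I'm essentially trying to filter out the "flights" that sit between the final destinations. This could presumably be done with some kind of drop-duplicates method, but what confuses me is how to handle all the columns. Is a binary search the best way to achieve this? Hints are appreciated; I'm struggling to figure this out.

Possible edge case:

What if the data is switched and the connecting airport sits in the same column in both rows?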
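To make the example reproducible, here is a minimal sketch that builds the "before" frame and lists the intermediate stops, meaning the airports that appear in both flightTo and flightFrom. The variable names (`df`, `intermediate`) are illustrative and not part of the original question.

```python
import pandas as pd

# The "before" frame from the question.
df = pd.DataFrame({
    "flightTo":   ["ABC", "DEF", "AAA", "BBB"],
    "flightFrom": ["DEF", "XYZ", "BBB", "CCC"],
    "toNum":      [123, 456, 473, 917],
    "fromNum":    [456, 893, 917, 341],
    "toCode":     [8000, 9999, 5555, 5555],
    "fromCode":   [8000, 9999, 5555, 5555],
})

# An airport is an intermediate stop if it shows up in both columns.
intermediate = sorted(set(df["flightTo"]) & set(df["flightFrom"]))
print(intermediate)  # ['BBB', 'DEF']
```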

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       DEF       123     456     8000    8000
   XYZ       DEF       893     456     9999    9999

After the search/sort:

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       XYZ       123     893     8000    9999

Logically, this case shouldn't happen. After all, how could you fly DEF-ABC and DEF-XYZ? You can't, but the "endpoints" are still ABC-XYZ.
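One way to handle this edge case is to ignore which column an airport sits in and simply count appearances: an endpoint appears once across both columns, while a connecting airport appears twice. A minimal sketch, assuming each connection is shared by exactly two rows; the frame `edge` and the name `endpoints` are illustrative.

```python
import pandas as pd

# The swapped edge case: both rows hold DEF in the same column.
edge = pd.DataFrame({
    "flight1": ["ABC", "XYZ"],
    "flight2": ["DEF", "DEF"],
})

# Stack both airport columns into one Series and count each airport.
counts = pd.concat([edge["flight1"], edge["flight2"]]).value_counts()

# Airports seen exactly once are the endpoints of the route.
endpoints = sorted(counts[counts == 1].index)
print(endpoints)  # ['ABC', 'XYZ']
```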

Answer 1

This is a network problem, so we use networkx. Note that a route can have more than two stops here, which means you can have something like NY-DC-WA-NC.
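To illustrate what grouping by connected component means for a longer chain, here is a dependency-free sketch that groups the legs of NY-DC-WA-NC with a small union-find. It is only a stand-in for the kind of result nx.connected_components produces, not the networkx implementation; `find`, `union` and the sample legs are hypothetical.

```python
# Legs of one multi-stop route plus an unrelated leg (sample data).
legs = [("NY", "DC"), ("DC", "WA"), ("WA", "NC"), ("AAA", "BBB")]

parent = {}

def find(node):
    """Return the representative of node's group (with path halving)."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

def union(a, b):
    """Merge the groups containing a and b."""
    parent[find(a)] = find(b)

for a, b in legs:
    union(a, b)

# Collect airports by representative to obtain the components.
components = {}
for node in list(parent):
    components.setdefault(find(node), set()).add(node)

groups = sorted(sorted(group) for group in components.values())
print(groups)  # [['AAA', 'BBB'], ['DC', 'NC', 'NY', 'WA']]
```

All four airports of the long route end up in one group, so every leg of that route receives the same label regardless of how many stops it has.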

import networkx as nx

# Build an undirected graph whose edges are the (flightTo, flightFrom) pairs.
G = nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# Each connected component is a set of airports linked to one another,
# i.e. one complete route.
l = list(nx.connected_components(G))

# Give every airport in a component the same integer label.
L = [dict.fromkeys(y, x) for x, y in enumerate(l)]

# Flatten the per-component dicts into a single airport -> label mapping.
d = {k: v for d in L for k, v in d.items()}

# Aggregation dict for groupby: the "to" columns take the first row of each
# group and the "from" columns take the last row.
grouppd = dict(zip(df.columns.tolist(), ['first', 'last'] * 3))

# Group the rows by component label and aggregate to get the output.
df.groupby(df.flightTo.map(d)).agg(grouppd)

Out[22]: 
         flightTo flightFrom  toNum  fromNum  toCode  fromCode
flightTo                                                      
0             ABC        XYZ    123      893    8000      9999
1             AAA        CCC    473      341    5555      5555
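Note that 'first' and 'last' pick values by row position within each group, so the result above depends on the legs of each route being listed in travel order. The sketch below illustrates this with a hand-written component map (`labels_by_airport`, hypothetical, standing in for the dict d built above) and a helper named `collapse` chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "flightTo":   ["ABC", "DEF", "AAA", "BBB"],
    "flightFrom": ["DEF", "XYZ", "BBB", "CCC"],
    "toNum":      [123, 456, 473, 917],
    "fromNum":    [456, 893, 917, 341],
    "toCode":     [8000, 9999, 5555, 5555],
    "fromCode":   [8000, 9999, 5555, 5555],
})

# Hand-written stand-in for the airport -> component label map.
labels_by_airport = {"ABC": 0, "DEF": 0, "XYZ": 0,
                     "AAA": 1, "BBB": 1, "CCC": 1}

def collapse(frame):
    """Keep the first 'to' values and the last 'from' values per component."""
    how = dict(zip(frame.columns, ["first", "last"] * 3))
    key = frame["flightTo"].map(labels_by_airport).rename("component")
    return frame.groupby(key).agg(how)

in_order = collapse(df)
reversed_rows = collapse(df.iloc[::-1])

print(in_order.loc[0, ["flightTo", "flightFrom"]].tolist())       # ['ABC', 'XYZ']
print(reversed_rows.loc[0, ["flightTo", "flightFrom"]].tolist())  # ['DEF', 'DEF']
```

With the rows reversed, the same aggregation returns the connecting airport on both sides, so sorting the legs of each route beforehand may be needed if the input order is not guaranteed.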

Installing networkx

pip: pip install networkx
Anaconda: conda install -c anaconda networkx

Answer 2

Here is a NumPy solution, which can be handy when performance matters:

import numpy as np

def remove_middle_dest(df):
    x = df.to_numpy()
    # flatten the two airport columns into a 1D array
    b = x[:, 0:2].ravel()
    _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
    # flat indices of repeated values in b (the intermediate stops)
    ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
    # flat indices of the first occurrence of each repeated value
    replace_at = (inv[:, None] == inv[ixs_drop]).argmax(0)
    # column (0 or 1) opposite the one holding the repeated value
    col = (ixs_drop % 2) ^ 1
    # column indices to read from the dropped rows: the opposite airport
    # column plus fromNum (3) and fromCode (5)
    keep_cols = np.broadcast_to([3, 5], (len(col), 2))
    ixs = np.concatenate([col[:, None], keep_cols], 1)
    # translate flat indices into row indices
    rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
    c = np.empty((len(col), 5), dtype=x.dtype)
    c[:, ::2] = x[rows_drop[:, None], ixs]
    c[:, 1::2] = x[rows_replace[:, None], [2, 4]]
    # update the kept rows in place (this modifies df), then drop the rest
    df.iloc[rows_replace, 1:] = c
    return df.drop(rows_drop)

For the proposed DataFrame, this yields the expected output:

print(df)
    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        DEF    123      456    8000      8000
1      DEF        XYZ    456      893    9999      9999
2      AAA        BBB    473      917    5555      5555
3      BBB        CCC    917      341    5555      5555

remove_middle_dest(df)

    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        XYZ    123      893    8000      9999
2      AAA        CCC    473      341    5555      5555

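This approach assumes no particular order for the rows that hold the duplicates, and the same goes for the columns (which covers the edge case described in the question). For example, with the following DataFrame (note that fromNum in the result stays 456, because the numeric values are always read from column positions 3 and 5 of the dropped row):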

    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        DEF    123      456    8000      8000
1      XYZ        DEF    893      456    9999      9999
2      AAA        BBB    473      917    5555      5555
3      BBB        CCC    917      341    5555      5555

remove_middle_dest(df)

     flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        XYZ    123      456    8000      9999
2      AAA        CCC    473      341    5555      5555
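The function hinges on np.unique with return_index and return_inverse applied to the flattened airport columns. The sketch below, with variable names chosen for illustration, shows what those arrays hold for the first example and how a flat position maps back to a row and a column.

```python
import numpy as np

# Airport columns of the first example, flattened row by row.
pairs = np.array([["ABC", "DEF"],
                  ["DEF", "XYZ"],
                  ["AAA", "BBB"],
                  ["BBB", "CCC"]])
flat = pairs.ravel()

# first_seen: flat position of each unique airport's first occurrence.
# inverse: for each flat position, the index of its airport in uniques.
uniques, first_seen, inverse = np.unique(
    flat, return_index=True, return_inverse=True)

# Every other flat position is a repeat, i.e. a connecting airport.
repeats = np.setdiff1d(np.arange(len(flat)), first_seen)

# With two columns, flat position p sits at row p // 2, column p % 2.
rows, cols = repeats // 2, repeats % 2
print(repeats.tolist(), rows.tolist(), cols.tolist())
# [2, 6] [1, 3] [0, 0]
```

Rows 1 and 3 are the ones the function drops, after copying their far-side airport and numeric values into the rows where DEF and BBB first appeared.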

2 answers

Answer 1
15 2019-05-28 14:19:09

Answer 2
6 2019-05-28 14:32:32
