Find reciprocal rows in a pandas DataFrame

I have this dataframe and need to retain only those rows that have reciprocal values in two columns (numA and numB here).

import pandas as pd

gpm = pd.DataFrame(data={
    'id':[1,2,3,4,5,6,7,8,9],
    'time':[150315,150315,150315,150315,150315,150315,150315,150315,150315],
    'numA':['A','D','C','B','A','C','A','E','D'],
    'numB':['B','C','B','A','B','D','B','A','A'],
    'antA':['MSPDV','VIELU','RMPC1','MJCIH','PALT2','M2PV3','MACIF','MACIF','VIELU'],
    'antB':['BPDV8','0GRI3','SSFDJ','SSFDJ','SSFDJ','CCPG1','0GRI3','SSFDJ','SSFDJ']
    })

I only want rows in which columns numA and numB are reciprocal, that is, all rows where the pairs (A,B), (B,A) and (C,D), (D,C) occur.

My solution, for now, involves making a list of all unique identifiers and going through each row, checking whether the actual partner is in the list of partners.

It is extremely slow (and perhaps incorrect!).

## here's my code
parties = {}
nums = gpm['numA']+gpm['numB']
for i in nums.unique():
    parties[i] = gpm['numB'][gpm['numA'] == i]
    parties[i] = gpm['numA'][gpm['numB'] == i]

new_d = gpm.iloc[[0]]
for i in np.arange(1,gpm.shape[0]):
    numa = gpm.iloc[i]['numA']
    if gpm.iloc[i]['numB'] in parties[numa]:
        new_d.append(gpm.iloc[[i]])

Could any savvy coder help speed this up? The actual file to parse is a ~15GB CSV.

Thanks

In your example, I assume the rows with id = 3, 8 and 9, which are (C, B), (E, A) and (D, A), are unwanted? If so, here's a standard way to select rows by comparing the values in numA and numB against the specific acceptable combinations:

In [5]: gpm[((gpm['numA'] == 'A') & (gpm['numB'] == 'B')) |
   ...:     ((gpm['numA'] == 'B') & (gpm['numB'] == 'A')) |
   ...:     ((gpm['numA'] == 'C') & (gpm['numB'] == 'D')) | 
   ...:     ((gpm['numA'] == 'D') & (gpm['numB'] == 'C'))
   ...: ]
Out[5]:
   id    time numA numB   antA   antB
0   1  150315    A    B  MSPDV  BPDV8
1   2  150315    D    C  VIELU  0GRI3
3   4  150315    B    A  MJCIH  SSFDJ
4   5  150315    A    B  PALT2  SSFDJ
5   6  150315    C    D  M2PV3  CCPG1
6   7  150315    A    B  MACIF  0GRI3

(Assign the result of that to new_d.)
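If the acceptable pairs are not known in advance, the same selection can probably be made without hard-coding them. The sketch below is only one possible approach, not part of the original answer: it collects every (numA, numB) pair into a set and keeps the rows whose reversed pair is also in that set. The helper names `reciprocal_rows` and `reciprocal_rows_from_csv`, and the default `chunksize`, are made up for illustration. The second helper assumes the set of distinct pairs fits in memory even when the CSV itself does not, and it reads the file twice. Note that a row with identical values in both columns would count as its own reciprocal.

```python
import pandas as pd


def reciprocal_rows(df, col_a="numA", col_b="numB"):
    """Keep rows whose (col_a, col_b) pair also occurs reversed somewhere in df."""
    # Every (a, b) pair that occurs in the frame.
    pairs = set(zip(df[col_a], df[col_b]))
    # A row qualifies when its reversed pair (b, a) is also present.
    mask = [(b, a) in pairs for a, b in zip(df[col_a], df[col_b])]
    return df[pd.Series(mask, index=df.index, dtype=bool)]


def reciprocal_rows_from_csv(path, col_a="numA", col_b="numB", chunksize=1_000_000):
    """Two-pass variant for a CSV too large to load at once (hypothetical helper)."""
    # Pass 1: read only the two key columns and collect the distinct pairs.
    pairs = set()
    for chunk in pd.read_csv(path, usecols=[col_a, col_b], chunksize=chunksize):
        pairs.update(zip(chunk[col_a], chunk[col_b]))
    # Pass 2: re-read the full rows and yield the filtered part of each chunk.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        mask = [(b, a) in pairs for a, b in zip(chunk[col_a], chunk[col_b])]
        yield chunk[pd.Series(mask, index=chunk.index, dtype=bool)]


# Sample data from the question.
gpm = pd.DataFrame(data={
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'time': [150315] * 9,
    'numA': ['A', 'D', 'C', 'B', 'A', 'C', 'A', 'E', 'D'],
    'numB': ['B', 'C', 'B', 'A', 'B', 'D', 'B', 'A', 'A'],
    'antA': ['MSPDV', 'VIELU', 'RMPC1', 'MJCIH', 'PALT2', 'M2PV3', 'MACIF', 'MACIF', 'VIELU'],
    'antB': ['BPDV8', '0GRI3', 'SSFDJ', 'SSFDJ', 'SSFDJ', 'CCPG1', '0GRI3', 'SSFDJ', 'SSFDJ'],
})

new_d = reciprocal_rows(gpm)
print(new_d['id'].tolist())  # [1, 2, 4, 5, 6, 7], the same rows as the output above
```

For the large file, the filtered chunks could be appended to an output CSV one at a time rather than concatenated, so that only one chunk is held in memory at once.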
