
Compare rows of two data-frames with unequal lengths

I'm fairly new to pandas and would like some feedback on the best way to tackle the problem below. I'm trying to compare the values of two columns across two data frames of unequal lengths to find two cases:

  1. id1 in data_set_2 doesn't exist in data_set_1.
  2. (id1, id2) combination in data_set_2 does not exist in data_set_1.

The complication is that I want to avoid an apply or loop approach. These data sets can get absurdly large (the examples below are intentionally simplified), and my understanding is that there are more methodical ways to handle this.

import pandas as pd

data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"], "id2": ["1", "1", "2", "1", "1", "2"],"id3": ["1","2","3","4","5","6"]})

What I expect returned:

1. E, F

2.
(B, 1)
(F, 2)
(C, 1)
(E, 2)

What I've tried so far is the following:

To get products (id1 values) that do not exist in data_set_1:

data_set_2.loc[~(data_set_2.id1.isin(data_set_1.id1))] 

(This is where I'm not sure if this is the best way) - To get (id1, id2) combinations that do not exist in data_set_1:

I tried an isin statement, but the unequal lengths of the two dataframes appear to be an issue: pandas compares rows that share the same index label between the two dataframes, and it evaluates each column independently.
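The alignment behaviour described above can be reproduced with a small sketch. The frames `left` and `right` are made up for illustration and are not part of the original data; the point is only that passing a DataFrame to isin compares cells at matching index and column labels, while Series.isin tests plain value membership.

```python
import pandas as pd

# Hypothetical frames: the same (id1, id2) pairs, stored in a different row order.
left = pd.DataFrame({"id1": ["A", "B"], "id2": ["1", "2"]})
right = pd.DataFrame({"id1": ["B", "A"], "id2": ["2", "1"]})

# DataFrame.isin(DataFrame) aligns on index and column labels, so row 0 of
# `left` is only compared with row 0 of `right`, one column at a time.
aligned = left.isin(right)

# Series.isin treats its argument as a collection of values; row position is ignored.
by_value = left["id1"].isin(right["id1"])

print(aligned)
print(by_value)
```

Here every pair in `left` also occurs in `right`, yet the label-aligned comparison reports no matches, which is why a plain isin between the two frames does not answer the combination question.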

I found that I could index on multiple column values like this:

data_set_2.set_index(["id1", "id2"], inplace=True,drop=False)
data_set_1.set_index(["id1", "id2"], inplace=True,drop=False)

Which lets me do this:

~data_set_2[["id1","id2"]].isin(data_set_1)
A   1    False  False
B   1     True   True
F   2     True   True
C   1     True   True
D   1    False  False
E   2     True   True

Although this gives me what I want, I wasn't able to select the rows that evaluate to True in a loc selection operation:

data_set_2.loc[~data_set_2[["id1","id2"]].isin(data_set_1)]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/bfm/lib/python/pandas/0.20.2-cp35/pandas/core/indexing.py", line 1328, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/usr/local/bfm/lib/python/pandas/0.20.2-cp35/pandas/core/indexing.py", line 1539, in _getitem_axis
    raise ValueError('Cannot index with multidimensional key')
ValueError: Cannot index with multidimensional key

This made me think that this isn't the right way to approach the problem. Any ideas on how this could best be achieved?

For your first case, you can use np.setdiff1d:

import numpy as np

vals = np.setdiff1d(data_set_2.id1, data_set_1.id1)
print(vals)
['E' 'F']

For the second case, setdiff1d does not work, but a simple set difference should do well enough.

vals = set(data_set_2.iloc[:, :2].apply(tuple, 1)) \
                       -  set(data_set_1.apply(tuple, 1))
print(vals)
{('B', '1'), ('C', '1'), ('E', '2'), ('F', '2')}
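Since the question asks to avoid apply, one possible variation (not from the original answer) is to build a MultiIndex from the key columns and test membership with MultiIndex.isin. This is a minimal sketch assuming a pandas version that provides pd.MultiIndex.from_frame (0.24 or later); the names `keys_1`, `keys_2`, and `missing_mask` are introduced here for illustration.

```python
import pandas as pd

data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"],
                           "id2": ["1", "1", "2", "1", "1", "2"],
                           "id3": ["1", "2", "3", "4", "5", "6"]})

# Treat each (id1, id2) row as a single composite key.
keys_1 = pd.MultiIndex.from_frame(data_set_1[["id1", "id2"]])
keys_2 = pd.MultiIndex.from_frame(data_set_2[["id1", "id2"]])

# One boolean per row of data_set_2: True where the pair is absent from data_set_1.
missing_mask = ~keys_2.isin(keys_1)

missing_rows = data_set_2[missing_mask]        # full rows, including id3
missing_pairs = keys_2[missing_mask].tolist()  # list of (id1, id2) tuples

print(missing_pairs)
```

Because the mask is one-dimensional, it can be used directly for row selection, and the original row order of data_set_2 is kept.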

Alternatively, to improve upon your existing method, you might do something along these lines:

m = ~data_set_2[["id1","id2"]].isin(data_set_1)

print(m[m.all(1)])
          id1   id2
id1 id2
B   1    True  True
F   2    True  True
C   1    True  True
E   2    True  True

vals = m[m.all(1)].index.tolist()

print(vals)
[('B', '1'), ('F', '2'), ('C', '1'), ('E', '2')]
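The "Cannot index with multidimensional key" error in the question comes from passing a two-column boolean frame to loc. Reducing the mask to one boolean per row with all(axis=1) should make the selection work. The sketch below is an illustration under that assumption; it uses non-mutating set_index calls, and the names `indexed_1`, `indexed_2`, and `row_mask` are introduced here rather than taken from the question.

```python
import pandas as pd

data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"],
                           "id2": ["1", "1", "2", "1", "1", "2"],
                           "id3": ["1", "2", "3", "4", "5", "6"]})

# Same indexing as in the question, but returning new frames instead of using inplace=True.
indexed_1 = data_set_1.set_index(["id1", "id2"], drop=False)
indexed_2 = data_set_2.set_index(["id1", "id2"], drop=False)

# Two-dimensional mask: one boolean per cell.
m = ~indexed_2[["id1", "id2"]].isin(indexed_1)

# Collapse to one boolean per row so that loc accepts it.
row_mask = m.all(axis=1)
missing_rows = indexed_2.loc[row_mask]

print(missing_rows)
```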

You can use an anti-join to get the data you want.

import pandas as pd
data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"], "id2": ["1", "1", "2", "1", "1", "2"],"id3": ["1","2","3","4","5","6"]})

# Merge the two data frames on id1, then keep the rows the indicator marks as present only in data_set_2
data_result_1 = data_set_2.merge(data_set_1.loc[:, ["id1"]], on="id1", how="outer", indicator=True)
data_result_1 = data_result_1[data_result_1['_merge'] == 'left_only']

# Merge the two data frames on id1 and id2, then filter on the indicator in the same way
data_result_2 = data_set_2.merge(data_set_1.loc[:, ["id1", "id2"]], on=["id1", "id2"], how="outer", indicator=True)
data_result_2 = data_result_2[data_result_2['_merge'] == 'left_only']


print([tuple(x) for x in data_result_1.loc[:, ["id1"]].values])
print([tuple(x) for x in data_result_2.loc[:, ["id1", "id2"]].values])
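The two merges above follow the same pattern, so they could be folded into a small helper. The function name `anti_join` and its signature are hypothetical (pandas has no built-in method of that name); this sketch uses how="left" and drops duplicate keys on the right side first, which is an assumption made here so that repeated keys in data_set_1 cannot multiply rows.

```python
import pandas as pd

def anti_join(left, right, on):
    """Return the rows of `left` whose `on` key combination does not occur in `right`.

    Hypothetical helper: `on` is a list of column names present in both frames.
    """
    # Keep each key combination once so the merge cannot duplicate left rows.
    right_keys = right[on].drop_duplicates()
    merged = left.merge(right_keys, on=on, how="left", indicator=True)
    # "left_only" marks rows that found no partner in `right`.
    return merged.loc[merged["_merge"] == "left_only", left.columns]

data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"],
                           "id2": ["1", "1", "2", "1", "1", "2"],
                           "id3": ["1", "2", "3", "4", "5", "6"]})

missing_ids = anti_join(data_set_2, data_set_1, ["id1"])["id1"].tolist()
missing_pairs = list(
    anti_join(data_set_2, data_set_1, ["id1", "id2"])[["id1", "id2"]]
    .itertuples(index=False, name=None)
)

print(missing_ids)
print(missing_pairs)
```

A left merge keeps the row order of data_set_2, so the results come back in the order the rows appear there.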

