Compare rows of two data-frames with unequal lengths
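I'm still new to pandas and would like feedback on the best way to approach this problem. I'm trying to compare the values of two columns from two data frames of unequal length in order to detect two cases: (1) id1 values (products) in data_set_2 that do not exist in data_set_1, and (2) (id1, id2) combinations in data_set_2 that do not exist in data_set_1.
The complication is that I want to avoid apply or explicit loops. These data sets can become extremely large (the example below is intentionally simplified), and my understanding is that there are more performant ways to handle this.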
data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"], "id2": ["1", "1", "2", "1", "1", "2"],"id3": ["1","2","3","4","5","6"]})
What I expect to get back:
1. E, F
2.
(B, 1)
(F, 2)
(C, 1)
(E, 2)
So far I have tried the following.
To get the products that do not exist in data_set_1:
data_set_2.loc[~(data_set_2.id1.isin(data_set_1.id1))]
(I'm not sure whether this is the best way to do it.)
To get the (id1, id2) combinations that do not exist in data_set_1:
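I tried an isin statement, but the unequal lengths of the two data frames seem to be a problem: pandas compares rows that share the same index label in the two data frames, and it evaluates each column independently.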
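The alignment behavior described above can be reproduced with a small sketch. It uses the two example frames from this question with their default integer index; the variable name `aligned` is only illustrative.

```python
import pandas as pd

data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({
    "id1": ["A", "B", "F", "C", "D", "E"],
    "id2": ["1", "1", "2", "1", "1", "2"],
    "id3": ["1", "2", "3", "4", "5", "6"],
})

# Passing a DataFrame to isin aligns on index AND column labels, so row i of
# data_set_2 is only compared with row i of data_set_1, one column at a time.
aligned = data_set_2[["id1", "id2"]].isin(data_set_1)
print(aligned)

# Row 4 is ("D", "1"), which does exist in data_set_1 (at row 3), yet it is
# reported as not found because data_set_1 has no row labelled 4.
print(aligned.loc[4].tolist())
```

This is why a plain `isin` against the other frame answers "is the same cell equal?" rather than "does this row exist anywhere?".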
I found that I can index on multiple column values like this:
data_set_2.set_index(["id1", "id2"], inplace=True,drop=False)
data_set_1.set_index(["id1", "id2"], inplace=True,drop=False)
which lets me do this:
~data_set_2[["id1","id2"]].isin(data_set_1)
A 1 False False
B 1 True True
F 2 True True
C 1 True True
D 1 False False
E 2 True True
Although this gives me the result I want, I cannot use it to select the rows that evaluate to True in a .loc selection:
data_set_2.loc[~data_set_2[["id1","id2"]].isin(data_set_1)]
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/bfm/lib/python/pandas/0.20.2-cp35/pandas/core/indexing.py", line 1328, in __getitem__
return self._getitem_axis(key, axis=0)
File "/usr/local/bfm/lib/python/pandas/0.20.2-cp35/pandas/core/indexing.py", line 1539, in _getitem_axis
raise ValueError('Cannot index with multidimensional key')
ValueError: Cannot index with multidimensional key
This makes me think it is not the right way to solve the problem. Any ideas on how best to achieve this?
For the first case, you can use np.setdiff1d:
import numpy as np
vals = np.setdiff1d(data_set_2.id1, data_set_1.id1)
print(vals)
['E' 'F']
For the second case, setdiff1d will not work, but a simple set difference is enough.
vals = set(data_set_2.iloc[:, :2].apply(tuple, 1)) \
- set(data_set_1.apply(tuple, 1))
print(vals)
{('B', '1'), ('C', '1'), ('E', '2'), ('F', '2')}
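Since the question asks to avoid apply, here is a hedged sketch of the same set difference expressed through `pd.MultiIndex` instead of Python tuples. It assumes a pandas version that provides `MultiIndex.from_frame` (0.24 or later); the names `keys`, `idx_1`, `idx_2`, and `missing` are illustrative and not part of the original answer.

```python
import pandas as pd

data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({
    "id1": ["A", "B", "F", "C", "D", "E"],
    "id2": ["1", "1", "2", "1", "1", "2"],
    "id3": ["1", "2", "3", "4", "5", "6"],
})

keys = ["id1", "id2"]

# Build one MultiIndex per frame from the key columns only.
idx_1 = pd.MultiIndex.from_frame(data_set_1[keys])
idx_2 = pd.MultiIndex.from_frame(data_set_2[keys])

# Boolean array: True where the (id1, id2) pair of data_set_2 is absent from data_set_1.
missing = ~idx_2.isin(idx_1)

missing_pairs = list(idx_2[missing])    # the pairs themselves, in data_set_2 order
missing_rows = data_set_2.loc[missing]  # the full rows, including id3

print(missing_pairs)
print(missing_rows)
```

Unlike the set-based version, this keeps the row order of data_set_2 and also gives access to the remaining columns of the unmatched rows.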
Alternatively, to improve on your existing approach, you could do something like this:
m = ~data_set_2[["id1","id2"]].isin(data_set_1)
print(m[m.all(1)])
id1 id2
id1 id2
B 1 True True
F 2 True True
C 1 True True
E 2 True True
vals = m[m.all(1)].index.tolist()
print(vals)
[('B', '1'), ('F', '2'), ('C', '1'), ('E', '2')]
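The ValueError in the question comes from passing a two-dimensional boolean frame to `.loc`, which expects one boolean per row. A minimal sketch of reducing the mask to one dimension and then selecting the actual rows is shown here; it uses non-inplace `set_index` copies (`ds1`, `ds2`), which are names introduced only for this illustration.

```python
import pandas as pd

data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({
    "id1": ["A", "B", "F", "C", "D", "E"],
    "id2": ["1", "1", "2", "1", "1", "2"],
    "id3": ["1", "2", "3", "4", "5", "6"],
})

# Index both frames by the (id1, id2) pair while keeping the columns.
ds1 = data_set_1.set_index(["id1", "id2"], drop=False)
ds2 = data_set_2.set_index(["id1", "id2"], drop=False)

# 2-D mask: one boolean per cell, True where the pair is missing from ds1.
mask_2d = ~ds2[["id1", "id2"]].isin(ds1)

# Collapse to one boolean per row so that .loc accepts it.
row_mask = mask_2d.all(axis=1)
missing_rows = ds2.loc[row_mask]

print(missing_rows.index.tolist())
print(missing_rows)
```

Note that this relies on the (id1, id2) pairs being unique within each frame, because `isin` with a DataFrame argument refuses duplicate index labels.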
You can try using an anti-join to get the data you need.
import pandas as pd
data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"], "id2": ["1", "1", "2", "1", "1", "2"],"id3": ["1","2","3","4","5","6"]})
# Merge the two data frames on id1, then keep only the rows flagged 'left_only' by the indicator
data_result_1 = data_set_2.merge(data_set_1.loc[:, ["id1"]], on="id1", how="outer", indicator=True)
data_result_1 = data_result_1[data_result_1['_merge'] == 'left_only']
# Merge the two data frames on id1 and id2, then keep only the rows flagged 'left_only' by the indicator
data_result_2 = data_set_2.merge(data_set_1.loc[:, ["id1", "id2"]], on=["id1", "id2"], how="outer", indicator=True)
data_result_2 = data_result_2[data_result_2['_merge'] == 'left_only']
print([tuple(x) for x in data_result_1.loc[:, ["id1"]].values])
print([tuple(x) for x in data_result_2.loc[:, ["id1", "id2"]].values])
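The two merges above follow the same pattern, so they could be wrapped in a small helper. This is only a sketch: `anti_join` is a hypothetical name, not a pandas function, and it swaps in `how="left"` plus `drop_duplicates` on the right-hand keys so that repeated keys in the second frame cannot duplicate rows of the first.

```python
import pandas as pd

def anti_join(left, right, on):
    """Return the rows of `left` whose key columns `on` have no match in `right`."""
    # Keep only the key columns of `right`, de-duplicated, so the left merge
    # produces exactly one output row per row of `left`.
    right_keys = right[on].drop_duplicates()
    merged = left.merge(right_keys, on=on, how="left", indicator=True)
    unmatched = merged[merged["_merge"] == "left_only"]
    return unmatched.drop(columns="_merge").reset_index(drop=True)

data_set_1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
data_set_2 = pd.DataFrame({
    "id1": ["A", "B", "F", "C", "D", "E"],
    "id2": ["1", "1", "2", "1", "1", "2"],
    "id3": ["1", "2", "3", "4", "5", "6"],
})

only_id1 = anti_join(data_set_2, data_set_1, ["id1"])
only_pair = anti_join(data_set_2, data_set_1, ["id1", "id2"])

print(only_id1["id1"].tolist())
print(list(zip(only_pair["id1"], only_pair["id2"])))
```

The merge-based approach stays vectorized and returns full rows of data_set_2, which may be convenient when the extra columns such as id3 are needed afterwards.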