Filter one data frame based on other data frame in pandas
I have two DataFrames in pandas:
import pandas as pd
df1 = pd.DataFrame({'Name': ["A", "B", "C", "C", "D", "D", "E"],
                    'start': [50, 124, 1, 159, 12, 26, 110],
                    'stop': [60, 200, 19, 200, 24, 30, 160]})
df2 = pd.DataFrame({'Name': ["B", "C", "D", "E"],
                    'start': [126, 143, 19, 159],
                    'stop': [129, 220, 27, 200]})
print(df1)
Name start stop
0 A 50 60
1 B 124 200
2 C 1 19
3 C 159 200
4 D 12 24
5 D 26 30
6 E 110 160
print(df2)
Name start stop
0 B 126 129
1 C 143 220
2 D 19 27
3 E 159 200
I want to filter df1 to remove rows based on df2 using the following criteria:
This would give:
Name start stop
0 B 124 200
1 C 159 200
2 D 12 24
3 D 26 30
4 E 110 160
Where:
Any help would be greatly appreciated!
To solve your problem, I used an SQL-like approach that mimics the following query:
SELECT
df.Name, df.start_x AS start, df.stop_x AS stop
FROM (
SELECT
df1.Name, df1.start AS start_x, df1.stop AS stop_x,
df2.start AS start_y, df2.stop AS stop_y
FROM df1
INNER JOIN df2
ON df1.Name = df2.Name
) AS df
WHERE (df.stop_y >= df.start_x) AND (df.stop_x >= df.start_y)
This query has been converted to the following code fragment, which uses the pandas.merge method. Note that you must use parentheses in the expression (df.stop_y >= df.start_x) & (df.stop_x >= df.start_y). Without them, the code throws the exception
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
import pandas as pd
df1 = pd.DataFrame({'Name': ["A", "B", "C", "C", "D", "D", "E"],
                    'start': [50, 124, 1, 159, 12, 26, 110],
                    'stop': [60, 200, 19, 200, 24, 30, 160]})
df2 = pd.DataFrame({'Name': ["B", "C", "D", "E"],
                    'start': [126, 143, 19, 159],
                    'stop': [129, 220, 27, 200]})
df = pd.merge(df1, df2, on=['Name'])
df = df[(df.stop_y >= df.start_x) & (df.stop_x >= df.start_y)]
df.rename(columns={'start_x':'start', 'stop_x':'stop'}, inplace=True)
df.drop(['start_y', 'stop_y'], axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output:
Name start stop
0 B 124 200
1 C 159 200
2 D 12 24
3 D 26 30
4 E 110 160
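On the parentheses remark above: in Python, & binds more tightly than >=, so the unparenthesized expression is parsed as a chained comparison, and the implicit `and` between its two halves asks a Series for a single truth value. A minimal sketch of that behaviour follows; the two-row frame is made up for illustration and only reuses the column names from the merged result.

```python
import pandas as pd

# Hypothetical two-row frame shaped like the merged result (illustration only).
df = pd.DataFrame({"start_x": [124, 1], "stop_x": [200, 19],
                   "start_y": [126, 143], "stop_y": [129, 220]})

# With parentheses: each comparison yields a boolean Series,
# and & combines the two Series element-wise.
mask = (df.stop_y >= df.start_x) & (df.stop_x >= df.start_y)

# Without parentheses, Python reads the expression as
#   df.stop_y >= (df.start_x & df.stop_x) >= df.start_y
# which is a chained comparison joined by an implicit `and`,
# so a Series is asked for a single truth value.
try:
    df.stop_y >= df.start_x & df.stop_x >= df.start_y
    raised = False
except ValueError:
    raised = True

print(mask.tolist(), raised)
```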
For anyone who is interested, I figured out a way to do it...
df3 = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1["Name"] == row2["Name"]:
            # Half-open integer ranges: 'stop' itself is not included.
            x = set(range(row1["start"], row1["stop"]))
            y = range(row2["start"], row2["stop"])
            if len(x.intersection(y)) > 0:
                df3.append(row1)
                break  # keep row1 only once even if several df2 rows overlap it
df3 = pd.DataFrame(df3).reset_index(drop=True)
print(df3)
Name start stop
0 B 124 200
1 C 159 200
2 D 12 24
3 D 26 30
4 E 110 160
Gets the job done, albeit a bit clumsily.
Would be interested if anyone can suggest a less messy way!
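One possible tidier variant, offered as a sketch rather than a definitive answer: the same merge idea wrapped in a helper that keeps each df1 row at most once, even when df2 has several intervals for the same Name. The function name filter_overlapping and the "_other" suffix are made up for this example, and it assumes df1 has an unnamed index and no column already called "index". It treats intervals as closed (>=), like the merge answer, so intervals that merely touch at an endpoint count as overlapping, whereas the range-based loop excludes the stop value.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ["A", "B", "C", "C", "D", "D", "E"],
                    'start': [50, 124, 1, 159, 12, 26, 110],
                    'stop': [60, 200, 19, 200, 24, 30, 160]})
df2 = pd.DataFrame({'Name': ["B", "C", "D", "E"],
                    'start': [126, 143, 19, 159],
                    'stop': [129, 220, 27, 200]})


def filter_overlapping(left, right):
    """Return rows of `left` that overlap at least one same-Name row of `right`."""
    # reset_index() turns the row labels of `left` into an "index" column,
    # so they survive the merge and can be used to select rows afterwards.
    merged = left.reset_index().merge(right, on="Name", suffixes=("", "_other"))
    overlaps = ((merged["stop_other"] >= merged["start"])
                & (merged["stop"] >= merged["start_other"]))
    keep = merged.loc[overlaps, "index"].unique()
    # isin() keeps the original row order and returns each row only once.
    return left[left.index.isin(keep)].reset_index(drop=True)


result = filter_overlapping(df1, df2)
print(result)
```

Selecting from df1 by label, instead of renaming and dropping the merged columns, also means any extra columns in df1 pass through untouched.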