
Filter one data frame based on another data frame in pandas

I have two DataFrames in pandas:

import pandas as pd

df1 = pd.DataFrame({'Name': ["A", "B", "C", "C","D","D","E"],
                   'start': [50, 124, 1, 159, 12, 26,110],
                   'stop': [60, 200, 19, 200, 24, 30,160]})
df2 = pd.DataFrame({'Name': ["B", "C","D","E"],
                   'start': [126, 143, 19, 159],
                   'stop': [129, 220, 27, 200]})

print(df1)

  Name  start  stop
0    A     50    60
1    B    124   200
2    C      1    19
3    C    159   200
4    D     12    24
5    D     26    30
6    E    110   160

print(df2)

  Name  start  stop
0    B    126   129
1    C    143   220
2    D     19    27
3    E    159   200

I want to filter df1 against df2, keeping only the rows that meet both of the following criteria:

  1. Name should be present in both df1 and df2
  2. The range from start to stop for a Name overlaps with the range from start to stop for that Name in the other DataFrame

This would give:

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160

Where:

  • A has been dropped as there is no A in df2
  • B is kept as the start and stop of B in df2 are nested in those of B in df1
  • One of the C's of df1 has been dropped as its values didn't overlap with df2, whereas the other was kept as it is nested in the start and stop range of C in df2
  • Both D's are kept as both have an overlap with the range of D in df2
  • E is kept as its range overlaps with E in df2
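
The overlap criterion can be written as a small predicate. This is a minimal sketch and not part of the original question: the function name `overlaps` is hypothetical, and it assumes both ranges include their endpoints.

```python
def overlaps(a_start, a_stop, b_start, b_stop):
    """Return True if [a_start, a_stop] and [b_start, b_stop] share at least one value."""
    return a_start <= b_stop and b_start <= a_stop


# Values taken from the df1 and df2 rows shown above
print(overlaps(124, 200, 126, 129))  # B: df2 range nested in df1 range -> True
print(overlaps(1, 19, 143, 220))     # first C: no overlap -> False
print(overlaps(159, 200, 143, 220))  # second C: nested in df2 range -> True
print(overlaps(12, 24, 19, 27))      # first D: partial overlap -> True
```

Two ranges overlap exactly when each one starts no later than the other one stops, which covers the nested and the partially overlapping cases with a single test.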

Any help would be greatly appreciated!

To solve your problem, I applied an SQL-like approach that mimics the following query:

SELECT
  df.Name, df.start_x AS start, df.stop_x AS stop
FROM (
  SELECT
    df1.Name, df1.start AS start_x, df1.stop AS stop_x,
              df2.start AS start_y, df2.stop AS stop_y
    FROM df1
    INNER JOIN df2
      ON df1.Name = df2.Name
) AS df
WHERE (df.stop_y >= df.start_x) AND (df.stop_x >= df.start_y)

This query has been converted to the following code fragment, which uses the pandas.merge method. Note that you must use parentheses in the expression (df.stop_y >= df.start_x) & (df.stop_x >= df.start_y). Without them, & binds more tightly than >=, so Python treats the expression as a chained comparison and the code throws the exception

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
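
The precedence issue can be reproduced in isolation. This is a small sketch with made-up Series (the names a, b, c and d are illustrative only); it assumes integer data, for which & is a valid bitwise operation.

```python
import pandas as pd

a = pd.Series([5, 1])
b = pd.Series([3, 2])
c = pd.Series([4, 4])
d = pd.Series([2, 9])

# With parentheses: two boolean Series combined element-wise
ok = (a >= b) & (c >= d)
print(ok.tolist())  # [True, False]

# Without parentheses Python parses this as a >= (b & c) >= d,
# a chained comparison that needs the truth value of a Series
try:
    a >= b & c >= d
    raised = False
except ValueError as exc:
    raised = True
    print(type(exc).__name__)  # ValueError
```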

import pandas as pd

df1 = pd.DataFrame({'Name': ["A", "B", "C", "C","D","D","E"],
                   'start': [50, 124, 1, 159, 12, 26,110],
                   'stop': [60, 200, 19, 200, 24, 30,160]})
df2 = pd.DataFrame({'Name': ["B", "C","D","E"],
                   'start': [126, 143, 19, 159],
                   'stop': [129, 220, 27, 200]})
df = pd.merge(df1, df2, on=['Name'])
df = df[(df.stop_y >= df.start_x) & (df.stop_x >= df.start_y)]
df.rename(columns={'start_x':'start', 'stop_x':'stop'}, inplace=True)
df.drop(['start_y', 'stop_y'], axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)

Output:

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160

Demo on Repl.it.
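
One caveat: if a row of df1 overlapped more than one row of df2 with the same Name, the merge above would return that row once per match. The sample data contains no such case. A possible variant, sketched here rather than taken from the original answer, carries the df1 index through the merge and uses it to select each matching row once. The suffix "_other" and the variable names merged, mask, matched and result are illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ["A", "B", "C", "C", "D", "D", "E"],
                    'start': [50, 124, 1, 159, 12, 26, 110],
                    'stop': [60, 200, 19, 200, 24, 30, 160]})
df2 = pd.DataFrame({'Name': ["B", "C", "D", "E"],
                    'start': [126, 143, 19, 159],
                    'stop': [129, 220, 27, 200]})

# Keep df1's index as a column named "index" so matches can be traced back
merged = df1.reset_index().merge(df2, on="Name", suffixes=("", "_other"))

# Same inclusive overlap test as in the answer above
mask = (merged["stop_other"] >= merged["start"]) & (merged["stop"] >= merged["start_other"])

# Select each matching df1 row once, in df1's original order
matched = merged.loc[mask, "index"].unique()
result = df1[df1.index.isin(matched)].reset_index(drop=True)
print(result)
```

Because the final selection is made on df1 itself, no renaming or dropping of the df2 columns is needed.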

For anyone who is interested, I figured out a way to do it...

df3 = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        # Only compare rows that share the same Name
        if row1["Name"] == row2["Name"]:
            # Integer values covered by each range (stop itself is excluded)
            x = set(range(row1["start"], row1["stop"]))
            y = range(row2["start"], row2["stop"])
            if len(x.intersection(y)) > 0:
                df3.append(row1)
                break  # keep each df1 row at most once
df3 = pd.DataFrame(df3).reset_index(drop=True)
print(df3)

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160
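
Note that the set-based test and the >= comparisons treat touching ranges differently: range(start, stop) excludes stop, whereas the comparisons include both endpoints. The sample data is unaffected, but the following sketch with made-up values (not from the original post) shows the difference.

```python
# Two hypothetical ranges that touch only at the value 20
a_start, a_stop = 10, 20
b_start, b_stop = 20, 30

# Set-based test: range() is half-open, so 20 is not part of the first range
set_based = len(set(range(a_start, a_stop)).intersection(range(b_start, b_stop))) > 0

# Endpoint test: both ends are treated as inclusive
endpoint_based = (b_stop >= a_start) and (a_stop >= b_start)

print(set_based)       # False
print(endpoint_based)  # True
```

Which behaviour is wanted depends on whether stop is meant to belong to the range, which the question does not specify.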

It gets the job done, albeit a bit clumsily.

I would be interested if anyone can suggest a less messy way!
