
Using pandas, how to filter rows with similar values in two columns

I have a big dataframe (~10 million rows). Each row has:

  • category
  • start position
  • end position

If two rows are in the same category and their start and end positions overlap within a ±5 tolerance, I want to keep just one of the rows. For example:

1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25

I want to filter out either row 1 or row 2.

What I'm doing right now isn't very efficient:

import pandas as pd

params = {'min_distance': 5}  # the ±5 tolerance
df = pd.read_csv('data.csv', sep='\t', header=None,
                 names=['id', 'category', 'start', 'end'])

dfs = {}  # one sub-dataframe per category
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

rows = []
discard = set()
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    # rows of the same category within the tolerance on both columns
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) &
               (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.update(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)

I've also tried a different approach making use of a sorted version of the dataframe.

my_index = 0
indexes = []  # positional indexes of the rows to keep
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[my_index]
    next_index = 1
    # scan forward while the following rows are near-duplicates of this one
    while my_index + next_index < total_len:
        second_row = df.iloc[my_index + next_index]
        c1 = row.category == second_row.category
        c2 = abs(second_row.sstart - row.sstart) <= params['min_distance']
        c3 = abs(second_row.send - row.send) <= params['min_distance']
        if not (c1 and c2 and c3):
            break
        next_index += 1
    indexes.append(my_index)  # keep the current row
    my_index += next_index    # jump past its duplicates
indexes.append(total_len - 1)

The problem is that this solution is not perfect: it sometimes misses a row, because the overlapping row can be several rows ahead rather than the next one.

I'm looking for ideas on how to approach this problem in a more pandas-friendly way, if one exists.

The approach here should be this:

  1. pandas.groupby by category
  2. apply(Func) on the groupby result (agg can only return one value per group, while here whole rows must be kept)
  3. Func should implement the logic of finding the best range inside each category (sorted search, balanced trees or anything else); see the sketch after this list
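A minimal sketch of that idea, assuming the sstart/send column names used elsewhere in the question and the ±5 tolerance (dedup is an illustrative helper, not a pandas built-in):

import pandas as pd

TOL = 5  # the ±5 tolerance from the question

def dedup(group):
    # keep a row only if no already-kept row in this category
    # is within TOL on both sstart and send
    group = group.sort_values('sstart')
    kept = []
    for _, row in group.iterrows():
        if not any(abs(row.sstart - k.sstart) <= TOL and
                   abs(row.send - k.send) <= TOL for k in kept):
            kept.append(row)
    return pd.DataFrame(kept)

result = df.groupby('category', group_keys=False).apply(dedup)

This is still quadratic per category in the worst case, but sorting by sstart makes it possible to terminate the inner scan early, as the last answer below suggests.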

Do you want to merge all similar rows or only two consecutive ones? If all similar, I suggest you first sort the rows by category, then by the two other columns, and squash the similar ones into a single row. If only two consecutive, check whether the next value is in the range you set and, if so, merge it. Here you can see how:

merge rows pandas dataframe based on condition
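For the consecutive-only variant, a minimal sketch using shift() to compare each row with its predecessor after sorting (the column names and the ±5 tolerance are assumptions carried over from the question):

TOL = 5
df = df.sort_values(['category', 'sstart', 'send'])
prev = df.shift(1)  # previous row, aligned with the current one
dup = ((df['category'] == prev['category'])
       & ((df['sstart'] - prev['sstart']).abs() <= TOL)
       & ((df['send'] - prev['send']).abs() <= TOL))
df = df[~dup]  # drop rows that duplicate their immediate predecessor

Note this only catches duplicates that end up adjacent after sorting, which is exactly the limitation the question describes.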

I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:

dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows. The outputs will be combined back into a single dataframe, whose index will be a MultiIndex with the value of "category" as an outer level; this can be removed with droplevel(0).

Secondly, within each category you can sort by the first of the two numeric columns for another small speed-up:

def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...

This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
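Putting the two suggestions together, a hedged sketch of what drop_duplicates could look like (min_distance stands in for the question's params['min_distance']; the early break is what the sort buys you):

def drop_duplicates(df, min_distance=5):
    # drop rows whose sstart and send both fall within min_distance
    # of a row that has already been kept
    df = df.sort_values('sstart')
    starts = df['sstart'].to_numpy()
    ends = df['send'].to_numpy()
    keep = []  # positions (in the sorted frame) of kept rows
    for i in range(len(df)):
        duplicate = False
        for j in reversed(keep):
            if starts[i] - starts[j] > min_distance:
                break  # sorted by sstart: earlier kept rows are even farther
            if abs(ends[i] - ends[j]) <= min_distance:
                duplicate = True
                break
        if not duplicate:
            keep.append(i)
    return df.iloc[keep]

df = df.groupby('category').apply(drop_duplicates).droplevel(0)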
