简体   繁体   中英

Find and eliminate pair reciprocals from data in python

This a puzzling one. I am refactoring some code, so key-objective is/was to keep things simple.

The piece of code is supposed to locate existing reciprocal pairs in the date, so I can process them (in another function). It has to return the location of reciprocals, the lines of the df where they occur. Reciprocal meaning (a,b) == (b,a).

Explaining my scenario: I did my best to recreate the situation in plain data below. The original data set is indeed a list of all possible permutations of "n" elements in pairs. At that point in the code (a,b) could be a different thing vs. (b,a) hence I need to keep them there and work with all of them.

I used "random" to simulate the fact that at the point of the code where I am, some of the reciprocals have been already eliminated thru other filters/processing.

Then the function I am working in now, "reciprocals_locator", should point me to the rows of the the dataframe where reciprocals still occur and need to be further processed.

Original solution, which I am trying to refactor, is a complicated loop that manipulated the dataframe way within the loop. All of which I did not like to have in my final code. I was able to refactor that loop to create the list I need and then refactor that into a list comprehension, which is better, but nonetheless still quite unreadable.

So here goes question (1): Is there a better way/algorithm or even external/imported function to do the trick in a leaner way? Now, the above solution is somewhat incomplete in the sense it returns a list which also contains reciprocal-duplicates of the locations!

As shown below, I end up with knowing that there are duplicates in rows (1,5) but it also shows the reciprocal solution (5,1)! I initially thought this was a simple case of "pick the half" of the resulting list, but that won't do the trick probably because of the way itertools.product works, which frankly I haven't deeply studied.

Same happens on my original, looping, code implementation. Maybe there's a simpler, itertools-based solution to this I am now aware of.

**Hope all code below is simple enough to follow

### Problem scenario recreation

import itertools as it
import pandas as pd
import random as rd

## Original dataset
sourcelst = ['a','b','c','d','e']
pairs = [list(perm) for perm in it.permutations(sourcelst,2)]
df = pd.DataFrame(pairs, columns=['y','x'])

## Dataset after some sort of processing, some reciprocals are eliminated,
## but some stay and we nee to process them separately
drop_rows = [rd.randint(0, len(pairs)-1) for _ in range(2)]
df.drop(index=drop_rows, inplace=True)
df.reset_index(inplace=True, drop=True)


## Finding reciprocals

### Original LOOPS implementation, this one shows the actual pairs
### for ease of analysis
def reciprocal_pairs_orig(df):
    reciprocals = []
    row_y = 0
    while row_y < len(df.index):
        for row_x in range(len(df.index)):
            if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y]):
                
                reciprocals.append([[df['y'].iloc[row_x], df['x'].iloc[row_x]],[df['y'].iloc[row_y], df['x'].iloc[row_y]]])
        row_y += 1
    return reciprocals

### List comprehension refactor, showing the pairs
def reciprocal_pairs_refactor(df):
    return [[[df['y'].iloc[row_x], df['x'].iloc[row_x]], [df['y'].iloc[row_y], df['x'].iloc[row_y]]]
        for row_y, row_x in it.product(range(len(df.index)), range(len(df.index)))
        if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y])]


### This is the actual function that I want to use
def reciprocal_locator_orig(df):
    reciprocals = []
    row_y = 0
    while row_y < len(df.index):
        for row_x in range(len(df.index)):
            if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y]):
                
                reciprocals.append([row_y, row_x])
        row_y += 1
    return reciprocals

### List comprehension refactor
def reciprocal_locator_refactor(df):
    return [[row_y, row_x]
        for row_y, row_x in it.product(range(len(df.index)), range(len(df.index)))
        if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y])]

product(range(n),range(n)) will give you n**2 pairs - for instance, as you noted, both (1,5) and (5,1) , so it's not a good solution. Your original code, with the nested for loops, does the same.

A first improvement would be

for x in range(n-1):
    for y in range(x,n):
        ...

This would still be O(n2), but at least it would check each possible pairing only once.

But a different approach may be more efficient: loop over the rows only once, while maintaining a dict of frozenset s (normal set s are not hashable) containing the pairs you already met as keys, and the row numbers as values . For each row you simply check if the pair is already in the dict - then it's a "reciprocal" between the current line and the line whose number is stored in the dict - else you add the new pair, and so on. Soething like:

seen = {}
for row in range(len(df.index)):
    fs = frozenset([df['x'].iloc[row], df['y'].iloc[row]])
    if fs in seen:
        reciprocals.append((seen[fs],row))
    else:
        seen[fs] = row

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM