
How to return unique pairs from a dataframe based on another column's values?

I have a dataframe that looks like the following:

Date        A   B   Number
2017-01-01  a   b   0.9240
2017-01-01  b   c   0.9101
2017-01-01  d   e   0.8761
2017-01-01  c   g   0.9762
2017-01-02  b   c   0.5637
2017-01-02  c   d   0.9643

I want a dataframe with unique values across columns A and B for each day, chosen according to the value in the Number column. I think the logic would go in the following order:

  1. group the dataframe by date.
  2. compare each value in column A to every value in column B to check whether the same value appears in both.
  3. for each matching value, compare the Number values of the rows that contain it and keep the row with the highest one.
  4. return a new dataframe with the remaining unique values.

As an example, from the dataframe above, because there is a 'b' in both column A and column B on Jan 1st, 2017, I want to compare 0.9240 and 0.9101 and return the row with 0.9240 because it's higher than 0.9101.

The end product should look as follows:

Date        A   B   Number
2017-01-01  a   b   0.9240
2017-01-01  d   e   0.8761
2017-01-01  c   g   0.9762
2017-01-02  c   d   0.9643

It's a bit involved, but it can be done.

First let's ensure that the data is in the correct format:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
Date      6 non-null datetime64[ns]
A         6 non-null object
B         6 non-null object
Number    6 non-null float64
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 272.0+ bytes

Note that the Date column is of type datetime64. This is necessary because, once those values are timestamps and set as the index, pandas' resample method can be used to group the data by day.
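If the Date column was loaded as plain strings (dtype object), it can be converted first. The snippet below is a minimal sketch of that step, using a small made-up frame rather than the full dataset:

```python
import pandas as pd

# Small illustrative frame in which Date starts out as plain strings.
df = pd.DataFrame({
    "Date": ["2017-01-01", "2017-01-02"],
    "A": ["a", "b"],
    "B": ["b", "c"],
    "Number": [0.9240, 0.5637],
})

# Convert the strings to datetime64 so the column can serve as a time index.
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
```

After the conversion, df.info() should report the Date column as datetime64[ns].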

After resampling, a custom function, extract, is applied to each group. It receives one day's rows as a data frame and applies the logic described in the question. pandas' pivot_table method makes it easier to find the intersection of columns A and B. I'm not sure this is the most efficient approach, but it should be fast enough if the dataset is not too large.
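To make the intersection step concrete, here is a small sketch that finds the values shared by columns A and B for a single day. It uses plain Python sets instead of pivot_table, which is a simplification and not the approach taken in the answer's code; the frame holds the 2017-01-01 rows from the question.

```python
import pandas as pd

# The four rows for 2017-01-01 from the question.
day = pd.DataFrame({
    "A": ["a", "b", "d", "c"],
    "B": ["b", "c", "e", "g"],
    "Number": [0.9240, 0.9101, 0.8761, 0.9762],
})

# Values that occur in both column A and column B on this day.
shared = set(day["A"]) & set(day["B"])
print(sorted(shared))  # ['b', 'c']
```

Each shared value then marks a set of competing rows, of which only the one with the highest Number is kept.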

The full code looks like this:

def extract(df):
    dfs = []
    pt = df.reset_index().pivot_table('Number', columns=['A', 'B'], index='Date')

    # find any intersection of values between col A and B
    intersection = set(pt.columns.levels[0].values)\
        .intersection(set(pt.columns.levels[1].values))
    # iterate over all intersections to compare their values 
    # and choose the largest one
    for value in intersection:
        mask = (df['A'] == value) | (df['B'] == value)
        df_intersection = df[mask]\
            .sort_values('Number', ascending=False)
        # .ix was removed in pandas 1.0; take the top row by position
        dfs.append(df_intersection.iloc[[0]])

    # find all rows that do not contain any intersections
    df_rest = df[(~df['A'].isin(list(intersection))) &\
                 (~df['B'].isin(list(intersection)))]
    if (len(df_rest) > 0):
        dfs.append(df_rest)

    return pd.concat(dfs)

df.set_index('Date')\
    .resample('D')\
    .apply(extract)\
    .reset_index(level=1, drop=True)

This code results in:

            A  B  Number
Date                    
2017-01-01  a  b  0.9240
2017-01-01  c  g  0.9762
2017-01-01  d  e  0.8761
2017-01-02  c  d  0.9643

The code above assumes the dataset was loaded as follows:

import pandas as pd
from io import StringIO

data = StringIO("""\
Date        A   B   Number
2017-01-01  a   b   0.9240
2017-01-01  b   c   0.9101
2017-01-01  d   e   0.8761
2017-01-01  c   g   0.9762
2017-01-02  b   c   0.5637
2017-01-02  c   d   0.9643
""")
df = pd.read_csv(data, sep=r'\s+', parse_dates=[0])
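As an alternative sketch, the same selection logic can be written with groupby instead of resample, which skips calendar days that have no rows. This is a variant under my own assumptions, not the code from the answer: the helper name extract_day is hypothetical, nlargest replaces the sort, and a row chosen for two different shared values is kept only once.

```python
import pandas as pd

def extract_day(day):
    """Keep the best row per shared value, plus rows with no shared value."""
    shared = set(day["A"]) & set(day["B"])
    picked = []
    for value in sorted(shared):
        # Rows that mention the shared value in either column.
        touching = day[(day["A"] == value) | (day["B"] == value)]
        picked.append(touching.nlargest(1, "Number"))
    # Rows whose A and B values are both outside the shared set.
    untouched = day[~day["A"].isin(list(shared)) & ~day["B"].isin(list(shared))]
    frames = [f for f in picked + [untouched] if not f.empty]
    result = pd.concat(frames)
    # Drop rows picked more than once and restore the original row order.
    return result[~result.index.duplicated()].sort_index()

df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01"] * 4 + ["2017-01-02"] * 2),
    "A": ["a", "b", "d", "c", "b", "c"],
    "B": ["b", "c", "e", "g", "c", "d"],
    "Number": [0.9240, 0.9101, 0.8761, 0.9762, 0.5637, 0.9643],
})

result = pd.concat([extract_day(day) for _, day in df.groupby("Date")])
print(result)
```

On the sample data this keeps the rows with Number 0.9240, 0.8761, 0.9762 and 0.9643, in the order of the original frame.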
