How to return unique pairs from a DataFrame based on the value of another column?

Question

I have a DataFrame that looks like this:

Date        A   B   Number
2017-01-01  a   b   0.9240
2017-01-01  b   c   0.9101
2017-01-01  d   e   0.8761
2017-01-01  c   g   0.9762
2017-01-02  b   c   0.5637
2017-01-02  c   d   0.9643

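For each day, I want a DataFrame in which every value is unique across columns A and B, with the row to keep chosen by the value in the Number column. I think the logic would go in the following order:

Group the DataFrame by date.
Compare each value in column A with each value in column B to check whether any value appears in both.
For all matching values, compare the Number column and find the larger of the two values.
Return a new DataFrame with unique values.

For example, in the DataFrame above, 'b' appears in both column A and column B on 2017-01-01, so I want to compare 0.9240 and 0.9101 and return the row with 0.9240, because it is higher than 0.9101.

The final result should look like this: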

Date        A   B   Number
2017-01-01  a   b   0.9240
2017-01-01  d   e   0.8761
2017-01-01  c   g   0.9762
2017-01-02  c   d   0.9643

Answer 1

This is complicated, but it can definitely be done.

First, let's make sure the data is in the right format:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
Date      6 non-null datetime64[ns]
A         6 non-null object
B         6 non-null object
Number    6 non-null float64
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 272.0+ bytes

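Note that the Date column has dtype datetime64. This is required because treating these values as timestamps lets you group the data by day with the pandas resample method.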
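If the Date column was loaded as plain strings (dtype object), it can be converted before resampling. This is a minimal sketch, assuming a small hypothetical frame named `raw` whose dates are strings:

```python
import pandas as pd

# Hypothetical frame whose dates were loaded as strings (dtype object).
raw = pd.DataFrame({
    "Date": ["2017-01-01", "2017-01-02"],
    "A": ["a", "b"],
    "B": ["b", "c"],
    "Number": [0.9240, 0.5637],
})

# pd.to_datetime parses the strings into datetime64 values.
raw["Date"] = pd.to_datetime(raw["Date"])

# Confirm the conversion took effect before calling resample.
is_datetime = pd.api.types.is_datetime64_any_dtype(raw["Date"])
print(is_datetime)
```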

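After resampling the data, the custom function extract can be applied. It receives one group as a DataFrame and applies the logic to it. Using the pivot_table method makes it easier to find the intersection between columns A and B. I'm not sure this is the most efficient approach, but it should be fast enough as long as the dataset isn't too large.

The full code looks like this: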
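The intersection step can be looked at in isolation. The sketch below uses plain Python sets instead of pivot_table (a swapped-in simplification, not the approach of the full code) on the rows for 2017-01-01. The names `day`, `shared` and `best_b` are only for illustration.

```python
import pandas as pd

# Rows for 2017-01-01 taken from the question.
day = pd.DataFrame({
    "A": ["a", "b", "d", "c"],
    "B": ["b", "c", "e", "g"],
    "Number": [0.9240, 0.9101, 0.8761, 0.9762],
})

# Values that occur in both column A and column B.
shared = set(day["A"]) & set(day["B"])
print(sorted(shared))  # ['b', 'c']

# For one shared value, keep the row with the largest Number.
rows_with_b = day[(day["A"] == "b") | (day["B"] == "b")]
best_b = rows_with_b.sort_values("Number", ascending=False).iloc[[0]]
print(best_b)
```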

def extract(df):
    """For one day's rows, keep the highest-Number row per value shared by A and B."""
    dfs = []
    pt = df.reset_index().pivot_table('Number', columns=['A', 'B'], index='Date')

    # find any intersection of values between col A and B
    intersection = set(pt.columns.levels[0].values)\
        .intersection(set(pt.columns.levels[1].values))
    # iterate over all intersections to compare their values 
    # and choose the largest one
    for value in intersection:
        mask = (df['A'] == value) | (df['B'] == value)
        df_intersection = df[mask]\
            .sort_values('Number', ascending=False)
        # .ix was removed in pandas 1.0; .iloc[[0]] selects the first
        # (largest) row by position
        dfs.append(df_intersection.iloc[[0]])

    # find all rows that do not contain any intersections
    df_rest = df[(~df['A'].isin(list(intersection))) &\
                 (~df['B'].isin(list(intersection)))]
    if (len(df_rest) > 0):
        dfs.append(df_rest)

    return pd.concat(dfs)

df.set_index('Date')\
    .resample('D')\
    .apply(extract)\
    .reset_index(level=1, drop=True)

This code results in:

            A  B  Number
Date                    
2017-01-01  a  b  0.9240
2017-01-01  c  g  0.9762
2017-01-01  d  e  0.8761
2017-01-02  c  d  0.9643
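The same per-day logic can also be driven by an explicit loop over groupby, which avoids relying on how apply reassembles the pieces. This is a sketch rather than a replacement for the code above: `pick_rows` and `unique_pairs_per_day` are hypothetical names, a set intersection stands in for pivot_table, and the frame is assumed to have a unique index (such as the default RangeIndex) and at least one row.

```python
import pandas as pd

def pick_rows(group):
    """Apply the selection logic to the rows of a single day."""
    # Values that appear in both column A and column B on this day.
    shared = set(group["A"]) & set(group["B"])
    touches_shared = group["A"].isin(shared) | group["B"].isin(shared)

    # Rows that contain no shared value are kept unchanged.
    rest = group[~touches_shared]
    parts = [rest] if len(rest) > 0 else []

    # For each shared value, keep the row with the largest Number.
    for value in sorted(shared):
        candidates = group[(group["A"] == value) | (group["B"] == value)]
        parts.append(candidates.nlargest(1, "Number"))

    combined = pd.concat(parts)
    # A row can win for two shared values; keep it only once.
    return combined[~combined.index.duplicated()]

def unique_pairs_per_day(df):
    """Run pick_rows on each day and restore the original row order."""
    parts = [pick_rows(group) for _, group in df.groupby("Date")]
    return pd.concat(parts).sort_index()

df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01"] * 4 + ["2017-01-02"] * 2),
    "A": ["a", "b", "d", "c", "b", "c"],
    "B": ["b", "c", "e", "g", "c", "d"],
    "Number": [0.9240, 0.9101, 0.8761, 0.9762, 0.5637, 0.9643],
})
result = unique_pairs_per_day(df)
print(result)
```

Deduplicating on the index rather than on row values means two genuinely identical input rows are not collapsed by accident.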

The code above is based on the given dataset:

import pandas as pd
from io import StringIO

data = StringIO("""\
Date        A   B   Number
2017-01-01  a   b   0.9240
2017-01-01  b   c   0.9101
2017-01-01  d   e   0.8761
2017-01-01  c   g   0.9762
2017-01-02  b   c   0.5637
2017-01-02  c   d   0.9643
""")
df = pd.read_csv(data, sep=r'\s+', parse_dates=[0])

Solution 1: 0, 2017-02-02 22:20:40