Using .iterrows() and Series.nlargest() to get the highest number in each row of a DataFrame

Question

I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row, find the largest number, and mark it with a 1. This is the data frame:

A   B    C
9   6    5
3   7    2

Here is the output I wish to have:

A    B   C
1    0   0
0    1   0

This is the function I wish to use here:

def get_top_n(df, top_n):
    """


    Parameters
    ----------
    df : DataFrame

    top_n : int
        The top number to get
    Returns
    -------
    top_numbers : DataFrame
    Returns the top number marked with a 1

    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()

    return top_numbers

I get the following error: AttributeError: 'tuple' object has no attribute 'nlargest'

Help would be appreciated on how to rewrite my function in a neater way so that it actually works! Thanks in advance.

Answer 1

Add an i variable, because iterrows returns a tuple of the index label and a Series for each row:

for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
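To make the cause of the AttributeError concrete, here is a minimal sketch on the question's data. The variable names (first_item, index_label, largest) are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# Each item produced by iterrows() is a 2-tuple: (index label, row as a Series).
first_item = next(df.iterrows())
index_label, row = first_item

# nlargest lives on the Series, not on the tuple that wraps it.
largest = row.nlargest(1)
```

Unpacking the tuple in the for statement (for i, row in df.iterrows()) does the same thing on every iteration.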

General solution with numpy.argsort: applying argsort twice to the negated values gives each value's rank in descending order within its row. Compare the ranks with top_n and convert the boolean array to integers:

import pandas as pd

def get_top_n(df, top_n):
    if not isinstance(top_n, int):
        raise ValueError("Value is not an integer")
    if top_n > len(df.columns):
        raise ValueError("Value is higher than the number of columns")

    # argsort alone returns sorting positions; a second argsort turns them into ranks (0 = largest)
    ranks = (-df.values).argsort(axis=1).argsort(axis=1)
    arr = (ranks < top_n).astype(int)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

df1 = get_top_n(df, 2)
print (df1)
   A  B  C
0  1  1  0
1  1  1  0

df1 = get_top_n(df, 1)
print (df1)
   A  B  C
0  1  0  0
1  0  1  0
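As an alternative vectorized route, the same marking can probably be expressed with DataFrame.rank instead of NumPy sorting. This is a sketch under the assumption that ties should be broken by column order; the function name mark_top_n_by_rank is hypothetical and does not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

def mark_top_n_by_rank(frame, top_n):
    # Rank each row's values in descending order: the largest value gets rank 1.
    # method="first" breaks ties by column order so exactly top_n cells are marked.
    ranks = frame.rank(axis=1, ascending=False, method="first")
    return (ranks <= top_n).astype(int)

top1 = mark_top_n_by_rank(df, 1)
top2 = mark_top_n_by_rank(df, 2)
```

On the question's data, top1 marks A in the first row and B in the second, matching the desired output.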

EDIT:

A solution with iterrows is possible, but not recommended, because it is slow:

top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1

print (df)
   A  B  C
0  1  1  0
1  1  1  0
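Note that the loop above overwrites df in place. If the original values are still needed, one option is to write the marks into a separate frame. The sketch below assumes unique index labels, and mark_top_n_iterrows is a hypothetical name used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

def mark_top_n_iterrows(frame, top_n):
    # Start from an all-zero frame with the same labels so the input is untouched.
    marked = pd.DataFrame(0, index=frame.index, columns=frame.columns)
    for i, row in frame.iterrows():
        # Column labels of the top_n largest values in this row.
        top = row.nlargest(top_n).index
        marked.loc[i, top] = 1
    return marked

result = mark_top_n_iterrows(df, 2)
```

It is still a row-by-row loop, so the performance caveat applies here as well.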

Answer 2

For context, the dataframe consists of stock return data for the S&P 500 over approximately 4 years.

def get_top_n(prev_returns, top_n):

    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns = prev_returns.columns, index = prev_returns.index)

    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)

    # merge dataframes
    top_stocks = top_stocks.merge(df, how = 'right').set_index(df.index)

    # return dataframe replacing non_zero answers with a 1
    return (top_stocks.notnull()) * 1

Answer 3

Alternatively, a two-line solution could be:

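def get_top_n(df, top_n):

    # find top_n largest entries by row; all other cells become NaN
    top = df.apply(lambda x: x.nlargest(top_n), axis=1)

    # restore columns that never made a row's top_n (apply leaves them out)
    top = top.reindex(columns=df.columns)

    # non-NaN entries become 1 and NaN entries become 0 (np.int was removed from NumPy; use int)
    top_numbers = top.notnull().astype(int)

    return top_numbers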
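One thing to watch with the apply/nlargest approach: the intermediate frame appears to contain only the columns that reach some row's top n, so a column that never does can go missing from the result. Here is a minimal sketch on the question's data, with illustrative variable names (raw, marked), that restores such columns with reindex:

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# Row-wise nlargest keeps only the labels that were selected in some row.
raw = df.apply(lambda row: row.nlargest(1), axis=1)

# Column "C" is never a row maximum here, so put it back before marking.
marked = raw.reindex(columns=df.columns).notnull().astype(int)
```

After the reindex, the missing column comes back filled with NaN, which notnull() then turns into 0.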


3 answers

Answer 1: score 6, accepted, 2018-08-02 05:15:14

Answer 2: score 2, 2018-10-01 20:15:54

Answer 3: score 1, 2021-08-21 03:03:48
