Using .iterrows() with Series.nlargest() to get the highest number in each row of a DataFrame
I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row, find the largest number, and mark it with a 1.
This is the data frame:
A B C
9 6 5
3 7 2
Here is the output I wish to have:
A B C
1 0 0
0 1 0
This is the function I wish to use here:
def get_top_n(df, top_n):
    """
    Parameters
    ----------
    df : DataFrame
    top_n : int
        The top number to get
    Returns
    -------
    top_numbers : DataFrame
        Returns the top number marked with a 1
    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()
    return top_numbers
I get the following error: AttributeError: 'tuple' object has no attribute 'nlargest'
I would appreciate help rewriting my function in a neater way so that it actually works. Thanks in advance.
Add an i variable, because iterrows yields an (index, Series) tuple for each row:
for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
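To make the cause of the AttributeError concrete, here is a minimal sketch that inspects what iterrows produces. The sample frame is rebuilt from the question; the variable names first, idx, and largest are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# iterrows() yields (index, Series) tuples, so each item must be unpacked
first = next(df.iterrows())
idx, row = first

# nlargest is a Series method, so it works on the unpacked row, not on the tuple
largest = row.nlargest(1)
print(type(first).__name__, idx, largest.index.tolist())
```

Calling nlargest on first directly would raise the same error as in the question, since first is a plain tuple.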
A general solution uses numpy.argsort to rank the values of each row in descending order, then compares the ranks with top_n and converts the boolean array to integers. A single argsort returns the positions that would sort the row, not the rank of each value, so it is applied twice:
import pandas as pd

def get_top_n(df, top_n):
    if not isinstance(top_n, int):
        raise ValueError("Value is not integer")
    if top_n > len(df.columns):
        raise ValueError("Value is higher than the number of columns")
    # first argsort gives the sort order, the second turns it into ranks (0 = largest)
    arr = ((-df.values).argsort(axis=1).argsort(axis=1) < top_n).astype(int)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)
df1 = get_top_n(df, 2)
print (df1)
A B C
0 1 1 0
1 1 1 0
df1 = get_top_n(df, 1)
print (df1)
A B C
0 1 0 0
1 0 1 0
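As a cross-check on the vectorized approach, the same marking can be sketched with DataFrame.rank, which assigns ranks within each row directly. This is an alternative technique, not the answer's own code; the function name get_top_n_rank and the extra third row are assumptions added for illustration.

```python
import pandas as pd

def get_top_n_rank(df, top_n):
    # Hypothetical helper: rank 1 is the largest value in each row;
    # method="first" breaks ties by column order
    ranks = df.rank(axis=1, ascending=False, method="first")
    return (ranks <= top_n).astype(int)

# Question's data plus one extra row (5, 9, 7) added for illustration
df = pd.DataFrame({"A": [9, 3, 5], "B": [6, 7, 9], "C": [5, 2, 7]})
result = get_top_n_rank(df, 1)
print(result)
```

The extra row is useful because its sort order is not its own inverse, so it distinguishes true ranks from raw argsort positions.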
EDIT:
A solution with iterrows is possible, but not recommended because it is slow:
top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1
print (df)
A B C
0 1 1 0
1 1 1 0
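Note that the loop above overwrites df in place. If the original values are still needed, a variant that writes into a separate frame might look like the following sketch; the name marked is hypothetical, and the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
top_n = 2

# Start from an all-zero frame with the same shape and labels
marked = pd.DataFrame(0, index=df.index, columns=df.columns)
for i, row in df.iterrows():
    # Set 1 only in the columns holding the row's top_n values
    marked.loc[i, row.nlargest(top_n).index] = 1

print(marked)
```

This is still row-by-row iteration, so the same performance caveat applies.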
For context, the dataframe consists of stock return data for the S&P 500 over approximately 4 years.
def get_top_n(prev_returns, top_n):
    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns=prev_returns.columns, index=prev_returns.index)
    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)
    # merge dataframes
    top_stocks = top_stocks.merge(df, how='right').set_index(df.index)
    # return dataframe replacing non-null entries with a 1
    return (top_stocks.notnull()) * 1
Alternatively, the 2-line solution could be:
def get_top_n(df, top_n):
    # find top_n largest entries in each row
    df = df.apply(lambda x: x.nlargest(top_n), axis=1)
    # non-NaN entries become True and NaN entries False, then convert to 1 and 0
    # (the np.int alias was removed from recent NumPy releases, so the builtin int is used)
    top_numbers = df.notnull().astype(int)
    return top_numbers
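One caveat with the apply-based approach: as far as I can tell, the frame returned by apply only contains columns that appear in at least one row's top_n, so a column that is never on top can be missing from the result. A hedged sketch that restores the original columns with reindex follows; the function name get_top_n_apply is an assumption, and the data is the sample from the question.

```python
import pandas as pd

def get_top_n_apply(df, top_n):
    # Hypothetical helper: keep each row's top_n values, NaN elsewhere
    top = df.apply(lambda row: row.nlargest(top_n), axis=1)
    # Restore any columns that never made a row's top_n, in the original order
    top = top.reindex(columns=df.columns)
    # Non-null entries are the top values: mark them 1, everything else 0
    return top.notnull().astype(int)

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
result = get_top_n_apply(df, 1)
print(result)
```

With top_n=1 on this data, column C never holds a row maximum, so it is the column the reindex step is meant to bring back.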