Pandas Dataframe: for a given row, trying to assign value in a certain column based on a lookup of a value in another column
For a given row i, I am trying to set i's value in the column 'Adj' based on i's value in another column, 'Local Max String'. Row i's value in 'Local Max String' needs to be looked up in another column of the DataFrame, 'Date String'. The row that contains that value, row q, then supplies its 'Adj Close' value as the value for row i's 'Adj' column.
Sorry if that is difficult to understand. The following for loop accomplishes what I want, but I think there should be a better way to do it in pandas. I tried using apply and lambda functions, but I got an error saying assignment wasn't possible, and I'm unsure whether my approach was correct. The for loop also takes extremely long to complete.
Here's the code:
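for x in range(len(df.index)):
    # Find the row whose 'Date String' equals row x's 'Local Max String'
    match = df.loc[df['Date String'] == df['Local Max String'].iloc[x], 'Adj Close']
    df.loc[df.index[x], 'Adj'] = match.iloc[0]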
Here's a picture of the DF to get a better idea of what I mean. The value in the Adj column will look for the Adj Close value corresponding to the Date in Local Max String.
import numpy as np
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data as pdr
import matplotlib.pyplot as plt
import datetime
import fix_yahoo_finance as yf
yf.pdr_override() # <== that's all it takes :-)
# Dates for data
start_date = datetime.datetime(2017,11,1)
end_date = datetime.datetime(2018,11,1)
df = pdr.get_data_yahoo('SPY', start=start_date, end=end_date)
df.data = df['Adj Close']
df['Most Recent Local Max'] = np.nan
df['Date'] = df.index
local_maxes = list(df[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)].index)
local_maxes.append(df['Date'][0] - datetime.timedelta(days=1))
def nearest(items, pivot):
    return min([d for d in items if d < pivot], key=lambda x: abs(x - pivot))
df['Most Recent Local Max'] = df['Date'].apply(lambda x: min([d for d in local_maxes if d < x], key=lambda y: abs(y - x)) )
df['Local Max String'] = df['Most Recent Local Max'].apply(lambda x: str(x))
df['Date String'] = df['Date'].apply(lambda x: str(x))
df.loc[df['Local Max String'] == str(df['Date'][0] - datetime.timedelta(days=1)), 'Local Max String'] = str(df['Date'][0])
df['Adj'] = np.nan
Thanks!
This solution still has a for loop, but it reduces the number of iterations from df.shape[0] (one per row) to df['Local Max String'].nunique(), so it may be fast enough:
for a_local_max in df['Local Max String'].unique():
    # Rows pointing at this local max receive the 'Adj Close' of the row dated at it
    df.loc[df['Local Max String'] == a_local_max, 'Adj'] = df.loc[df['Date String'] == a_local_max, 'Adj Close'].iloc[0]
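The lookup described in the question can probably also be written without any loop by turning 'Date String' into the index of a lookup Series and passing it to Series.map. This is only a minimal sketch: the toy values below are hypothetical, and it assumes each 'Date String' appears once (duplicates are dropped defensively).

```python
import pandas as pd

# Hypothetical toy frame reusing the column names from the question.
df = pd.DataFrame({
    'Date String': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04'],
    'Local Max String': ['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-02'],
    'Adj Close': [10.0, 12.0, 11.0, 11.5],
})

# Lookup Series: index is the date string, values are the adjusted close.
# map() needs a unique index, so duplicate dates are dropped first.
lookup = df.drop_duplicates('Date String').set_index('Date String')['Adj Close']

# Translate every 'Local Max String' into the matching 'Adj Close' in one step.
# Dates with no match in the lookup would come out as NaN.
df['Adj'] = df['Local Max String'].map(lookup)

print(df['Adj'].tolist())  # [10.0, 10.0, 12.0, 12.0]
```

On the real data the string columns may not even be needed, since the same idea should work when mapping 'Most Recent Local Max' against a Series indexed by 'Date'.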
Often you can skip the for loop by using an apply-like function in pandas. Below, I define a wrapper function that combines variables row-wise. This function is then applied to the data frame to create the result variable. The key is to think at the row level inside the wrapper function and to signal this to the apply function with the axis=1 argument.
import pandas as pd
import numpy as np
# Dummy data containing two columns with overlapping data
df = pd.DataFrame({'date': 100*np.random.sample(10000), 'string': 2500*['hello', 'world', '!', 'mars'], 'another_string': 10000*['hello']})
# Here you define the operation at the row level
def wrapper(row):
    # uncomment if the transformation is to be applied to every row:
    # return 2*row['date']
    # if you need to first test some condition:
    if row['string'] == row['another_string']:
        return 2*row['date']
    else:
        return 0
# Finally you generate the new column using the operation defined above.
df['result'] = df.apply(wrapper, axis=1)
This code completes in 195 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
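For a simple condition like the one in this dummy example, a column-wise expression is usually faster than a row-wise apply. The sketch below uses numpy.where on the same dummy data and checks that it agrees with the wrapper. The column name result_vectorized is introduced here only for illustration, and no timing is claimed.

```python
import numpy as np
import pandas as pd

# Same dummy data as in the example above.
df = pd.DataFrame({
    'date': 100 * np.random.sample(10000),
    'string': 2500 * ['hello', 'world', '!', 'mars'],
    'another_string': 10000 * ['hello'],
})

def wrapper(row):
    # Row-level rule: double 'date' when the two strings match, else 0.
    if row['string'] == row['another_string']:
        return 2 * row['date']
    return 0

# Row-wise version, as in the answer.
df['result'] = df.apply(wrapper, axis=1)

# Column-wise version: the comparison and the arithmetic run on whole columns.
matches = df['string'] == df['another_string']
df['result_vectorized'] = np.where(matches, 2 * df['date'], 0)

# Both approaches should produce the same values.
print(np.allclose(df['result'], df['result_vectorized']))
```

apply with axis=1 remains the more flexible option when the row-level logic cannot easily be expressed with whole-column operations.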