Pandas: locate a value based on logical statements
I am using this dataset for a project. I am trying to find the total yield for each inverter over the 34-day duration of the dataset (basically, the difference between the final and initial values available for each inverter). I have been able to get the list of inverters using pd.unique() (there are 22 inverters for each solar power plant).
I am having trouble querying the TOTAL_YIELD data for each inverter. Here is what I have tried:
def get_yields(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc[df["DATE_TIME"] == "15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"] == i]
        initial.reset_index(inplace=True, drop=True)
        initial = initial.at[0, "TOTAL_YIELD"]
        final = df.loc[df["DATE_TIME"] == "17-06-2020 23:45"]
        final = final.loc[final["INVERTER_ID"] == i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0, "TOTAL_YIELD"]
        delta[index] = final - initial
        index = index + 1
    return delta
For reference: arr is the array of inverters, listed below, and df is the generation dataframe for each plant.
The problem is that not every inverter has a data point for each interval, so this function only works for the inverters at the first plant, not the second one.
My second approach was to filter by the inverter first, then take the first and last data points. But I get an error: 'Series' objects are mutable, thus they cannot be hashed
Here is the code for that so far:
def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta
List of inverters at plant 1 for reference (labeled as SOURCE_KEY):
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']
List of inverters at plant 2:
['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
Thank you very much.
I can't download the dataset to test this; I'm getting a "Too Many Requests" error.
However, you should be able to do this with a groupby.
import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])
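To see what the groupby approach produces without the real dataset, here is a minimal sketch on a made-up frame. The inverter IDs "A" and "B" and all yield values are invented for illustration; only the column names come from the question. It also shows a last-minus-first variant, which matches "final minus initial" more literally than max minus min if TOTAL_YIELD ever fails to increase monotonically (that is an assumption about the data, not something checked here).

```python
import pandas as pd

# Toy data: inverter IDs and values are invented for illustration.
# Inverter "B" has no reading at 02:00 or 23:45, mimicking missing intervals.
df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 02:00", "2020-05-15 02:15", "2020-06-17 23:45",
        "2020-05-15 02:15", "2020-05-15 02:30", "2020-06-17 23:30",
    ]),
    "INVERTER_ID": ["A", "A", "A", "B", "B", "B"],
    "TOTAL_YIELD": [100.0, 110.0, 500.0, 20.0, 25.0, 320.0],
})

# max - min per inverter, as in the answer above
result = df.groupby("INVERTER_ID")["TOTAL_YIELD"].agg(["max", "min"])
result["delta"] = result["max"] - result["min"]

# Variant: last - first reading per inverter after sorting by time
ordered = df.sort_values("DATE_TIME").groupby("INVERTER_ID")["TOTAL_YIELD"]
delta_first_last = ordered.last() - ordered.first()

print(result[["delta"]])
print(delta_first_last)
```

Both versions work per inverter on whatever rows exist, so neither depends on every inverter having a reading at a fixed timestamp.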
So if I'm understanding this right, what you want is the change in TOTAL_YIELD for each inverter over the time period starting 15-05-2020 02:00 and ending 17-06-2020 23:45. Try this:
import numpy as np
import pandas as pd

# DATE_TIME must hold datetimes for the comparisons below to work;
# dayfirst=True assumes day-first strings like "15-05-2020 02:00"
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'], dayfirst=True)
start = pd.to_datetime('15-05-2020 02:00:00', dayfirst=True)
end = pd.to_datetime('17-06-2020 23:45:00', dayfirst=True)

delta = np.zeros(len(arr))
# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # filter the info to between the two dates, but not necessarily assuming that
    # each inverter's data starts and ends at each date
    inverter_df = df.loc[df['DATE_TIME'] >= start]
    inverter_df = inverter_df.loc[inverter_df['DATE_TIME'] <= end]
    inverter_df = inverter_df.loc[inverter_df['INVERTER_ID'] == code]
    # sort by date
    inverter_df = inverter_df.sort_values(by='DATE_TIME')
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial
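As for the error in the question's second attempt: df.loc is an indexer that is subscripted with square brackets. Calling it with parentheses, as in df.loc(df["INVERTER_ID"] == i), passes the boolean Series as an axis argument, which is most likely what triggers the "'Series' objects are mutable, thus they cannot be hashed" message. Below is a hedged sketch of that second approach with bracket indexing. It assumes DATE_TIME has already been parsed to datetimes, and the toy frame (inverter IDs "A" and "B" and their values) is invented for illustration.

```python
import numpy as np
import pandas as pd


def get_yields2(arr: np.ndarray, df: pd.DataFrame) -> np.ndarray:
    """Last minus first TOTAL_YIELD reading for each inverter ID in arr."""
    delta = np.zeros(len(arr))
    for idx, inverter in enumerate(arr):
        # Square brackets, not parentheses, for boolean-mask selection
        rows = df.loc[df["INVERTER_ID"] == inverter].sort_values("DATE_TIME")
        delta[idx] = rows["TOTAL_YIELD"].iloc[-1] - rows["TOTAL_YIELD"].iloc[0]
    return delta


# Toy data, invented for illustration; "B" lacks the 02:00 and 23:45 readings
df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 02:00", "2020-06-17 23:45",
        "2020-05-15 02:15", "2020-06-17 23:30",
    ]),
    "INVERTER_ID": ["A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 500.0, 20.0, 320.0],
})

arr = pd.unique(df["INVERTER_ID"])
deltas = get_yields2(arr, df)
print(deltas)
```

Note that .iloc[0] raises an IndexError if an ID in arr has no rows in df, so arr should be built from the same frame that is passed in.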