
Pandas: locate a value based on logical statements

I am using this dataset for a project. I am trying to find the total yield for each inverter over the 34-day duration of the dataset (basically, the difference between the final and initial values available for each inverter). I have been able to get the list of inverters using pd.unique() (there are 22 inverters for each solar power plant).

I am having trouble querying the TOTAL_YIELD data for each inverter. Here is what I have tried:

def get_yields(arr: np.ndarray, df:pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index =0
    for i in arr:
        initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"]==i]
        initial.reset_index(inplace=True,drop=True)
        initial = initial.at[0,"TOTAL_YIELD"]
        final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
        final = final.loc[final["INVERTER_ID"]==i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0,"TOTAL_YIELD"]

        delta[index] = final - initial
        index = index + 1
    return delta

For reference: arr is the array of inverters, listed below. df is the generation dataframe for each plant.
The problem is that not every inverter has a data point for each interval. As a result, this function only works for the inverters at the first plant, not the second one.

My second approach was to filter by the inverter first, then take the first and last data points. But I get an error: 'Series' objects are mutable, thus they cannot be hashed. Here is the code for that so far:

def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta

List of inverters at plant 1 for reference (labeled as SOURCE_KEY):

['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
 'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
 'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
 'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
 'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
 'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']

List of inverters at plant 2:

['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
 'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
 'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
 'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
 'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
 'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']

Thank you very much.

I can't download the dataset to test this; I'm getting a "Too Many Requests" error.

However, you should be able to do this with a groupby.

import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])  
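
One caveat: max minus min equals last minus first only if TOTAL_YIELD never decreases for an inverter. A minimal sketch of a variant that takes the first and last reading per inverter in time order is below. The toy data and the helper name yield_delta are made up for illustration; the column names are the ones used in the question, and DATE_TIME is assumed to be in day-month-year format.

```python
import pandas as pd

def yield_delta(df: pd.DataFrame) -> pd.Series:
    """Last minus first TOTAL_YIELD per inverter, in time order."""
    ordered = df.sort_values("DATE_TIME")
    grouped = ordered.groupby("INVERTER_ID")["TOTAL_YIELD"]
    return grouped.last() - grouped.first()

# Hypothetical toy data: inverter "B" has no reading at the first
# or the last timestamp, which is the situation described for plant 2.
toy = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["15-05-2020 02:00", "15-05-2020 02:15", "17-06-2020 23:45",
         "15-05-2020 02:15", "17-06-2020 23:30"],
        format="%d-%m-%Y %H:%M",
    ),
    "INVERTER_ID": ["A", "A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 105.0, 180.0, 50.0, 90.0],
})

deltas = yield_delta(toy)
print(deltas)
```

Because each inverter is handled within its own group, a missing reading at a particular timestamp does not matter; whatever first and last rows exist for that inverter are used.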

So if I'm understanding this right, what you want is the change in TOTAL_YIELD for each inverter over the time period starting 15-05-2020 02:00 and ending 17-06-2020 23:45, using whatever first and last readings each inverter has in that window. Try this:

import numpy as np
import pandas as pd

# assumes df['DATE_TIME'] has already been converted with pd.to_datetime(..., dayfirst=True)
start = pd.to_datetime('15-05-2020 02:00:00', dayfirst=True)
end = pd.to_datetime('17-06-2020 23:45:00', dayfirst=True)
delta = np.zeros(len(arr))

# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # filter the info to between the two dates, but without assuming that
    # each inverter's data starts and ends exactly at those dates
    inverter_df = df.loc[(df['DATE_TIME'] >= start) & (df['DATE_TIME'] <= end)]
    inverter_df = inverter_df.loc[inverter_df['INVERTER_ID'] == code]

    # sort by date
    inverter_df = inverter_df.sort_values(by='DATE_TIME')
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial
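
As for the error in the second attempt: .loc is an indexer, so it takes square brackets. Calling it with parentheses, as in df.loc(df["INVERTER_ID"] == i), is most likely what produces the "'Series' objects are mutable, thus they cannot be hashed" message. Below is a sketch of how get_yields2 could be completed with that fixed. It assumes DATE_TIME has already been parsed to datetimes, and the toy data and the NaN fallback for inverters without rows are illustrative choices, not something from the question.

```python
import numpy as np
import pandas as pd

def get_yields2(arr: np.ndarray, df: pd.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    for index, inverter in enumerate(arr):
        # Square brackets: boolean-mask selection of this inverter's rows.
        rows = df.loc[df["INVERTER_ID"] == inverter].sort_values("DATE_TIME")
        if rows.empty:
            # Assumption: report NaN when an inverter has no rows at all.
            delta[index] = np.nan
            continue
        # First and last available readings, whatever their timestamps are.
        delta[index] = rows["TOTAL_YIELD"].iloc[-1] - rows["TOTAL_YIELD"].iloc[0]
    return delta

# Hypothetical toy data; "C" is requested but has no rows.
toy = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["15-05-2020 02:00", "17-06-2020 23:45",
         "15-05-2020 02:15", "17-06-2020 23:30"],
        format="%d-%m-%Y %H:%M",
    ),
    "INVERTER_ID": ["A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 180.0, 50.0, 90.0],
})

result = get_yields2(np.array(["A", "B", "C"]), toy)
print(result)
```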
