Pandas - locate a value based on logical statements

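I am using this dataset for a project. I am trying to find the total yield of each inverter over the dataset's 34-day duration (basically using the final and initial values available for each inverter). I have been able to get the list of inverters using pd.unique() (each solar plant has 22 inverters).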

I am having trouble querying the total_yield data for each inverter. Here is what I tried:

def get_yields(arr: np.ndarray, df:pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index =0
    for i in arr:
        initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"]==i]
        initial.reset_index(inplace=True,drop=True)
        initial = initial.at[0,"TOTAL_YIELD"]
        final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
        final = final.loc[final["INVERTER_ID"]==i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0,"TOTAL_YIELD"]

        delta[index] = final - initial
        index = index + 1
    return delta

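For reference: arr is the array of inverters, listed below. df is the generation DataFrame for each plant.
The problem is that not every inverter has a data point for every interval. As a result, the function works only for the inverters of the first plant, not for those of the second.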

My second approach was to filter by inverter first and then grab the first and last data points. But I get an error: 'Series' objects are mutable, thus they cannot be hashed. Here is the code so far:

def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta

For reference, the list of plant 1 inverters (labelled SOURCE_KEY):

['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
 'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
 'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
 'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
 'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
 'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']

List of plant 2 inverters:

['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
 'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
 'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
 'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
 'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
 'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']

Thanks a lot.

I was not able to download the dataset to test this; I got a "Too Many Requests" error.

However, you should be able to do this with groupby.

import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])  
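
The max/min version above relies on TOTAL_YIELD being a cumulative counter that never decreases. If that assumption is in doubt, one variant of the same groupby idea is to sort by time and take each inverter's first and last available readings. A minimal sketch on invented data (the sample rows and values are made up for illustration; the column names follow the question):

```python
import pandas as pd

# Tiny invented sample; inverter "B" has no reading at the first timestamp.
df = pd.DataFrame({
    "DATE_TIME": ["15-05-2020 02:00", "15-05-2020 02:15", "17-06-2020 23:45",
                  "15-05-2020 02:15", "17-06-2020 23:45"],
    "INVERTER_ID": ["A", "A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 110.0, 400.0, 50.0, 250.0],
})

# Parse the day-first timestamps so sorting is chronological, not alphabetical.
df["DATE_TIME"] = pd.to_datetime(df["DATE_TIME"], format="%d-%m-%Y %H:%M")

# After sorting, "first" and "last" are each inverter's earliest and latest readings.
result = (
    df.sort_values("DATE_TIME")
      .groupby("INVERTER_ID")["TOTAL_YIELD"]
      .agg(["first", "last"])
)
result["delta"] = result["last"] - result["first"]
print(result[["delta"]])
```

Because each inverter is reduced to its own earliest and latest rows, it does not matter that some inverters are missing particular intervals.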

So, if I understand this correctly, what you want is the change in each inverter's TOTAL_YIELD over the period starting 15-05-2020 02:00 and ending 17-06-2020 23:45. Try this:

import numpy as np
import pandas as pd

# assumes df['DATE_TIME'] has already been converted with pd.to_datetime
start = pd.to_datetime('15-05-2020 02:00:00', dayfirst=True)
end = pd.to_datetime('17-06-2020 23:45:00', dayfirst=True)
delta = np.zeros(len(arr))

# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # to filter the info to between the two dates, but not necessarily assuming that
    # each inverter's data starts and ends at each date
    inverter_df = df.loc[df['DATE_TIME'] >= start]
    inverter_df = inverter_df.loc[inverter_df['DATE_TIME'] <= end]
    inverter_df = inverter_df.loc[inverter_df["INVERTER_ID"] == code]

    # sort by date (returns a new frame, so no in-place edit of a slice)
    inverter_df = inverter_df.sort_values(by='DATE_TIME')
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial
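
For comparison with get_yields2 in the question, the same filter-then-take-first-and-last idea can be wrapped in a function. This is only a sketch: the name get_yields3 and the sample data are made up, and it assumes DATE_TIME has already been parsed to datetimes. It indexes with df.loc[...] in square brackets, because calling df.loc(...) with parentheses makes pandas treat the boolean Series as an axis argument, which is what raises the "cannot be hashed" error.

```python
import numpy as np
import pandas as pd


def get_yields3(arr: np.ndarray, df: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: last minus first TOTAL_YIELD for each inverter."""
    delta = np.zeros(len(arr))
    for i, code in enumerate(arr):
        # Square brackets: boolean-mask indexing, not a function call.
        rows = df.loc[df["INVERTER_ID"] == code].sort_values("DATE_TIME")
        if rows.empty:
            delta[i] = np.nan  # no readings at all for this inverter
            continue
        delta[i] = rows["TOTAL_YIELD"].iloc[-1] - rows["TOTAL_YIELD"].iloc[0]
    return delta


# Invented sample data; "B" lacks the first timestamp and "C" has no rows.
df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["15-05-2020 02:00", "17-06-2020 23:45",
         "15-05-2020 02:15", "17-06-2020 23:45"],
        format="%d-%m-%Y %H:%M",
    ),
    "INVERTER_ID": ["A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 400.0, 50.0, 250.0],
})
arr = np.array(["A", "B", "C"])
deltas = get_yields3(arr, df)
print(dict(zip(arr, deltas)))
```

Returning NaN for an inverter with no rows is a design choice for this sketch; raising an error or skipping the inverter would be equally reasonable.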
