
Pandas: locate a value based on logical statements

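I am using this dataset for a project. I am trying to find the total yield of each inverter over the dataset's 34-day duration (basically the final available value minus the initial available value for each inverter). I have been able to get the list of inverters with pd.unique() (each solar power plant has 22 inverters).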

I am unable to query the total_yield data for each inverter. Here is what I have tried:

def get_yields(arr: np.ndarray, df:pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index =0
    for i in arr:
        initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"]==i]
        initial.reset_index(inplace=True,drop=True)
        initial = initial.at[0,"TOTAL_YIELD"]
        final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
        final = final.loc[final["INVERTER_ID"]==i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0,"TOTAL_YIELD"]

        delta[index] = final - initial
        index = index + 1
    return delta

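For reference: arr is the array of inverters, as listed below. df is the generation dataframe for each plant.
The problem is that not every inverter has a data point at every interval. As a result, this function works only for the inverters of the first plant, not for those of the second.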

My second approach was to filter by inverter first and then take the first and last data points. But I get an error: 'Series' objects are mutable, thus they cannot be hashed. Here is the code so far:

def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta

For reference, the list of Plant 1 inverters (labelled SOURCE_KEY):

['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
 'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
 'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
 'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
 'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
 'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']

List of Plant 2 inverters:

['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
 'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
 'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
 'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
 'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
 'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']

Many thanks.

I couldn't download the dataset to test this; I got a "Too Many Requests" error.

However, you should be able to do this with groupby.

import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])  
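
To make the groupby approach concrete, here is a minimal sketch on a small made-up frame. The inverter IDs "A" and "B" and all values are invented for illustration; only the column names follow the question. Note that max minus min equals last minus first only if TOTAL_YIELD never decreases, which is an assumption about the data, not something checked here. The second variant below takes the first and last readings in time order instead.

```python
import pandas as pd

# Made-up sample: inverter "B" has no reading at the first or last timestamp.
df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["15-05-2020 02:00", "15-05-2020 02:15", "17-06-2020 23:45",
         "15-05-2020 02:15", "17-06-2020 23:30"],
        dayfirst=True,
    ),
    "INVERTER_ID": ["A", "A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 105.0, 400.0, 50.0, 250.0],
})

# Variant 1: max minus min per inverter.
by_range = df.groupby("INVERTER_ID")["TOTAL_YIELD"].agg(["max", "min"])
by_range["delta"] = by_range["max"] - by_range["min"]

# Variant 2: last minus first reading per inverter, after sorting by time.
ordered = df.sort_values("DATE_TIME")
by_time = ordered.groupby("INVERTER_ID")["TOTAL_YIELD"].agg(["first", "last"])
by_time["delta"] = by_time["last"] - by_time["first"]

print(by_range["delta"])
print(by_time["delta"])
```

Neither variant needs every inverter to have a row at a specific timestamp, which is what broke the exact-timestamp lookup in the question.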

So, if I understand correctly, what you want is each inverter's TOTAL_YIELD over the period starting at 15-05-2020 02:00 and ending at 17-06-2020 23:45. Try this:

import numpy as np
import pandas as pd

# assumes df['DATE_TIME'] holds datetimes, e.g. after
# df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'], dayfirst=True)
start = pd.to_datetime('15-05-2020 02:00:00', dayfirst=True)
end = pd.to_datetime('17-06-2020 23:45:00', dayfirst=True)
delta = np.zeros(len(arr))

# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # to filter the info to between the two dates, but not necessarily assuming that
    # each inverter's data starts and ends at each date
    inverter_df = df.loc[(df['DATE_TIME'] >= start) & (df['DATE_TIME'] <= end)]
    inverter_df = inverter_df.loc[inverter_df['INVERTER_ID'] == code]

    # sort by date
    inverter_df = inverter_df.sort_values(by='DATE_TIME')
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial
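
As for the error in the question's second attempt: `df.loc(...)` is called there with parentheses, whereas `.loc` is indexed with square brackets. That call is the likely source of "'Series' objects are mutable, thus they cannot be hashed". Below is a minimal sketch of the filter-by-inverter approach with that fixed. The function name `get_yields_first_last` and the sample data are made up for illustration, and DATE_TIME is assumed to be already parsed as datetimes.

```python
import numpy as np
import pandas as pd


def get_yields_first_last(arr, df):
    """Last minus first TOTAL_YIELD reading for each inverter ID in arr."""
    delta = np.zeros(len(arr))
    for i, code in enumerate(arr):
        # Square brackets: boolean-mask indexing with .loc
        rows = df.loc[df["INVERTER_ID"] == code].sort_values("DATE_TIME")
        if rows.empty:
            # No readings at all for this inverter
            delta[i] = np.nan
            continue
        delta[i] = rows["TOTAL_YIELD"].iloc[-1] - rows["TOTAL_YIELD"].iloc[0]
    return delta


# Made-up sample data (hypothetical inverter IDs and values)
sample = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["15-05-2020 02:00", "17-06-2020 23:45",
         "15-05-2020 02:15", "17-06-2020 23:30"],
        dayfirst=True,
    ),
    "INVERTER_ID": ["A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 400.0, 50.0, 250.0],
})

print(get_yields_first_last(np.array(["A", "B"]), sample))
```

Returning NaN for an inverter with no rows is a design choice made for this sketch; raising an error would be equally reasonable.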
