Filter Dataframe based on matched values in a column, and on min/max values timestamp of those values that matched

I have a list of email addresses that I want to find matches for in an ordered dictionary, which I then convert to a DataFrame.

Here is my list of email addresses:

email_list = ['c@aol.com','g@aol.com','b@aol.com','a@aol.com']

Here is my dictionary turned into a DataFrame (df2):

      sender   type                       _time
0  c@aol.com  email  2020-12-09 19:45:48.013140
1  c@aol.com  email  2020-13-09 19:45:48.013140
2  g@aol.com  email  2020-12-09 19:45:48.013140
3  b@aol.com  email  2020-14-11 19:45:48.013140

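I want to create a new DataFrame with a column for the matched sender, the number of matches (count), the first-seen date, and the last-seen date, all grouped by matched sender. The first-seen date is the minimum timestamp in the _time column for that sender, and the last-seen date is the maximum timestamp in the _time column for that sender.

Sample output after the script runs would look like this: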

      sender  count   type                  first_seen                   last_seen
0  c@aol.com      2  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140
1  g@aol.com      1  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140
2  b@aol.com      1  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140
3  a@aol.com      0  email                          NA                          NA

Here is my Python so far:

import pandas as pd

# Collect list of email addresses I want to find in df2
email_list = ['c@aol.com','g@aol.com','b@aol.com','a@aol.com']

# Turn email list into a dataframe
df1 = pd.DataFrame(email_list, columns=['sender'])

# Collect the table that holds the dictionary of emails sent
email_result_dict = {
    'sender': ['c@aol.com', 'c@aol.com', 'g@aol.com', 'b@aol.com'],
    'type': ['email', 'email', 'email', 'email'],
    '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
              '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140'],
}

# Turn dictionary into dataframe
df2 = pd.DataFrame.from_dict(email_result_dict)

# Calculate stats
c = df2.loc[df2['sender'].isin(df1['sender'].values)].groupby('sender').size().reset_index()
output = df1.merge(c, on='sender', how='left').fillna(0)
output['first_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmin)] # Get the earliest value in '_time' column
output['last_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmax)] # Get the latest value in '_time' column

# Set the columns of the new dataframe
output.columns = ['sender', 'count','first_seen', 'last_seen']

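Any ideas or suggestions on how to get my expected output in the DataFrame? I have tried everything and keep getting stuck on getting the first_seen and last_seen values for each match with a count greater than 0.

Based on your input df, you can use Groupby.agg: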

In [1190]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1191]: res
Out[1191]: 
      sender   type                       _time                                  
                                            min                         max count
0  b@aol.com  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140     1
1  c@aol.com  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140     2
2  g@aol.com  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140     1

EDIT: To remove the nested columns, do the following (note that this leaves the sender and type headers blank):

In [1206]: res.columns = res.columns.droplevel()

In [1207]: res
Out[1207]: 
                                            min                         max  count
0  b@aol.com  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
1  c@aol.com  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
2  g@aol.com  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1
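
As a possible alternative to dropping a column level, named aggregation gives flat, descriptive column names directly. This is a minimal sketch, not part of the original answer; it assumes a pandas version with named aggregation (0.25 or later) and rebuilds the question's df2 as df. The output names count, first_seen and last_seen are taken from the question's expected output.

```python
import pandas as pd

# Sample data rebuilt from the question (df2 there, df in this answer)
df = pd.DataFrame({
    "sender": ["c@aol.com", "c@aol.com", "g@aol.com", "b@aol.com"],
    "type": ["email"] * 4,
    "_time": [
        "2020-12-09 19:45:48.013140",
        "2020-13-09 19:45:48.013140",
        "2020-12-09 19:45:48.013140",
        "2020-14-11 19:45:48.013140",
    ],
})

# Named aggregation: each keyword becomes an output column,
# so there is no MultiIndex to flatten afterwards
res = (
    df.groupby(["sender", "type"], as_index=False)
      .agg(
          count=("_time", "count"),
          first_seen=("_time", "min"),
          last_seen=("_time", "max"),
      )
)
print(res)
```

With this form the sender and type columns keep their names, which the droplevel approach above does not preserve.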

EDIT-2: To use df1 as well:

In [1246]: df = df1.merge(df, how='left')
In [1254]: df.type = df.type.fillna('email')

In [1259]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1260]: res.columns = res.columns.droplevel()

In [1261]: res
Out[1261]: 
                                            min                         max  count
0  a@aol.com  email                         NaN                         NaN      0
1  b@aol.com  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
2  c@aol.com  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
3  g@aol.com  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1
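
The steps above compare the timestamps as strings. The middle field takes the values 13 and 14, which suggests the strings are year-day-month; if that is the case (an assumption, the question does not say), string min/max would not always be chronological, so it may be safer to parse the column first. The sketch below does that, then left-merges onto the email list so the result keeps the order of email_list and shows 0 for unmatched senders. Filling the missing type with 'email' follows the EDIT-2 step above.

```python
import pandas as pd

email_list = ["c@aol.com", "g@aol.com", "b@aol.com", "a@aol.com"]

df2 = pd.DataFrame({
    "sender": ["c@aol.com", "c@aol.com", "g@aol.com", "b@aol.com"],
    "type": ["email"] * 4,
    "_time": [
        "2020-12-09 19:45:48.013140",
        "2020-13-09 19:45:48.013140",
        "2020-12-09 19:45:48.013140",
        "2020-14-11 19:45:48.013140",
    ],
})

# ASSUMPTION: timestamps are year-day-month (13 and 14 cannot be months)
df2["_time"] = pd.to_datetime(df2["_time"], format="%Y-%d-%m %H:%M:%S.%f")

# Per-sender statistics for senders that appear in email_list
stats = (
    df2[df2["sender"].isin(email_list)]
    .groupby("sender", as_index=False)
    .agg(
        count=("_time", "count"),
        type=("type", "first"),
        first_seen=("_time", "min"),
        last_seen=("_time", "max"),
    )
)

# Left merge keeps every address in email_list, in its original order
output = pd.DataFrame({"sender": email_list}).merge(stats, on="sender", how="left")
output["count"] = output["count"].fillna(0).astype(int)
output["type"] = output["type"].fillna("email")
print(output)
```

Unmatched senders end up with NaT in first_seen and last_seen, which plays the role of the NA values in the expected output.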

I believe this code solves the problem.

Creating the data:

    import pandas as pd

    data = pd.DataFrame()
    data['sender'] = ['c@aol.com', 'c@aol.com', 'g@aol.com', 'b@aol.com']
    data['type'] = 'email'
    data['_time'] = ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
                     '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140']

Create a new df with the expected columns:

    new_data = pd.DataFrame(columns=['count', 'first_seen', 'last_seen', 'sender', 'type'])
    new_data['sender'] = list(set(data['sender'].values))  # unique senders from the input df
    new_data['type'] = 'email'  # constant

Loop over the list of unique senders:

    for j in new_data['sender']:
        temp_data = data[data['sender'] == j]  # rows for this sender only
        new_data.loc[new_data['sender'] == j, 'count'] = len(temp_data)  # number of matches

        if len(temp_data) > 1:  # multiple timestamps for this sender
            timings = list(set(temp_data['_time']))  # all distinct timestamps for this sender
            new_data.loc[new_data['sender'] == j, 'first_seen'] = min(timings)
            new_data.loc[new_data['sender'] == j, 'last_seen'] = max(timings)

        elif len(temp_data) == 1:  # a single timestamp for this sender
            only_time = temp_data.iloc[0]['_time']
            new_data.loc[new_data['sender'] == j, 'first_seen'] = only_time
            new_data.loc[new_data['sender'] == j, 'last_seen'] = only_time

You will find the required format in the new_data df.
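
The loop above only covers senders that actually occur in data, so an address such as a@aol.com from the question's email_list would not get a row. One way to include it, sketched here as an assumption rather than as part of this answer, is to aggregate once with groupby and then reindex against email_list. Like the loop, this sketch compares the timestamps as plain strings.

```python
import pandas as pd

email_list = ["c@aol.com", "g@aol.com", "b@aol.com", "a@aol.com"]

data = pd.DataFrame({
    "sender": ["c@aol.com", "c@aol.com", "g@aol.com", "b@aol.com"],
    "type": "email",
    "_time": [
        "2020-12-09 19:45:48.013140",
        "2020-13-09 19:45:48.013140",
        "2020-12-09 19:45:48.013140",
        "2020-14-11 19:45:48.013140",
    ],
})

# One pass over the data instead of one filter per sender
summary = (
    data.groupby("sender")["_time"]
        .agg(count="count", first_seen="min", last_seen="max")
        .reindex(email_list)  # adds rows for senders with no match
)

summary["count"] = summary["count"].fillna(0).astype(int)  # no match -> 0
summary.insert(1, "type", "email")  # constant, as in the loop version
summary = summary.rename_axis("sender").reset_index()
print(summary)
```

The resulting columns are sender, count, type, first_seen and last_seen, in the same order as the expected output in the question.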

