Counting Pattern Occurrence in Pandas DataFrame Column
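I have a Pandas DataFrame with a column in which each row contains a list. I'd like to know the most efficient way / best practice for identifying patterns in that list column, for example the average number of rejections before an acceptance. (See the example below.)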
sequence_of_selection
0 Accept,Reject,Reject,Reject,Reject,Accept,Reje...
1 Accept,Reject,Reject,Reject,Reject,Reject,Reje...
2 Reject,Accept,Accept,Reject,Reject,Reject,Acce...
3 Accept,Reject,Accept,Accept,Accept,Accept,Reje...
4 Reject,Accept,Reject,Accept,Reject,Reject,Acce...
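I could convert the data to strings and split them, or search the strings for substrings and so on, but I'd prefer a more efficient approach, since Python strings are immutable.
Any suggestions/help appreciated.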
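Since they are lists, you can get the index of 'Accept' in each one and then take the mean of those indices. If the index is 0, the first item in the list is 'Accept', so there are zero 'Reject' entries before it, and so on.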
df['sequence_of_selection'].apply(lambda x: x.index('Accept')).mean()
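A minimal sketch of this approach on a small made-up DataFrame (the sample rows and the helper name `rejects_before_first_accept` are hypothetical, not the asker's data). One caveat worth noting: `list.index` raises `ValueError` when 'Accept' never occurs, so the helper below returns NaN for such rows, and `mean()` skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each row holds a list of decisions
df = pd.DataFrame({
    "sequence_of_selection": [
        ["Accept", "Reject", "Reject"],
        ["Reject", "Reject", "Accept"],
        ["Reject", "Accept", "Accept"],
        ["Reject", "Reject"],  # no 'Accept' at all
    ]
})

def rejects_before_first_accept(seq, look_for="Accept"):
    """Return the index of the first `look_for`, or NaN if it never occurs."""
    try:
        return seq.index(look_for)
    except ValueError:
        return np.nan

first_accept = df["sequence_of_selection"].apply(rejects_before_first_accept)
avg_rejects = first_accept.mean()  # NaN rows are skipped by default
print(avg_rejects)
```

This only measures the rejections before the first 'Accept' in each row; later acceptances in the same list are ignored.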
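This is a really good question! I want to emphasize the power of inter-event intervals: there is a lot of insight into behavior and predictability to be had from sequences like the ones posted. With that in mind, I wrote a long answer that will hopefully explain some core principles of data manipulation.
1. Create a custom function to perform your calculation:
(This assumes you apply it to a single list; I'd actually recommend pulling out a single list when debugging or testing.)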
import numpy as np

def event_metrics(my_list, look_for="Accept", exclude_zeros=True, simple=True):
    """
    Simple mode:
        Returns the average number of items before `look_for`
    Non-simple mode:
        Returns a dictionary with the mean, median, and max number of items
        before `look_for`
    --
    my_list: a list of values
    look_for: an item in the list which constitutes the "event"
        Example: "accept" from a list of "accept" and "reject"
    exclude_zeros: exclude metrics for when `look_for` occurs back to back
    simple: operate in simple mode or non-simple mode
    """
    # Instantiate a counter list
    my_counter = []
    n = 0
    # Loop through the list
    for x in my_list:
        # If a match, add n to the list and reset
        if x == look_for:
            my_counter.append(n)
            n = 0
        # Otherwise, keep counting
        else:
            n += 1
    # Sometimes you might want to append the final n at conclusion of the loop
    # You could do that with the following code:
    # if x != look_for:
    #     my_counter.append(n)
    # You may not want to include back-to-back events
    if exclude_zeros:
        my_counter = [x for x in my_counter if x > 0]
    # With no qualifying events, report NaN (np.max would raise on an empty list)
    if not my_counter:
        my_counter = [np.nan]
    # You can return a specific metric such as mean
    if simple:
        return np.mean(my_counter)
    # Or you can pass several metrics as a dictionary and convert to a series
    my_metrics = {
        "mean": np.mean(my_counter),
        "median": np.median(my_counter),
        "max": np.max(my_counter),
    }
    return my_metrics
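As a cross-check on the loop above, the same gap counts can be sketched with `itertools.groupby`, which collapses the list into alternating runs of events and non-events. This is an alternative formulation rather than the answer's method; the name `gaps_before_event` and the sample list are hypothetical. It matches the loop's behavior when back-to-back events are excluded, and it likewise ignores a trailing run that is not followed by an event.

```python
from itertools import groupby
from statistics import mean

def gaps_before_event(seq, look_for="Accept"):
    """Lengths of each run of non-events that is immediately followed by an event."""
    # Collapse the sequence into (is_event, run_length) pairs; runs alternate
    runs = [
        (is_event, len(list(group)))
        for is_event, group in groupby(seq, key=lambda item: item == look_for)
    ]
    # A non-event run that has a successor is always followed by an event run,
    # so pairing each run with the next one drops only a trailing non-event run
    return [
        length
        for (is_event, length), _next_run in zip(runs, runs[1:])
        if not is_event
    ]

sample = ["Accept", "Reject", "Reject", "Accept", "Accept", "Reject", "Accept", "Reject"]
gaps = gaps_before_event(sample)
print(gaps, mean(gaps))
```

Working on run lengths makes it easy to inspect the raw gaps for a single list before computing any summary statistic.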
2. Apply this custom function to your df:
Use pd.Series to convert the result into multiple columns, then use merge to add them to the original df.
import pandas as pd

# Simple Mode
df["sequence_of_selection"].apply(event_metrics, simple=True)
# Non-Simple Mode
temp_df = (
    df["sequence_of_selection"]
    .apply(event_metrics, simple=False)
    .apply(pd.Series)    # Convert to its own df
    .add_prefix("rej_")  # Add a prefix to your column names
)
df.merge(temp_df, left_index=True, right_index=True)
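The non-simple flow can be sketched end to end as follows. The data is made up, and `toy_metrics` is a hypothetical, trimmed-down stand-in for `event_metrics` so that the block runs on its own. Instead of `.apply(pd.Series)`, which is often slow on large frames, this variant builds the metrics frame directly from the list of dictionaries and uses `join`, which aligns on the index much like the merge above.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "sequence_of_selection": [
        ["Reject", "Accept", "Reject", "Reject", "Accept"],
        ["Reject", "Reject", "Reject", "Accept"],
    ]
})

def toy_metrics(seq, look_for="Accept"):
    """Mean and max count of items before each `look_for`, ignoring zero gaps."""
    gaps, n = [], 0
    for item in seq:
        if item == look_for:
            if n > 0:
                gaps.append(n)
            n = 0
        else:
            n += 1
    if not gaps:
        return {"mean": float("nan"), "max": float("nan")}
    return {"mean": sum(gaps) / len(gaps), "max": max(gaps)}

# One dictionary per row, expanded into one column per key
records = df["sequence_of_selection"].apply(toy_metrics).tolist()
metrics = pd.DataFrame(records, index=df.index).add_prefix("rej_")

# Attach the new columns to the original frame by index
result = df.join(metrics)
print(result[["rej_mean", "rej_max"]])
```

Passing `index=df.index` keeps the metrics aligned with the original rows even if the DataFrame does not use a default integer index.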