Slow pandas when looping
I have a pandas dataframe and a set of ids, and want to end up with a result where, for a given id, I get the previous and next 5 rows of the dataframe as a dictionary.
To achieve this I wrote the following code, where events is a set of ids and df is a pandas dataframe.
The issue is that this code runs very slowly as the number of ids approaches 1000. Is there a way to make this code run faster without having to loop over the dataset?
Here's some sample data:
Dataframe
index event_id type timestamp
0 asd12e click 12322232
1 asj123 click 212312312
2 asd321 touch 12312323
3 asdas3 click 33332233
4 sdsaa3 touch 33211333
event_ids
["asd321"]
Given this sample data, I would like to retrieve a dictionary that contains the data for the id (asd321) and the previous and next 2 rows in the dataframe, based on the index field, in the following format:
{id: asd321}
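For reference, the sample data above can be reconstructed as a DataFrame like this (column names taken from the table; the dtype choices are assumptions):

```python
import pandas as pd

# Sample data from the question; the default RangeIndex matches the
# "index" column shown in the table.
df = pd.DataFrame({
    "event_id": ["asd12e", "asj123", "asd321", "asdas3", "sdsaa3"],
    "type": ["click", "click", "touch", "click", "touch"],
    "timestamp": [12322232, 212312312, 12312323, 33332233, 33211333],
})
event_ids = ["asd321"]
```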
def get_occurence(row, next_occurences=None, prev_occurences=None):
    # signature fixed to match the keyword arguments used by the callers;
    # a comma was also missing after row.timestamp
    return {
        "type": row.type,
        "timestamp": row.timestamp,
        "next_occurences": next_occurences,
        "prev_occurences": prev_occurences,
    }
def get_occurences(events, df, N):
    occurences = {}
    df = df[df.event_id.isin(events)]
    for idx, row in df.iterrows():
        prev_occurences = get_next_or_prev_occurences(event_id=row.event_id,
                                                      df=df,
                                                      N=N,
                                                      next=False)
        next_occurences = get_next_or_prev_occurences(event_id=row.event_id,
                                                      df=df,
                                                      N=N,
                                                      next=True)
        occurence = get_occurence(
            row=row,
            prev_occurences=prev_occurences,
            next_occurences=next_occurences)
        occurences[row.event_id] = occurence
    return occurences
def get_next_or_prev_occurences(event_id, df, N, next):
    current_index = df[df.event_id == event_id].index[-1]
    if next:
        # rows +1 through +N (the original stop of current_index+N was one short)
        new_df = df.iloc[current_index+1:current_index+N+1]
    else:
        # rows -N through -1, clamped at 0 so iloc doesn't wrap around
        new_df = df.iloc[max(current_index-N, 0):current_index]
    occurences = []
    for idx, row in new_df.iterrows():
        occurence = get_occurence(row)
        occurences.append(occurence)
    return occurences
What about this:
# get indexes of the given events
matching_indexes = pd.Series(df[df["event_id"].isin(event_ids)].index)
# build extended index list containing the neighbors you are interested in:
# the previous and next five (offsets -5 through +5, including the match itself)
indexes = pd.concat([matching_indexes + k
                     for k in range(-5, 6)]).sort_values().unique()
# avoid overflows
indexes_rest = indexes[(indexes <= df.index.max()) & (indexes >= df.index.min())]
# restrict your dataframe accordingly
df.iloc[indexes_rest, :]
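To illustrate, here is a sketch of this approach run end to end on the question's sample data. The window size `N` and the reconstructed `df` are assumptions added for the example (N = 2 so the window fits the five-row frame; the question uses 5 on a larger one):

```python
import pandas as pd

# Reconstructed sample data from the question
df = pd.DataFrame({
    "event_id": ["asd12e", "asj123", "asd321", "asdas3", "sdsaa3"],
    "type": ["click", "click", "touch", "click", "touch"],
    "timestamp": [12322232, 212312312, 12312323, 33332233, 33211333],
})
event_ids = ["asd321"]
N = 2  # neighbor window (assumed for this example)

# positions of the queried events
matching_indexes = pd.Series(df[df["event_id"].isin(event_ids)].index)
# positions -N..+N around each match, deduplicated
indexes = pd.concat([matching_indexes + k
                     for k in range(-N, N + 1)]).sort_values().unique()
# drop positions that fall outside the frame
indexes_rest = indexes[(indexes <= df.index.max()) & (indexes >= df.index.min())]
result = df.iloc[indexes_rest, :]
```

With asd321 at position 2 and N = 2, the result contains all five rows of the sample frame.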
Here is another variant (added later; I think this is closer to the desired output).
As before, create matching_indexes. Then:
# build a dataframe containing all required indexes per event in one row:
# offsets -5..-1 (m1..m5) and +1..+5 (p1..p5), matching the stated
# "previous and next five" (the original listed only m1)
df1 = pd.DataFrame({"orig_ind": matching_indexes,
                    **{f"m{k}": matching_indexes - k for k in range(1, 6)},
                    **{f"p{k}": matching_indexes + k for k in range(1, 6)}})
# unpivotize and avoid overflows
index_frame = pd.melt(df1, id_vars=["orig_ind"],
                      value_vars=[c for c in df1.columns if c != "orig_ind"])
index_frame = index_frame[(index_frame.value <= df.index.max()) & (index_frame.value >= df.index.min())]
# select the entries from original frame belonging to the indexes
df_new = df.iloc[index_frame.value, :].copy()
# add additional information
df_new["orig_event_id"] = df.iloc[index_frame.orig_ind, :]["event_id"].values
df_new["neighbor_type"] = index_frame["variable"].values
df_new2 = df_new[["orig_event_id", "event_id", "neighbor_type"]].copy()
# produce a dict from the above
as_dict = df_new2.pivot(index="orig_event_id", columns='neighbor_type').to_dict('index')
The result is found in as_dict.
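If you need the full neighbor rows (type and timestamp) rather than just the neighboring event_ids, one way to finish the second variant is to group df_new by orig_event_id and emit records per event. This is a sketch of my own, not part of the original answer, and it again assumes the reconstructed sample data and a window of 2:

```python
import pandas as pd

# Reconstructed sample data from the question
df = pd.DataFrame({
    "event_id": ["asd12e", "asj123", "asd321", "asdas3", "sdsaa3"],
    "type": ["click", "click", "touch", "click", "touch"],
    "timestamp": [12322232, 212312312, 12312323, 33332233, 33211333],
})
event_ids = ["asd321"]
N = 2  # assumed window size

matching_indexes = pd.Series(df[df["event_id"].isin(event_ids)].index)
# one column per offset: m1..mN before, p1..pN after
offsets = {f"m{k}": matching_indexes - k for k in range(1, N + 1)}
offsets.update({f"p{k}": matching_indexes + k for k in range(1, N + 1)})
df1 = pd.DataFrame({"orig_ind": matching_indexes, **offsets})
# unpivot and drop out-of-range positions
index_frame = pd.melt(df1, id_vars=["orig_ind"])
index_frame = index_frame[(index_frame.value <= df.index.max())
                          & (index_frame.value >= df.index.min())]
# pull the neighbor rows and annotate them with the originating event
df_new = df.iloc[index_frame.value, :].copy()
df_new["orig_event_id"] = df.iloc[index_frame.orig_ind, :]["event_id"].values
df_new["neighbor_type"] = index_frame["variable"].values

# one dict entry per queried event, with full neighbor rows as records
as_dict = {
    eid: grp[["event_id", "type", "timestamp", "neighbor_type"]]
         .to_dict("records")
    for eid, grp in df_new.groupby("orig_event_id")
}
```

Unlike the pivot version, this keeps every column of the neighbor rows instead of only their event_ids.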