How to iterate over previous rows in a dataframe
I have three columns: id (non-unique id), X (categories) and Y (categories). (I don't have a dataset to share yet. I'll try to replicate what I have using a smaller dataset and edit as soon as possible.)
I ran a for loop on a very small subset, and based on those results it might take over 4 hours to run this code. I'm looking for a faster way to do this task using pandas (maybe using iterrows, like iterating over previous rows within apply).
For each row I check:
if sum(check_X & check_Y & check_id) > 0:
    append 1 to the array
else:
    append 0
You are probably looking for duplicated:
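import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})
# Among rows whose (X, Y) pair was already seen, mark those whose id is not repeated
df['dup'] = ~df[df.duplicated(['X', 'Y'])].duplicated('id', keep=False).loc[lambda x: ~x]
# Rows that were not marked come back as NaN; turn them into 0
df['dup'] = df['dup'].fillna(False).astype(int)
print(df)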
# Output
   id  X  Y  dup
0   0  1  2    0
1   0  1  2    0
2   0  2  2    0
3   1  1  2    1
4   0  1  2    0
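The one-liner above is fairly dense, so here is a rough step-by-step sketch of what it appears to do on the same toy data. The intermediate names (seen_xy, repeated_id, unique_id_rows) are made up for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})

# Step 1: True where the (X, Y) pair already appeared in an earlier row
seen_xy = df.duplicated(['X', 'Y'])

# Step 2: within those rows only, keep=False flags every id that occurs more than once
repeated_id = df[seen_xy].duplicated('id', keep=False)

# Step 3: keep just the rows whose id is unique inside that subset
unique_id_rows = repeated_id.loc[lambda s: ~s]

print(seen_xy.tolist())
print(repeated_id.tolist())
print(list(unique_id_rows.index))
```

Negating unique_id_rows gives True only at the surviving index, and assigning that back to df leaves NaN everywhere else, which is why the fillna(False) step is needed.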
EDIT: the answer from @Corralien using duplicated() will likely be much faster and is the best answer for this specific problem. However, apply is more flexible if you have different things to check.
You could do it with iterrows() or apply(). As far as I know, apply() is faster:
# Values seen in earlier rows
check_id, check_x, check_y = set(), set(), set()

def apply_func(row):
    global check_id, check_x, check_y
    # New id, but X and Y have both been seen before
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        row['duplicate'] = 1
    else:
        row['duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
    return row

# apply returns a new DataFrame, so assign the result back
df = df.apply(apply_func, axis=1)
With iterrows():
check_id, check_x, check_y = set(), set(), set()
for i, row in df.iterrows():
    # New id, but X and Y have both been seen before
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        df.loc[i, 'duplicate'] = 1
    else:
        df.loc[i, 'duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
This is essentially like @Corralien's answer. What you want can be achieved using duplicated, because it returns a boolean Series indicating whether each value has already occurred in the preceding values, which is precisely "whether the current X matches any of the previous Xs". The condition for "id" is then just its negation. Since you want 1 in each row if all of them evaluate to True and 0 otherwise, you can combine them with the & operator and convert the resulting boolean Series to dtype int:
check_X = df['X'].duplicated()
check_Y = df['Y'].duplicated()
check_id = ~df['id'].duplicated()
out = (check_X & check_Y & check_id).astype(int)
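For what it's worth, here is a small runnable sketch of those four lines on the toy data from the duplicated answer above (that data is an assumption, not the asker's real dataset). Note that X and Y are checked independently here rather than as a pair, so on other data the result may differ from df.duplicated(['X', 'Y']).

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})

check_X = df['X'].duplicated()      # X already seen in an earlier row
check_Y = df['Y'].duplicated()      # Y already seen in an earlier row
check_id = ~df['id'].duplicated()   # id NOT seen in an earlier row

out = (check_X & check_Y & check_id).astype(int)
print(out.tolist())  # [0, 0, 0, 1, 0]
```

On this data only row 3 gets a 1: its id is new while its X and Y values have both appeared before.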