How to iterate over previous rows in a dataframe

Question

I have three columns: id (non-unique id), X (categories) and Y (categories). (I don't have a dataset to share yet. I'll try to replicate what I have using a smaller dataset and edit as soon as possible.)

I ran a for loop on a very small subset, and based on those results it might take over 4 hours to run this code on the full data. I'm looking for a faster way to do this task using pandas (maybe using iterrows, or something like iterating over previous rows within apply).

For each row I check:

whether the current X matches any of the previous Xs (check_X = X[:row] == X[row])
whether the current Y matches any of the previous Ys (check_Y = Y[:row] == Y[row])
whether the current id does not match any of the previous ids (check_id = id[:row] != id[row])

If sum(check_X & check_Y & check_id) > 0, I append 1 to the array; otherwise I append 0.
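For reference, the check described above can be sketched as a plain (and slow) loop. This is only one reading of the description: a row is flagged when at least one earlier row has the same X, the same Y and a different id. The function name flag_rows and the sample values are made up for illustration, since no dataset was shared.

```python
import pandas as pd

def flag_rows(df):
    """Return 1 for rows where some earlier row has the same X and Y but a different id."""
    flags = []
    for row in range(len(df)):
        prev = df.iloc[:row]   # all rows before the current one
        cur = df.iloc[row]     # the current row
        check_X = prev['X'] == cur['X']
        check_Y = prev['Y'] == cur['Y']
        check_id = prev['id'] != cur['id']
        # any() is equivalent to sum(...) > 0 on a boolean Series
        flags.append(int((check_X & check_Y & check_id).any()))
    return flags

# Hypothetical sample data, not the asker's dataset
sample = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                       'X': [1, 1, 2, 1, 1],
                       'Y': [2, 2, 2, 2, 2]})
flags = flag_rows(sample)
print(flags)
```

Each iteration compares the current row against all earlier rows, so the total work grows quadratically with the number of rows, which is consistent with the long running time mentioned above.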

Answer 1

You are probably looking for duplicated:

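import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})

# Among rows whose (X, Y) pair already appeared earlier, mark those whose id occurs only once in that subset
df['dup'] = ~df[df.duplicated(['X', 'Y'])].duplicated('id', keep=False).loc[lambda x: ~x]
# Rows not selected above are NaN after index alignment; treat them as 0
df['dup'] = df['dup'].fillna(False).astype(int)
print(df)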

# Output
   id  X  Y  dup
0   0  1  2    0
1   0  1  2    0
2   0  2  2    0
3   1  1  2    1
4   0  1  2    0
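The one-liner above packs several steps into a single expression. A step-by-step sketch of what appears to be the same logic follows; the intermediate names repeated_xy, shared_id and dup are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})

# Step 1: keep only rows whose (X, Y) pair already appeared in an earlier row
repeated_xy = df[df.duplicated(['X', 'Y'])]

# Step 2: within that subset, flag every row whose id occurs more than once
shared_id = repeated_xy.duplicated('id', keep=False)

# Step 3: rows whose id is unique within the subset get 1, all other rows get 0
dup = pd.Series(0, index=df.index)
dup.loc[shared_id[~shared_id].index] = 1

print(dup.tolist())
```

Writing the steps out makes it easier to adjust one condition, for example the keep argument in step 2, without untangling the chained expression.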

Answer 2

EDIT: the answer from @Corralien using duplicated() will likely be much faster and is the best answer for this specific problem. However, apply is more flexible if you have different things to check.

You could do it with iterrows() or apply(). As far as I know, apply() is faster:

# Values seen in earlier rows
check_id, check_x, check_y = set(), set(), set()

def apply_func(row):
    # Flag the row if its id is new but its X and Y values were both seen before
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        row['duplicate'] = 1
    else:
        row['duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
    return row

# apply returns a new DataFrame, so assign the result
df = df.apply(apply_func, axis=1)

With iterrows():

check_id, check_x, check_y = set(), set(), set()
for i, row in df.iterrows():
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        df.loc[i, 'duplicate'] = 1
    else:
        df.loc[i, 'duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])

Answer 3

This is essentially the same as @Corralien's answer. What you want can be achieved using duplicated, because it returns a boolean Series indicating whether each value has already occurred among the preceding values, which is precisely "whether the current X matches any of previous Xs". The condition for "id" is just the negation of it. Since you want 1 in each row where all of them evaluate to True and 0 otherwise, you can combine them with the & operator and convert the resulting boolean Series to dtype int:

check_X = df['X'].duplicated()
check_Y = df['Y'].duplicated()
check_id = ~df['id'].duplicated()
out = (check_X & check_Y & check_id).astype(int)
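Note that the three duplicated calls above test each column independently. If the requirement is instead that one and the same earlier row matches on X and Y while having a different id (a possible reading of the question), a vectorized sketch with groupby(...).cumcount() could look like the following. The names prior_same_xy, prior_same_xy_id and out_rowwise are illustrative, and the sample values are the ones from @Corralien's answer.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})

# Number of earlier rows with the same (X, Y) pair
prior_same_xy = df.groupby(['X', 'Y']).cumcount()

# Number of earlier rows with the same (X, Y) pair AND the same id
prior_same_xy_id = df.groupby(['X', 'Y', 'id']).cumcount()

# The difference counts earlier rows with the same (X, Y) but a different id
out_rowwise = ((prior_same_xy - prior_same_xy_id) > 0).astype(int)

print(out_rowwise.tolist())
```

On this sample the last row is flagged as well, because row 3 is an earlier row with the same X and Y and a different id. Which of the two readings is wanted depends on the actual requirement.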

Answer 1
1 2022-01-28 23:10:45

Answer 2
0 2022-01-28 23:28:55

Answer 3
0 2022-01-29 04:55:28