
Pandas: check whether at least one of values in duplicates' rows is 1

This problem may be rather specific, but I bet many others run into it as well. So I have a DataFrame in a form like:

import pandas as pd

asd = pd.DataFrame({'Col1': ['a', 'b', 'b', 'a', 'a'], 'Col2': [0, 0, 0, 1, 1]})

The resulting table looks like this:

I -- Col1 -- Col2
1 -- a    -- 0
2 -- b    -- 0
3 -- b    -- 0
4 -- a    -- 1
5 -- a    -- 1

What I am trying to do is:
if at least one row with "a" in Col1 has a value of 1 in Col2, then put 1 in Col3 for all rows with "a"
otherwise (if not even one "a" row has a 1), put 0 in Col3 for all rows with "a"
and then repeat for every other value in Col1.

The result of the operation should look like this:

I -- Col1 -- Col2 -- Col3
1 -- a    -- 0    -- 1     because "a" has a value of 1 in the 4th and 5th lines
2 -- b    -- 0    -- 0     because all "b" rows have values of 0
3 -- b    -- 0    -- 0
4 -- a    -- 1    -- 1
5 -- a    -- 1    -- 1

Currently I am doing this:

asd['Col3'] = 0
col1_uniques = asd.drop_duplicates(subset='Col1')['Col1']
small_dataframes = []

for i in col1_uniques:
    # take the rows for this Col1 value (explicit copy avoids SettingWithCopyWarning)
    small_df = asd.loc[asd.Col1 == i].copy()
    if small_df.Col2.max() == 1:
        small_df['Col3'] = 1

    small_dataframes.append(small_df)

I then reassemble the DataFrame from these pieces.
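The reassembly step itself is not shown above; a minimal sketch of what it might look like, assuming the per-group pieces are simply concatenated and restored to the original row order:

asd = pd.concat(small_dataframes).sort_index()  # hypothetical reassembly; sort_index restores the original row order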

However, that takes too much time (I have about 80000 unique values in Col1). In fact, while I was writing this, it still hadn't finished even a quarter of the job.

Is there a better way to do it?

My understanding is that you need to repeat the process for all unique values in Col1, so you will need groupby:

# for each Col1 group, check whether any Col2 value equals 1 and broadcast that flag to every row of the group
asd['Col3'] = asd.groupby('Col1').Col2.transform(lambda x: x.eq(1).any().astype(int))

    Col1    Col2    Col3
0   a       0       1
1   b       0       0
2   b       0       0
3   a       1       1
4   a       1       1
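Since Col2 only contains 0 and 1, an equivalent form would be the built-in max aggregation, which avoids the per-group Python lambda (an alternative sketch, not part of the original answer):

asd['Col3'] = asd.groupby('Col1').Col2.transform('max')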

Option 2: a similar solution to the one above, but using map

d = asd.groupby('Col1').Col2.apply(lambda x: x.eq(1).any().astype(int)).to_dict()
asd['Col3'] = asd['Col1'].map(d)
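For the sample frame in the question, the intermediate mapping d holds one flag per unique Col1 value (a quick check, assuming the same asd as above):

print(d)  # {'a': 1, 'b': 0}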

Another method, without groupby and faster, using np.where and isin:

import numpy as np

# Col1 values that have at least one 1 in Col2
v = asd.loc[asd['Col2'].eq(1), 'Col1'].unique()
asd['Col3'] = np.where(asd['Col1'].isin(v), 1, 0)

print(asd)
  Col1  Col2  Col3
0    a     0     1
1    b     0     0
2    b     0     0
3    a     1     1
4    a     1     1
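The speed claim is easy to sanity-check; below is a minimal timing sketch on synthetic data at roughly the scale mentioned in the question (about 80000 unique keys). The data layout and repeat counts are assumptions for illustration, not measurements from the original post:

import numpy as np
import pandas as pd
from timeit import timeit

# synthetic frame: ~80000 unique keys, a handful of rows each (assumed layout)
rng = np.random.default_rng(0)
n_keys, rows_per_key = 80_000, 5
df = pd.DataFrame({
    'Col1': np.repeat(np.arange(n_keys), rows_per_key),
    'Col2': rng.integers(0, 2, n_keys * rows_per_key),
})

def via_groupby_transform():
    # the groupby + lambda approach from the first answer
    return df.groupby('Col1').Col2.transform(lambda x: x.eq(1).any().astype(int))

def via_isin():
    # the isin + np.where approach
    v = df.loc[df['Col2'].eq(1), 'Col1'].unique()
    return np.where(df['Col1'].isin(v), 1, 0)

print('groupby/transform:', timeit(via_groupby_transform, number=1))
print('isin/np.where:    ', timeit(via_isin, number=1))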

You can do this with a groupby and an if statement. First group all items by Col1:

lists = asd.groupby("Col1").agg(lambda x: tuple(x))

This gives you:

           Col2
Col1           
a     (0, 1, 1)
b        (0, 0)

You can then iterate through the unique index values in lists, masking the original DataFrame and setting Col3 to 1 if a 1 is found in lists["Col2"].

asd["Col3"] = 0
for i in lists.index:
    if 1 in lists.loc[i, "Col2"]:
        asd.loc[asd["Col1"]==i, "Col3"] = 1

This results in:

    Col1    Col2    Col3
0   a       0       1
1   b       0       0
2   b       0       0
3   a       1       1
4   a       1       1
