
Get rows based on distinct values from column 1, while keeping as many distinct values from column 2 as possible

I have a matched dataset with treated and control columns. I need to pick one control for each treated observation, essentially a one-to-one match with replacement, except that I'd like to keep as many unique controls as possible. That is, I'd like to use the full information in the control group and avoid giving too much weight to any single control observation.

For a specific example, after the match, I have the dataframe below with duplicated values in both treated and control columns:

>>>df
treated control
A    a
A    b
B    a
B    b
C    a
C    b
D    a
D    d

I would like to select one row per unique value in treated, while at the same time keeping as many unique values from control as possible. That is, I'd like to get either

>>>df
treated control
A    a
B    b
C    a
D    d

or

>>>df
treated control
A    b
B    a
C    a
D    d

or any output that keeps all unique values of the control column in this example (and maintains the correct pairs). That is, I don't want to get, for example

>>>df
treated control
A    a
B    a
C    a
D    a

Any help is appreciated.
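To make the requirement concrete, here is a minimal sketch of a check for a candidate output: every selected row must be a pair that exists in the original data, every treated value must appear exactly once, and the number of distinct controls is the score to maximize. The helper name `summarize_selection` and the variable names `pairs`, `good`, and `bad` are made up for illustration; they are not part of pandas.

```python
import pandas as pd

# The matched pairs from the example above.
pairs = pd.DataFrame({
    "treated": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "control": ["a", "b", "a", "b", "a", "b", "a", "d"],
})


def summarize_selection(pairs, selection):
    """Return (is_valid, n_distinct_controls) for a candidate selection.

    Hypothetical helper for illustration only.
    """
    # Every selected row must be a pair that exists in the original data.
    merged = selection.merge(
        pairs.drop_duplicates(),
        on=["treated", "control"],
        how="left",
        indicator=True,
    )
    all_pairs_exist = (merged["_merge"] == "both").all()

    # Each treated value must appear exactly once, and none may be missing.
    one_row_each = (
        selection["treated"].is_unique
        and set(selection["treated"]) == set(pairs["treated"])
    )

    is_valid = bool(all_pairs_exist and one_row_each)
    return is_valid, selection["control"].nunique()


# The first acceptable output and the unwanted output from above.
good = pd.DataFrame({"treated": list("ABCD"), "control": ["a", "b", "a", "d"]})
bad = pd.DataFrame({"treated": list("ABCD"), "control": ["a", "a", "a", "a"]})

print(summarize_selection(pairs, good))  # (True, 3)
print(summarize_selection(pairs, bad))   # (True, 1)
```

Both selections are valid in the sense of using real pairs; the difference is that the first keeps three distinct controls and the second keeps only one.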

Try pd.unique + pd.Series + ffill:

import pandas as pd

df = pd.DataFrame({
    'col1': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'C', 5: 'C'},
    'col2': {0: 'a', 1: 'b', 2: 'a', 3: 'b', 4: 'a', 5: 'b'}
})

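new_df = (
    # Deduplicate each column independently; shorter columns are padded with NaN
    df.apply(lambda x: pd.Series(pd.unique(x)))
        .ffill()  # forward-fill the NaNs with the last value seen in that column
)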

print(new_df)

new_df:

  col1 col2
0    A    a
1    B    b
2    C    b
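One caveat with the snippet above: pd.unique is applied to each column separately, so the rows it produces are not tied to pairs in the original data. On the question's full data (including the D rows) it would pair C with d, which is not an original pair. As an alternative, the problem can be treated as bipartite matching: give as many treated values as possible their own exclusive control, then let the remaining treated values reuse any of their valid controls. Below is a rough sketch of that idea using a simple augmenting-path search. This is a different technique from the answer above, and the function name `pick_controls` and its parameters are assumptions for illustration.

```python
import pandas as pd

# The matched pairs from the question.
pairs = pd.DataFrame({
    "treated": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "control": ["a", "b", "a", "b", "a", "b", "a", "d"],
})


def pick_controls(pairs, treated_col="treated", control_col="control"):
    """Pick one control per treated value, using as many distinct controls
    as possible. Hypothetical helper, not a pandas API."""
    # Valid controls for each treated value, in order of appearance.
    candidates = {
        t: list(pd.unique(g[control_col]))
        for t, g in pairs.groupby(treated_col, sort=False)
    }
    owner = {}  # control -> treated value that holds it exclusively

    def try_assign(t, seen):
        # Augmenting-path step: give t a free control, or move the current
        # owner of one of t's candidates on to a different control.
        for c in candidates[t]:
            if c in seen:
                continue
            seen.add(c)
            if c not in owner or try_assign(owner[c], seen):
                owner[c] = t
                return True
        return False

    for t in candidates:
        try_assign(t, set())

    chosen = {t: c for c, t in owner.items()}
    # Treated values without an exclusive control reuse their first candidate.
    for t, cands in candidates.items():
        chosen.setdefault(t, cands[0])

    return pd.DataFrame({
        treated_col: list(candidates),
        control_col: [chosen[t] for t in candidates],
    })


result = pick_controls(pairs)
print(result)
```

For the example data this keeps all three distinct controls (a, b, d) and only returns pairs that exist in the input. The search is recursive, so for very large inputs an iterative version or a dedicated graph library may be a better fit.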
