How to merge multiple columns with the same name in one dataframe according to rules (Python, pandas)

Question

I have a CSV file in which several columns share the same name. I want to merge their values so that each column name appears only once in the output, using certain rules to choose between two conflicting values. If the values are the same, just one of them should be kept. This is what my CSV looks like. (PS: the headings in my CSV are not suffixed with an underscore. I only use the underscore here so that the dataframe can be created.)

import numpy as np
import pandas as pd

df_data_hworkforce = pd.DataFrame({
    "Country": ['Afghanistan', 'Bahrain', 'Djibouti', 'Egypt', 'Iran'],
    "2019": [2.9, 28, 2.1, 8.5, 15.2],
    "2019_1": [np.nan, 27.9, np.nan, np.nan, np.nan],
    "2018": [2.9, 27.3, 1.1, 6.5, 5.2],
    "2018_1": [2.9, 27, 2.1, 6, np.nan],
    "2017": [3, 26, 1.9, np.nan, np.nan],
})

A dataframe with duplicate column names cannot be created directly from a dict, so I rename the columns afterwards to reproduce the example.

df_data_hworkforce.rename(columns = {'2019_1':'2019','2018_1':'2018'},inplace = True)

I join the columns that share a name in the following way:

def sjoin(x): return ';'.join(x[x.notnull()].astype(str))
df_data_hworkforce.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))

This combines the values of the two columns into a single string, separated by a semicolon.

However, my desired output keeps only one value when both columns hold the same value and, when the two values differ by less than 0.5, keeps the value that has not been rounded off.
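
Stated for a single pair of values, the rule can be sketched roughly as follows. This is only an illustration: the helper names `is_missing` and `pick_value` are hypothetical, taking the other value when one side is missing is an assumption based on the sample data, and returning both values when they differ by 0.5 or more is also an assumption, since that case is not specified above.

```python
import math


def is_missing(x):
    # Treat None and NaN as "no value".
    return x is None or math.isnan(x)


def pick_value(a, b, threshold=0.5):
    # Assumption: if one side is missing, the other side is used.
    if is_missing(a):
        return b
    if is_missing(b):
        return a
    # Same value in both columns: keep one of them.
    if a == b:
        return a
    # Close values: prefer the one that is not a whole number.
    if abs(a - b) < threshold:
        return a if a != round(a) else b
    # Assumption: larger gaps are left unresolved and both values are returned.
    return (a, b)
```

For example, `pick_value(28, 27.9)` keeps the unrounded 27.9, while `pick_value(1.1, 2.1)` hands back both values because they are too far apart.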

Answer 1

This is a rather peculiar data transformation, and it cannot be implemented very efficiently.

However, one approach you can take is:

groupby each pairing of data values
aggregate (agg) according to your desired threshold and transformation
update the original data

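import pandas as pd
import numpy as np


def combine(df, threshold=.5):
    # Relabel the two same-named columns as 0 and 1 so they can be told apart.
    df = df.set_axis([0, 1], axis=1)
    left, right = df.iloc[:, 0], df.iloc[:, 1]
    # Fill a missing value on one side with the value from the other side.
    left = left.fillna(right)
    right = right.fillna(left)

    diffs = (left - right).abs()
    # Label (0 or 1) of the first column whose value is not a whole number;
    # NaN also compares as "not equal", and ties fall back to column 0.
    non_rounded_values = df.round().ne(df).idxmax(axis=1)
    # Fallback for conflicting values: keep both, separated by ';'.
    cat_values = left.astype(str).str.cat(right.astype(str), sep=';')

    # Use index 2 (the joined string) where the gap is not below the threshold.
    choices = non_rounded_values.where(diffs < threshold, 2)
    return np.choose(choices, [left, right, cat_values])

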
df = pd.DataFrame(
    data=zip(*[
        ['Afghanistan','Bahrain','Djibouti','Egypt','Iran'],
        [2.9,28,2.1,8.5,15.2],
        [np.nan,27.9,np.nan,np.nan,np.nan ],
        [2.9,27.3,1.1,6.5,5.2],
        [2.9,27,2.1,6,np.nan],
        [3,26,1.9,np.nan,np.nan],
    ]),
    columns=['Country', '2019', '2019', '2018', '2018', '2017']
)


to_update = (
    df.loc[:, df.columns.duplicated(keep=False)]
    .groupby(level=0, axis=1).agg(combine, threshold=.5)
)

out = df.loc[:, ~df.columns.duplicated()].copy()
out.update(to_update)

print(out)
       Country  2019     2018  2017
0  Afghanistan   2.9      2.9   3.0
1      Bahrain  27.9     27.3  26.0
2     Djibouti   2.1  1.1;2.1   1.9
3        Egypt   8.5  6.5;6.0   NaN
4         Iran  15.2      5.2   NaN

Update: simplified the code since the column names are exact matches.
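
A note for newer pandas versions: groupby(..., axis=1) is deprecated in recent releases, so the code above may emit a warning or stop working there. A minimal sketch of the same idea that loops over the column names instead is shown below. The names `combine_pair` and `merge_duplicate_columns` are hypothetical, and the sketch assumes that at most two columns share a name, as in the example. Unlike the code above, it leaves a cell as NaN when both values are missing.

```python
import numpy as np
import pandas as pd


def combine_pair(left, right, threshold=0.5):
    # Fill a missing value on one side with the value from the other side.
    l = left.fillna(right)
    r = right.fillna(left)
    close = (l - r).abs() < threshold
    both_missing = l.isna()
    # Take the right value only when the left one is whole and the right is not.
    prefer_right = l.eq(l.round()) & r.ne(r.round())
    picked = l.where(~prefer_right, r)
    # Fallback for conflicting values: keep both, separated by ';'.
    joined = l.astype(str) + ";" + r.astype(str)
    # Cast to object first so floats and strings can share one column.
    return picked.astype(object).where(close | both_missing, joined)


def merge_duplicate_columns(df, threshold=0.5):
    out = {}
    for name in df.columns.unique():
        block = df.loc[:, df.columns == name]
        if block.shape[1] == 1:
            out[name] = block.iloc[:, 0]
        else:
            # Assumption: at most two columns share a name.
            out[name] = combine_pair(block.iloc[:, 0], block.iloc[:, 1], threshold)
    return pd.DataFrame(out)


df = pd.DataFrame(
    data=list(zip(
        ['Afghanistan', 'Bahrain', 'Djibouti', 'Egypt', 'Iran'],
        [2.9, 28.0, 2.1, 8.5, 15.2],
        [np.nan, 27.9, np.nan, np.nan, np.nan],
        [2.9, 27.3, 1.1, 6.5, 5.2],
        [2.9, 27.0, 2.1, 6.0, np.nan],
        [3.0, 26.0, 1.9, np.nan, np.nan],
    )),
    columns=['Country', '2019', '2019', '2018', '2018', '2017'],
)
out = merge_duplicate_columns(df)
print(out)
```

The merged columns end up with object dtype because they can hold both numbers and the ';'-joined strings. If you only ever expect numeric results, you may prefer to raise or flag the conflicting rows rather than join them.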
