
How to merge multiple columns having same column name in one dataframe with rules python pandas

I have a CSV file in which several columns share the same name. I want to merge their values and keep only one column per name, using certain rules to choose between two conflicting values. If the values are the same, just keep one. This is what my CSV looks like. (PS: the headings in my CSV do not have an underscore suffix. I only use the underscore here so that the dataframe can be created.)

import numpy as np
import pandas as pd

df_data_hworkforce = pd.DataFrame({
    "Country": ['Afghanistan', 'Bahrain', 'Djibouti', 'Egypt', 'Iran'],
    "2019": [2.9, 28, 2.1, 8.5, 15.2],
    "2019_1": [np.nan, 27.9, np.nan, np.nan, np.nan],
    "2018": [2.9, 27.3, 1.1, 6.5, 5.2],
    "2018_1": [2.9, 27, 2.1, 6, np.nan],
    "2017": [3, 26, 1.9, np.nan, np.nan],
})

It was not possible to create a dataframe with duplicate column names directly from a dict, so I rename the columns afterwards to produce the example:

df_data_hworkforce.rename(columns = {'2019_1':'2019','2018_1':'2018'},inplace = True)

This is what the dataframe looks like: [image]

I joined the columns with the same name in the following way:

def sjoin(x): return ';'.join(x[x.notnull()].astype(str))
df_data_hworkforce.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))

This combines the values of the two columns and gives the following result: [image]

However, my desired output keeps only one value when both columns hold the same value and, when the two values differ by less than 0.5, keeps the one that has not been rounded off. Below is my desired output: [image]

This is a fairly unusual data transformation, and it cannot be implemented very efficiently.

However, one approach you can take is:

  1. group each pair of same-named columns with groupby
  2. aggregate each pair according to your desired threshold and transformation
  3. update the original data with the result
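def combine(df, threshold=.5):
    # Relabel the two same-named columns as 0 and 1 so they can be told apart.
    df = df.set_axis([0, 1], axis=1)
    left, right = df.iloc[:, 0], df.iloc[:, 1]
    # Fill each side's gaps from the other, so a lone value survives unchanged.
    left = left.fillna(right)
    right = right.fillna(left)
    
    diffs = (left - right).abs()
    # Label (0 or 1) of the first column whose value differs from its rounded form.
    non_rounded_values = df.round().ne(df).idxmax(axis=1)
    # Fallback for conflicting values: both values joined with ';'.
    cat_values = left.astype(str).str.cat(right.astype(str), sep=';')
    
    # Within the threshold, pick the non-rounded column; otherwise pick 2 (the joined string).
    choices = non_rounded_values.where(diffs < threshold, 2)
    return np.choose(choices, [left, right, cat_values])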


import pandas as pd
import numpy as np
df = pd.DataFrame(
    data=zip(*[
        ['Afghanistan','Bahrain','Djibouti','Egypt','Iran'],
        [2.9,28,2.1,8.5,15.2],
        [np.nan,27.9,np.nan,np.nan,np.nan ],
        [2.9,27.3,1.1,6.5,5.2],
        [2.9,27,2.1,6,np.nan],
        [3,26,1.9,np.nan,np.nan],
    ]),
    columns=['Country', '2019', '2019', '2018', '2018', '2017']
)


to_update = (
    df.loc[:, df.columns.duplicated(keep=False)]
    .groupby(level=0, axis=1).agg(combine, threshold=.5)
)

out = df.loc[:, ~df.columns.duplicated()].copy()
out.update(to_update)

print(out)
       Country  2019     2018  2017
0  Afghanistan   2.9      2.9   3.0
1      Bahrain  27.9     27.3  26.0
2     Djibouti   2.1  1.1;2.1   1.9
3        Egypt   8.5  6.5;6.0   NaN
4         Iran  15.2      5.2   NaN
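
To make the row-by-row rule inside combine easier to follow, here is a scalar sketch of it in plain Python. The name pick is hypothetical and is not part of the code above. Returning NaN when both values are missing is a choice made for this sketch only.

```python
import math


def pick(a, b, threshold=0.5):
    """Resolve one pair of values roughly the way combine does per row (sketch)."""
    a_missing = a is None or math.isnan(a)
    b_missing = b is None or math.isnan(b)
    if a_missing and b_missing:
        return math.nan  # assumption of this sketch, not taken from the answer
    if a_missing:
        return b
    if b_missing:
        return a
    # Values too far apart: keep both, joined with ';'.
    if abs(a - b) >= threshold:
        return f"{float(a)};{float(b)}"
    # Values close enough: prefer the one that is not a whole number.
    return a if a != round(a) else b


print(pick(28, 27.9))          # Bahrain, 2019
print(pick(1.1, 2.1))          # Djibouti, 2018
print(pick(5.2, float("nan"))) # Iran, 2018
```

Note that the comparison is strict, so a difference of exactly 0.5 (Egypt, 2018) counts as a conflict and both values are kept.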

Update: simplified the code since the column names are exact matches.
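
Recent pandas releases deprecate passing axis=1 to groupby, so if the code above emits a warning or fails on your version, one possible alternative is to loop over the duplicated column names instead. This is only a sketch: combine_pair is a hypothetical helper that applies the same rule with Series.where instead of np.choose, and the two-row frame is a cut-down copy of the sample data.

```python
import numpy as np
import pandas as pd


def combine_pair(pair, threshold=0.5):
    """Merge a two-column frame of same-named columns into one Series (sketch)."""
    left, right = pair.iloc[:, 0], pair.iloc[:, 1]
    left_f = left.fillna(right)
    right_f = right.fillna(left)

    close = (left_f - right_f).abs() < threshold
    # True where the left value is not a whole number (NaN also compares as True).
    left_is_fractional = left.round().ne(left)
    preferred = left_f.where(left_is_fractional, right_f)
    joined = left_f.astype(str) + ";" + right_f.astype(str)

    # Keep the preferred value when the pair is close, else the joined string.
    return preferred.astype(object).where(close, joined)


df = pd.DataFrame(
    [
        ["Bahrain", 28.0, 27.9, 27.3, 27.0, 26.0],
        ["Djibouti", 2.1, np.nan, 1.1, 2.1, 1.9],
    ],
    columns=["Country", "2019", "2019", "2018", "2018", "2017"],
)

# Start from one copy of every column, then overwrite the duplicated ones.
out = df.loc[:, ~df.columns.duplicated()].copy()
for name in df.columns[df.columns.duplicated()].unique():
    out[name] = combine_pair(df.loc[:, name])

print(out)
```

Selecting a duplicated label with df.loc[:, name] returns a DataFrame holding every column with that name, which is what lets the loop stand in for the groupby step.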
