I have a CSV file in which several columns share the same name. I want to merge their values and keep one column per name, with certain rules for choosing between two conflicting values; if the values are the same, just select one. This is what my CSV looks like. (PS: the headings in my CSV are not suffixed with an underscore. I have used the underscore only for the sake of creating the dataframe.)
df_data_hworkforce = pd.DataFrame({"Country": ['Afghanistan','Bahrain','Djibouti','Egypt','Iran'],
"2019": [2.9,28,2.1,8.5,15.2],
"2019_1": [np.nan,27.9,np.nan,np.nan,np.nan ],
"2018": [2.9,27.3,1.1,6.5,5.2],
"2018_1": [2.9,27,2.1,6,np.nan],
"2017": [3,26,1.9,np.nan,np.nan],
})
Creating a dataframe with duplicate column names directly was not possible, so I rename the columns afterwards to present an example.
df_data_hworkforce.rename(columns = {'2019_1':'2019','2018_1':'2018'},inplace = True)
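A side note on loading the real file: by default `pd.read_csv` renames repeated headings by appending `.1`, `.2`, and so on, so the duplicates may not arrive as exact matches. The sketch below shows that behaviour and one possible way to restore the shared names. The CSV text is a made-up stand-in for the actual file, and stripping the suffix with a regular expression assumes that no genuine heading ends in a dot followed by digits.

```python
import io
import pandas as pd

# Hypothetical CSV text with repeated headings, standing in for the real file.
csv_text = (
    "Country,2019,2019,2018,2018,2017\n"
    "Bahrain,28,27.9,27.3,27,26\n"
    "Egypt,8.5,,6.5,6,\n"
)

df = pd.read_csv(io.StringIO(csv_text))
mangled = list(df.columns)  # repeated headings get '.1', '.2', ... appended

# Strip the numeric suffix so the repeated names match exactly again.
# Caution: this would also alter a genuine heading that ends in '.<digits>'.
df.columns = df.columns.str.replace(r"\.\d+$", "", regex=True)

print(mangled)
print(list(df.columns))
```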
This is what the dataframe looks like.
I joined the columns with the same name in the following way:
def sjoin(x): return ';'.join(x[x.notnull()].astype(str))
df_data_hworkforce.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))
This combines the values of the two columns and gives the following results.
However, my desired output is to select only one value when the data is the same in both columns and, if they differ by less than 0.5, to select the value that is not rounded off. Below is my desired output.
This is a rather peculiar data transformation, and it cannot be implemented very efficiently.
However, one approach you can take is:
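1. groupby each pairing of data values
2. aggregate according to your desired threshold & transformation

def combine(df, threshold=.5):
    # Relabel the two same-named columns positionally as 0 and 1.
    df = df.set_axis([0, 1], axis=1)
    left, right = df.iloc[:, 0], df.iloc[:, 1]
    # Fill each side's gaps from the other so a lone value survives.
    left = left.fillna(right)
    right = right.fillna(left)
    diffs = (left - right).abs()
    # Label (0 or 1) of the first column whose value differs from its rounded form.
    non_rounded_values = df.round().ne(df).idxmax(axis=1)
    cat_values = left.astype(str).str.cat(right.astype(str), sep=';')
    # Index 2 selects the ';'-joined string when the gap reaches the threshold.
    choices = non_rounded_values.where(diffs < threshold, 2)
    return np.choose(choices, [left, right, cat_values])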
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data=zip(*[
        ['Afghanistan', 'Bahrain', 'Djibouti', 'Egypt', 'Iran'],
        [2.9, 28, 2.1, 8.5, 15.2],
        [np.nan, 27.9, np.nan, np.nan, np.nan],
        [2.9, 27.3, 1.1, 6.5, 5.2],
        [2.9, 27, 2.1, 6, np.nan],
        [3, 26, 1.9, np.nan, np.nan],
    ]),
    columns=['Country', '2019', '2019', '2018', '2018', '2017']
)
to_update = (
    df.loc[:, df.columns.duplicated(keep=False)]
    .groupby(level=0, axis=1).agg(combine, threshold=.5)
)
out = df.loc[:, ~df.columns.duplicated()].copy()
out.update(to_update)
print(out)
       Country  2019     2018  2017
0  Afghanistan   2.9      2.9   3.0
1      Bahrain  27.9     27.3  26.0
2     Djibouti   2.1  1.1;2.1   1.9
3        Egypt   8.5  6.5;6.0   NaN
4         Iran  15.2      5.2   NaN
Update: simplified the code since the column names are exact matches.
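Grouping along the columns with `groupby(..., axis=1)` is deprecated in recent pandas releases, so the call above may warn or fail depending on the installed version. As a hedged alternative, the same rule can be sketched with a plain loop over the duplicated names. The helper `combine_pair` is a hypothetical name introduced here, not part of the code above, and it deviates in one respect: a pair that is missing on both sides stays NaN instead of becoming a joined string.

```python
import numpy as np
import pandas as pd


def combine_pair(left, right, threshold=0.5):
    """Merge two same-named numeric columns (hypothetical helper)."""
    # Fill each side's gaps from the other so a lone value survives.
    left_f = left.fillna(right)
    right_f = right.fillna(left)
    diffs = (left_f - right_f).abs()
    # Take the right value only when the left is whole and the right is not.
    use_right = left_f.round().eq(left_f) & right_f.round().ne(right_f)
    picked = left_f.mask(use_right, right_f).astype(object)
    # Pairs at or beyond the threshold are kept as a ';'-joined string.
    joined = left_f.astype(str).str.cat(right_f.astype(str), sep=";")
    return picked.mask(diffs >= threshold, joined)


df = pd.DataFrame(
    data=zip(*[
        ["Afghanistan", "Bahrain", "Djibouti", "Egypt", "Iran"],
        [2.9, 28, 2.1, 8.5, 15.2],
        [np.nan, 27.9, np.nan, np.nan, np.nan],
        [2.9, 27.3, 1.1, 6.5, 5.2],
        [2.9, 27, 2.1, 6, np.nan],
        [3, 26, 1.9, np.nan, np.nan],
    ]),
    columns=["Country", "2019", "2019", "2018", "2018", "2017"],
)

# Keep the first occurrence of every name, then overwrite the duplicated ones.
out = df.loc[:, ~df.columns.duplicated()].copy()
for name in df.columns[df.columns.duplicated()].unique():
    pair = df.loc[:, df.columns == name]  # both columns that share this name
    out[name] = combine_pair(pair.iloc[:, 0], pair.iloc[:, 1])

print(out)
```

The loop handles one pair per name, which matches the example data; a name that occurs three or more times would need the helper applied repeatedly or extended.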