Merge several columns into one

I have a data frame like this (I'm adapting it, since the original is in Spanish and copy-pasting doesn't help):

     Question 1 opt. A  Question 1 opt. B  Question 1 opt. C  Question 2 opt. A    Question 2 opt. B  
 0     NaN                    NaN                 yes              NaN                 NaN
 1     NaN                    None                NaN              Uber                NaN
 2     NaN                    NaN                 NaN              Didi                NaN

So, many columns are really answers to the same question, just for different options. What I would like to do is some kind of merge like this:

    Question 1    Question 2    
 0     yes            NaN                  
 1     None           Uber                  
 2     NaN            Didi                 

That is, I want to collapse all the answers for each question into a single column (assuming the options are mutually exclusive). Tagging each answer with its option would be a plus. I believe a for loop could do it, but I'm very bad at implementing one, and using loops in Python is strongly discouraged.

You can use str.extract to pull the question part out of the column names, then group the DataFrame by that extracted label along axis=1 and aggregate with first, which keeps the first non-null value in each group:

g = df.columns.str.extract(r'(Question \d+)', expand=False)
out = df.groupby(g, axis=1).first()

Result:

  Question 1 Question 2
0        yes        NaN
1       None       Uber
2        NaN       Didi
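Grouping with axis=1 is deprecated as of pandas 2.1, so on newer versions the same idea can be expressed by transposing, grouping the rows, and transposing back. The following is a minimal sketch, not the answerer's own code; the sample frame is rebuilt from the question, and it assumes the "None" entry is a literal answer string rather than a missing value.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question.
# Assumption: 'None' is a literal answer string, not a missing value.
df = pd.DataFrame({
    'Question 1 opt. A': [np.nan, np.nan, np.nan],
    'Question 1 opt. B': [np.nan, 'None', np.nan],
    'Question 1 opt. C': ['yes', np.nan, np.nan],
    'Question 2 opt. A': [np.nan, 'Uber', 'Didi'],
    'Question 2 opt. B': [np.nan, np.nan, np.nan],
})

# One label per column, e.g. 'Question 1'
g = df.columns.str.extract(r'(Question \d+)', expand=False)

# Transpose so the original columns become rows, group those rows by
# question label, keep the first non-null value, then transpose back.
out = df.T.groupby(g.to_numpy()).first().T
print(out)
```

The transposes cost a copy of the data, which is unlikely to matter for survey-sized frames.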

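Try this. pd.wide_to_long reshapes the option columns into rows and puts the option label in a new Option index level; the result is then reduced to one row per option. Note that .max(level=1) only works on pandas versions before 2.0, where the level argument was removed; the documented replacement is .groupby(level=1).max().

(pd.wide_to_long(df.reset_index(), ['Question 1', 'Question 2'], 'index', 'Option', sep=' ', suffix='.*')
  .dropna(how='all')
  .max(level=1)
  .reset_index())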

Output:

   Option Question 1 Question 2
0  opt. C        yes        NaN
1  opt. A        NaN       Uber
2  opt. B       None        NaN
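The output above is indexed by option rather than by original row. If the goal is to tag each row's answer with the option it came from (the "plus" mentioned in the question), one possible sketch is to melt the frame, split the column name into question and option, and pivot back. The names long, parts, and tags are illustrative, and the sample data assumes "None" is a literal answer string.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question.
# Assumption: 'None' is a literal answer string, not a missing value.
df = pd.DataFrame({
    'Question 1 opt. A': [np.nan, np.nan, np.nan],
    'Question 1 opt. B': [np.nan, 'None', np.nan],
    'Question 1 opt. C': ['yes', np.nan, np.nan],
    'Question 2 opt. A': [np.nan, 'Uber', 'Didi'],
    'Question 2 opt. B': [np.nan, np.nan, np.nan],
})

# Long format: one row per (original row, column) pair that has an answer
long = (df.reset_index()
          .melt(id_vars='index', var_name='column', value_name='answer')
          .dropna(subset=['answer']))

# Split 'Question 1 opt. C' into question='Question 1', option='opt. C'
parts = long['column'].str.extract(r'(?P<question>Question \d+) (?P<option>opt\. \w+)')
long = long.join(parts)

# One column per question, holding the option that was answered in each row.
# Assumes options are mutually exclusive, as stated in the question.
tags = (long.pivot(index='index', columns='question', values='option')
            .reindex(df.index))
print(tags)
```

Swapping values='option' for values='answer' in the pivot gives the merged answers themselves, so both tables can be built from the same long frame.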

Use fillna to replace None and NaN with an empty string. The rest is simple string concatenation, which works because at most one option per question is filled in each row.

Code:

import pandas as pd
import numpy as np

data = {'Question 1 opt. A' : [np.nan, np.nan, np.nan],
        'Question 1 opt. B' : [np.nan, None, np.nan],
        'Question 1 opt. C' : ['yes', np.nan, np.nan],
        'Question 2 opt. A' : [np.nan, 'Uber','Didi'],
        'Question 2 opt. B' : [np.nan, np.nan, np.nan]}
        
df = pd.DataFrame(data)
print(df)
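# Replace every missing value (NaN and None) with an empty string
df.fillna('', inplace=True)
# Concatenating the option columns leaves only the one non-empty answer per row
df['Question 1'] = df['Question 1 opt. A'] + df['Question 1 opt. B'] + df['Question 1 opt. C']
df['Question 2'] = df['Question 2 opt. A'] + df['Question 2 opt. B']
print(df)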

Output:

   Question 1 opt. A  Question 1 opt. B Question 1 opt. C Question 2 opt. A  Question 2 opt. B
0                NaN                NaN               yes               NaN                NaN
1                NaN                NaN               NaN              Uber                NaN
2                NaN                NaN               NaN              Didi                NaN
  Question 1 opt. A Question 1 opt. B Question 1 opt. C Question 2 opt. A Question 2 opt. B Question 1 Question 2
0                                                   yes                                            yes
1                                                                    Uber                                    Uber
2                                                                    Didi                                    Didi
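With many questions, listing every option column by hand gets tedious. The concatenation idea can be generalized by deriving the question label from each column name; the sketch below is one way to do that, under the assumption that every column follows the "Question N opt. X" naming pattern. It loops over the question labels (a handful of iterations), not over rows, and the name merged is illustrative.

```python
import numpy as np
import pandas as pd

data = {'Question 1 opt. A': [np.nan, np.nan, np.nan],
        'Question 1 opt. B': [np.nan, None, np.nan],
        'Question 1 opt. C': ['yes', np.nan, np.nan],
        'Question 2 opt. A': [np.nan, 'Uber', 'Didi'],
        'Question 2 opt. B': [np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)

# Question label for each column, e.g. 'Question 1'
questions = df.columns.str.extract(r'(Question \d+)', expand=False)

merged = pd.DataFrame(index=df.index)
for q in questions.unique():
    # All option columns that belong to this question
    cols = df.columns[questions == q]
    # Drop missing values in each row and join whatever answers remain
    merged[q] = df[cols].apply(
        lambda row: ''.join(row.dropna().astype(str)), axis=1)

print(merged)
```

As in the answer above, rows with no answer for a question end up as empty strings rather than NaN.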
