
Merge pandas DataFrame columns starting with the same letters

Let's say I have a DataFrame:

>>> df = pd.DataFrame({'a1':[1,2],'a2':[3,4],'b1':[5,6],'b2':[7,8],'c':[9,0]})
>>> df
   a1  a2  b1  b2  c
0   1   3   5   7  9
1   2   4   6   8  0
>>> 

And I want to merge (maybe not merge, but concatenate) the columns whose names start with the same letter, such as a1 and a2 and so on... but as we see, there is a c column that stands by itself without any similar ones, so I don't want it to throw errors; instead, it should be padded with NaNs.

I want to merge in a way that changes a wide DataFrame into a long DataFrame, basically a wide-to-long modification.
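For reference, a plain melt already performs the basic wide-to-long reshape described here (a minimal illustration, assuming only the df defined above; not part of the original question):

```python
import pandas as pd

df = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4], 'b1': [5, 6], 'b2': [7, 8], 'c': [9, 0]})

# melt stacks every column into (variable, value) pairs -- one row per original cell
long = df.melt()
print(long)
```

The answers below build on this step by regrouping the melted variables by their first letter.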

I already have a solution to the problem, but it's very inefficient. I would like a more efficient and faster solution (unlike mine :P). I currently have a for loop and a try/except (ugh, sounds bad already), like this:

>>> import numpy as np
>>> df2 = pd.DataFrame()
>>> for i in df.columns.str[:1].unique():
...     try:
...         df2[i] = df[[x for x in df.columns if x[:1] == i]].values.flatten()
...     except ValueError:
...         l = df[[x for x in df.columns if x[:1] == i]].values.flatten().tolist()
...         df2[i] = l + [np.nan] * (len(df2) - len(l))


>>> df2
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
>>> 

I would like to obtain the same results with better code.

I'd recommend melt, followed by pivot. To resolve duplicates, you'll need to pivot on a cumcounted column.

u = df.melt()
u['variable'] = u['variable'].str[0]  # extract the first letter
u.assign(count=u.groupby('variable').cumcount()).pivot(
    index='count', columns='variable', values='value')  # keyword args required in pandas >= 2.0

variable    a    b    c
count                  
0         1.0  5.0  9.0
1         2.0  6.0  0.0
2         3.0  7.0  NaN
3         4.0  8.0  NaN

This can be rewritten as:

u = df.melt()
u['variable'] = [x[0] for x in u['variable']]
u.insert(0, 'count', u.groupby('variable').cumcount())

u.pivot(*u)  # unpacks column names as positional args; works only in pandas < 2.0

variable    a    b    c
count                  
0         1.0  5.0  9.0
1         2.0  6.0  0.0
2         3.0  7.0  NaN
3         4.0  8.0  NaN

If performance matters, here's an alternative with pd.concat:

from operator import itemgetter

pd.concat({
    k: pd.Series(g.values.ravel())
    for k, g in df.groupby(itemgetter(0), axis=1)
}, axis=1)

   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN

We can try groupby on the columns (axis=1):

def f(g,a):
    ret = g.stack().reset_index(drop=True)
    ret.name = a
    return ret

pd.concat( (f(g,a) for a,g in df.groupby(df.columns.str[0], axis=1)), axis=1)

Output:

    a   b   c
0   1   5   9.0
1   3   7   0.0
2   2   6   NaN
3   4   8   NaN

Use a dictionary comprehension:

df = pd.DataFrame({i: pd.Series(x.to_numpy().ravel()) 
                      for i, x in df.groupby(lambda x: x[0], axis=1)})
print (df)
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN

I know this is not as good as using melt, but it packs everything into one line; if you do need a faster solution, try cs95's answer.

df.groupby(df.columns.str[0], axis=1).agg(lambda x: x.tolist()).sum().apply(pd.Series).T
Out[391]: 
     a    b    c
0  1.0  5.0  9.0
1  3.0  7.0  0.0
2  2.0  6.0  NaN
3  4.0  8.0  NaN

Using rename and groupby.apply:

import numpy as np

df = (df.rename(columns=dict(zip(df.columns, df.columns.str[:1])))
        .groupby(level=0, axis=1, group_keys=False)
        .apply(lambda x: pd.DataFrame(x.values.flat, columns=np.unique(x.columns))))

print(df)
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN

Using pd.concat with melt and groupby:

pd.concat([d.T.melt(value_name=k)[k] for k, d in df.groupby(df.columns.str[0], axis=1)], axis=1)

Output:

   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN

This solution gives a similar answer to cs95's and is two to three times faster.

grouping = df.columns.map(lambda s: int(s[1:]) if len(s) > 1 else 1)
df.columns = df.columns.str[0]   # Make a copy if the original dataframe needs to be retained
result = pd.concat((g for _, g in df.groupby(grouping, axis=1)), 
                   axis=0, ignore_index=True, sort=False)

Output:

    a   b   c
0   1   5   9.0
1   2   6   0.0
2   3   7   NaN
3   4   8   NaN
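One final note: recent pandas releases (2.0+) removed axis=1 from groupby, so most of the answers above need adjusting there. A minimal sketch of the same idea on current pandas, grouping the transposed frame instead (assuming the df from the question):

```python
import pandas as pd

df = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4], 'b1': [5, 6], 'b2': [7, 8], 'c': [9, 0]})

# Group rows of the transposed frame by the first letter of the original column
# names, then flatten each group's values; shorter groups are padded with NaN
# when the Series are aligned by pd.concat.
out = pd.concat(
    {k: pd.Series(g.to_numpy().ravel(order='F'))  # 'F' keeps the a1, a2 interleaving
     for k, g in df.T.groupby(df.columns.str[0])},
    axis=1,
)
print(out)
```

This reproduces the output of the question's own solution (a: 1, 3, 2, 4; c padded with NaN).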
