
Merging data frame columns of strings into one single column in Pandas

I have columns in a dataframe (imported from a CSV) containing text like this:

"New york", "Atlanta", "Mumbai"
"Beijing", "Paris", "Budapest"
"Brussels", "Oslo", "Singapore"

I want to collapse/merge all the columns into one single column, like this:

New york Atlanta Mumbai
Beijing Paris Budapest
Brussels Oslo Singapore

How can I do this in pandas?

Suppose you have a DataFrame like so:

>>> df
          0        1          2
0  New york  Atlanta     Mumbai
1   Beijing    Paris   Budapest
2  Brussels     Oslo  Singapore
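
To follow along, the example frame can be rebuilt directly (a minimal reconstruction; the default integer column labels 0, 1, 2 are assumed):

import pandas as pd

# Rebuild the example frame from the question; columns default to 0, 1, 2.
df = pd.DataFrame([
    ["New york", "Atlanta", "Mumbai"],
    ["Beijing", "Paris", "Budapest"],
    ["Brussels", "Oslo", "Singapore"],
])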

Then a simple use of the pd.DataFrame.apply method will work nicely:

>>> df.apply(" ".join, axis=1)
0    New york Atlanta Mumbai
1     Beijing Paris Budapest
2    Brussels Oslo Singapore
dtype: object

Note that I have to pass axis=1 so that the join is applied across the columns rather than down the rows, i.e.:

>>> df.apply(" ".join, axis=0)
0    New york Beijing Brussels
1           Atlanta Paris Oslo
2    Mumbai Budapest Singapore
dtype: object
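
If some cells are numeric or NaN, " ".join raises a TypeError because it only accepts strings. A common workaround (an assumption about messier data, not part of the original answer) is to fill and cast first:

# Fill missing values and cast everything to str so " ".join sees only strings.
df.fillna("").astype(str).apply(" ".join, axis=1)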

A faster (but uglier) version uses .str.cat:

df[0].str.cat(df.ix[:, 1:].T.values, sep=' ')

0    New york Atlanta Mumbai
1     Beijing Paris Budapest
2    Brussels Oslo Singapore
Name: 0, dtype: object

On a larger (10k x 5) DataFrame:

%timeit df.apply(" ".join, axis=1)
10 loops, best of 3: 112 ms per loop

%timeit df[0].str.cat(df.ix[:, 1:].T.values, sep=' ')
100 loops, best of 3: 4.48 ms per loop
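
Note that .ix has since been removed from pandas. On current versions, Series.str.cat also accepts a DataFrame directly, so the transpose/.values juggling can be dropped; this is my own adaptation, not the code that was timed above:

# Modern equivalent: .iloc for positional slicing, and str.cat joining the
# remaining columns of the DataFrame row by row.
df[0].str.cat(df.iloc[:, 1:], sep=" ")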

Here are a couple more ways:

def pir(df):
    # Insert literal single-space columns between the existing ones,
    # then let the row-wise string sum concatenate everything.
    df = df.copy()
    df.insert(2, 's', ' ', allow_duplicates=True)
    df.insert(1, 's', ' ', allow_duplicates=True)
    return df.sum(axis=1)

def pir2(df):
    # Build a MultiIndex from the columns, turn it into a Series of tuples,
    # and join each tuple with a space.
    df = df.copy()
    return pd.MultiIndex.from_arrays(df.values.T).to_series().str.join(' ').reset_index(drop=True)

def pir3(df):
    # Concatenate the underlying object arrays column by column.
    a = df.values[:, 0].copy()
    for j in range(1, df.shape[1]):
        a += ' ' + df.values[:, j]
    return pd.Series(a)
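
As a quick sanity check (my own addition), each helper should agree with the apply version on the example frame:

# Each helper returns one joined string per row.
print(pir3(df).tolist())
# ['New york Atlanta Mumbai', 'Beijing Paris Budapest', 'Brussels Oslo Singapore']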

Timing

pir3 seems to be the fastest over a small df.


pir3 is still the fastest over a larger df of 30,000 rows.


If you prefer something more explicit...

Starting with a dataframe df that looks like this:

>>> df
          A         B          C
0  New york   Beijing   Brussels
1   Atlanta     Paris       Oslo
2    Mumbai  Budapest  Singapore

You can create a new column like this:

df['result'] = df['A'] + ' ' + df['B'] + ' ' + df['C']

In this case the result is stored in the 'result' column of the original DataFrame:

          A         B          C                     result
0  New york   Beijing   Brussels  New york Beijing Brussels
1   Atlanta     Paris       Oslo         Atlanta Paris Oslo
2    Mumbai  Budapest  Singapore  Mumbai Budapest Singapore
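
If the list of columns is long or not known in advance, the same result can be built from a column subset without spelling out every +; this is a generalization using the apply idiom shown earlier, with a hypothetical cols list:

# Join an arbitrary subset of string columns into a new column.
cols = ['A', 'B', 'C']
df['result'] = df[cols].apply(' '.join, axis=1)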

For the sake of completeness:

In [160]: df.add([' '] * (df.columns.size - 1) + ['']).sum(axis=1)
Out[160]:
0    New york Atlanta Mumbai
1     Beijing Paris Budapest
2    Brussels Oslo Singapore
dtype: object

Explanation:

In [162]: [' '] * (df.columns.size - 1) + ['']
Out[162]: [' ', ' ', '']
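
To see what that list does, apply the .add step on its own: every column except the last gets a trailing space appended, so the row-wise sum simply glues each row into one string (an illustrative intermediate step, not from the original answer):

# Broadcast the list across the columns: a trailing space is appended to each
# cell except those in the last column, then sum(axis=1) concatenates each row.
padded = df.add([' '] * (df.columns.size - 1) + [''])
result = padded.sum(axis=1)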

Timing against a 300K-row DF:

In [68]: df = pd.concat([df] * 10**5, ignore_index=True)

In [69]: df.shape
Out[69]: (300000, 3)

In [76]: %timeit df.apply(" ".join, axis=1)
1 loop, best of 3: 5.8 s per loop

In [77]: %timeit df[0].str.cat(df.ix[:, 1:].T.values, sep=' ')
10 loops, best of 3: 138 ms per loop

In [79]: %timeit pir(df)
1 loop, best of 3: 499 ms per loop

In [80]: %timeit pir2(df)
10 loops, best of 3: 174 ms per loop

In [81]: %timeit pir3(df)
10 loops, best of 3: 115 ms per loop

In [159]: %timeit df.add([' '] * (df.columns.size - 1) + ['']).sum(axis=1)
1 loop, best of 3: 478 ms per loop

Conclusion: the current winner is @piRSquared's pir3().
