Merging data frame columns of strings into one single column in Pandas
I have columns in a dataframe (imported from a CSV) containing text like this.
"New york", "Atlanta", "Mumbai"
"Beijing", "Paris", "Budapest"
"Brussels", "Oslo", "Singapore"
I want to collapse/merge all the columns into one single column, like this:
New york Atlanta Mumbai
Beijing Paris Budapest
Brussels Oslo Singapore
How do I do this in pandas?
Suppose you have a DataFrame like so:
>>> df
          0        1          2
0  New york  Atlanta     Mumbai
1   Beijing    Paris   Budapest
2  Brussels     Oslo  Singapore
Then a simple use of the pd.DataFrame.apply method will work nicely:
>>> df.apply(" ".join, axis=1)
0 New york Atlanta Mumbai
1 Beijing Paris Budapest
2 Brussels Oslo Singapore
dtype: object
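As a self-contained sketch of the apply approach above: the frame below is rebuilt by hand rather than read from the asker's CSV, and the astype(str) cast is an addition not present in the original answer. It is there because " ".join raises a TypeError on non-string cells, which a CSV import can easily produce.

```python
import pandas as pd

# Stand-in for the CSV data from the question.
df = pd.DataFrame([
    ["New york", "Atlanta", "Mumbai"],
    ["Beijing", "Paris", "Budapest"],
    ["Brussels", "Oslo", "Singapore"],
])

# Cast every cell to str first (assumption: missing values may be present;
# note they would then show up as the literal text "nan").
merged = df.astype(str).apply(" ".join, axis=1)
print(merged.tolist())
```

If the data is known to contain only strings, the cast can be dropped and the call is exactly the one shown above.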
Note that I have to pass axis=1 so that the function is applied to each row (across the columns) rather than to each column (down the rows). With axis=0 you would get:
>>> df.apply(" ".join, axis=0)
0 New york Beijing Brussels
1 Atlanta Paris Oslo
2 Mumbai Budapest Singapore
dtype: object
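A faster (but uglier) version is with .str.cat. The original form of this answer passed df.ix[:, 1:].T.values, but .ix was removed in pandas 1.0, so the remaining columns are passed directly here:
>>> df[0].str.cat(df.iloc[:, 1:], sep=' ')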
0 New york Atlanta Mumbai
1 Beijing Paris Budapest
2 Brussels Oslo Singapore
Name: 0, dtype: object
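One behaviour of .str.cat worth knowing is how it treats missing values. The sketch below uses a hand-built frame with one deliberately missing cell (not data from the question) and the documented na_rep parameter; without na_rep, any row with a missing cell comes out missing as a whole.

```python
import numpy as np
import pandas as pd

# Small frame with one missing cell, made up for illustration.
df = pd.DataFrame([
    ["New york", "Atlanta", "Mumbai"],
    ["Beijing", np.nan, "Budapest"],
])

# Default: a missing value anywhere in the row makes the result missing.
default = df[0].str.cat(df.iloc[:, 1:], sep=" ")

# With na_rep, missing cells are replaced before joining.
filled = df[0].str.cat(df.iloc[:, 1:], sep=" ", na_rep="")

print(default.tolist())
print(filled.tolist())
```

With na_rep="" the separator is still inserted on both sides of the empty cell, so the second row ends up with two consecutive spaces.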
On a larger (10k x 5) DataFrame:
%timeit df.apply(" ".join, axis=1)
10 loops, best of 3: 112 ms per loop
%timeit df[0].str.cat(df.ix[:, 1:].T.values, sep=' ')
100 loops, best of 3: 4.48 ms per loop
Here are a couple more ways:
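import pandas as pd

def pir(df):
    # Insert a column of spaces between the three columns, then sum each row.
    df = df.copy()
    df.insert(2, 's', ' ', 1)
    df.insert(1, 's', ' ', 1)
    return df.sum(1)

def pir2(df):
    # Treat each row as a MultiIndex tuple and join the tuple elements.
    df = df.copy()
    return pd.MultiIndex.from_arrays(df.values.T).to_series().str.join(' ').reset_index(drop=True)

def pir3(df):
    # Concatenate the underlying object arrays column by column.
    a = df.values[:, 0].copy()
    for j in range(1, df.shape[1]):
        a += ' ' + df.values[:, j]
    return pd.Series(a)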
pir3 seems fastest on a small df.
pir3 is still fastest on a larger df of 30,000 rows.
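The column-by-column idea behind pir3 can be sketched with a configurable separator. The helper name join_columns and its sep parameter are hypothetical and not part of the original answer; the sketch also keeps the frame's index, which pir3 does not.

```python
import pandas as pd

def join_columns(df, sep=" "):
    """Hypothetical generalisation of pir3: join all columns row-wise."""
    values = df.to_numpy(dtype=object)   # plain Python strings
    out = values[:, 0].copy()
    for j in range(1, values.shape[1]):
        # Elementwise string concatenation on object arrays.
        out = out + sep + values[:, j]
    return pd.Series(out, index=df.index)

df = pd.DataFrame([
    ["New york", "Atlanta", "Mumbai"],
    ["Beijing", "Paris", "Budapest"],
    ["Brussels", "Oslo", "Singapore"],
])
result = join_columns(df)
print(result.tolist())
```

Like pir3, this assumes every cell is a string; a missing value would raise a TypeError during concatenation.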
If you prefer something more explicit...
Starting with a dataframe df that looks like this:
>>> df
          A         B          C
0  New york   Beijing   Brussels
1   Atlanta     Paris       Oslo
2    Mumbai  Budapest  Singapore
You can create a new column like this:
df['result'] = df['A'] + ' ' + df['B'] + ' ' + df['C']
In this case the result is stored in the 'result' column of the original DataFrame:
          A         B          C                     result
0  New york   Beijing   Brussels  New york Beijing Brussels
1   Atlanta     Paris       Oslo         Atlanta Paris Oslo
2    Mumbai  Budapest  Singapore  Mumbai Budapest Singapore
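One caveat with the explicit + approach: adding a string to a missing value yields a missing value, so a single empty cell blanks the whole result for that row. A minimal sketch, using a made-up frame with one missing cell and fillna("") as one possible workaround (an assumption about what you want in that case, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["New york", "Atlanta"],
    "B": ["Beijing", np.nan],      # deliberately missing
    "C": ["Brussels", "Oslo"],
})

# Plain concatenation: the row with the missing cell becomes NaN.
plain = df["A"] + " " + df["B"] + " " + df["C"]

# Replace missing cells with empty strings before concatenating.
filled = df.fillna("")
df["result"] = filled["A"] + " " + filled["B"] + " " + filled["C"]

print(plain.tolist())
print(df["result"].tolist())
```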
For the sake of completeness:
In [160]: df1.add([' '] * (df1.columns.size - 1) + ['']).sum(axis=1)
Out[160]:
0 New york Atlanta Mumbai
1 Beijing Paris Budapest
2 Brussels Oslo Singapore
dtype: object
Explanation:
In [162]: [' '] * (df.columns.size - 1) + ['']
Out[162]: [' ', ' ', '']
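To make the explanation concrete, the sketch below splits the one-liner into its two steps. The frame is rebuilt by hand, and dtype=object is an assumption made here so that the cells are plain Python strings and the row-wise sum concatenates them.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["New york", "Atlanta", "Mumbai"],
        ["Beijing", "Paris", "Budapest"],
        ["Brussels", "Oslo", "Singapore"],
    ],
    dtype=object,  # assumption: keep plain Python strings
)

# One suffix per column: a space after every column except the last.
suffixes = [" "] * (df.columns.size - 1) + [""]

# Step 1: the list is aligned with the columns, so each column
# gets its own suffix appended to every cell.
padded = df.add(suffixes)

# Step 2: summing strings along axis=1 concatenates them.
merged = padded.sum(axis=1)

print(padded.iloc[0].tolist())
print(merged.tolist())
```

The empty string for the last column is what prevents a trailing space in the merged result.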
Timing against a 300K-row DataFrame:
In [68]: df = pd.concat([df] * 10**5, ignore_index=True)
In [69]: df.shape
Out[69]: (300000, 3)
In [76]: %timeit df.apply(" ".join, axis=1)
1 loop, best of 3: 5.8 s per loop
In [77]: %timeit df[0].str.cat(df.ix[:, 1:].T.values, sep=' ')
10 loops, best of 3: 138 ms per loop
In [79]: %timeit pir(df)
1 loop, best of 3: 499 ms per loop
In [80]: %timeit pir2(df)
10 loops, best of 3: 174 ms per loop
In [81]: %timeit pir3(df)
10 loops, best of 3: 115 ms per loop
In [159]: %timeit df.add([' '] * (df.columns.size - 1) + ['']).sum(axis=1)
1 loop, best of 3: 478 ms per loop
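Outside IPython, a comparison like the one above could be sketched with the standard timeit module. The candidate set, the 30,000-row size and the candidates/timings names are choices made for this sketch; the .str.cat entry uses .iloc rather than .ix so it runs on current pandas, and the numbers will differ from those quoted above depending on machine and pandas version.

```python
import timeit
import pandas as pd

small = pd.DataFrame([
    ["New york", "Atlanta", "Mumbai"],
    ["Beijing", "Paris", "Budapest"],
    ["Brussels", "Oslo", "Singapore"],
])
big = pd.concat([small] * 10**4, ignore_index=True)  # 30,000 rows

# A few of the approaches discussed above, keyed by a short label.
candidates = {
    "apply": lambda d: d.apply(" ".join, axis=1),
    "str.cat": lambda d: d[0].str.cat(d.iloc[:, 1:], sep=" "),
    "explicit": lambda d: d[0] + " " + d[1] + " " + d[2],
}

reference = candidates["apply"](big).tolist()
timings = {}
for name, func in candidates.items():
    # Check each approach produces the same strings before timing it.
    assert func(big).tolist() == reference
    timings[name] = min(timeit.repeat(lambda: func(big), number=1, repeat=3))

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {seconds * 1000:8.1f} ms")
```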
Conclusion: the current winner is @piRSquared's pir3()