How to combine multiple lists of string columns in Python?
I have a Python pandas DataFrame.
I am trying to create a new column, total_str, that is a list of the values in colA and colB.
This is the expected output:
colA colB total_str
0 ['a','b','c'] ['a','b','c'] ['a','b','c','a','b','c']
1 ['a','b','c'] nan ['a','b','c']
2 ['a','b','c'] ['d','e'] ['a','b','c','d','e']
# Replace NaN with an empty list, then concatenate colA and colB using sum.
df['total_str'] = df.applymap(lambda x: [] if x is np.nan else x).apply(lambda x: sum(x,[]), axis=1)
df
Out[705]:
colA colB total_str
0 [a, b, c] [a, b, c] [a, b, c, a, b, c]
1 [a, b, c] NaN [a, b, c]
2 [a, b, c] [d, e] [a, b, c, d, e]
If the DataFrame has other columns as well, you can use:
df['total_str'] = df.applymap(lambda x: [] if x is np.nan else x).apply(lambda x: x.colA+x.colB, axis=1)
itertools.chain does the trick for you.
itertools.chain(*filter(bool, [colA, colB]))
This returns an iterator; if you need a list, wrap the result in list, for example:
import itertools
def test(colA, colB):
    # filter(bool, ...) drops None and empty lists before chaining
    total_str = itertools.chain(*filter(bool, [colA, colB]))
    print(list(total_str))
test(['a', 'b'], ['c']) # output: ['a', 'b', 'c']
test(['a', 'b', 'd'], None) # output: ['a', 'b', 'd']
test(['a', 'b', 'd'], ['x', 'y', 'z']) # output: ['a', 'b', 'd', 'x', 'y', 'z']
test(None, None) # output: []
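One caveat with the filter(bool, ...) idiom: it drops None and empty lists, but a missing value stored as a float NaN is truthy, so it passes the filter and itertools.chain then fails when it tries to iterate over it. Below is a minimal sketch of a stricter variant, assuming every valid cell is a Python list. The name combine_lists is hypothetical and not part of the original answer.

```python
import itertools

def combine_lists(*values):
    """Concatenate the arguments that are lists; skip None, NaN and other non-lists."""
    # isinstance keeps real lists only, so a float NaN never reaches chain
    return list(itertools.chain.from_iterable(v for v in values if isinstance(v, list)))

nan = float("nan")
print(bool(nan))                               # True: NaN is truthy, so filter(bool, ...) keeps it
print(combine_lists(['a', 'b', 'c'], nan))     # ['a', 'b', 'c']
print(combine_lists(['a', 'b'], None, ['c']))  # ['a', 'b', 'c']
```

On a DataFrame, such a function could be applied row by row to the two columns.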
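I assume you want to handle numpy.nan and None in the data. You can simply write a helper function that replaces them with an empty list when creating the new column. It isn't clean, but it works.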
def helper(x):
    # Replace np.nan and None with an empty list; return anything else unchanged
    return x if x is not np.nan and x is not None else []
dataframe['total_str'] = dataframe['colA'].map(helper) + dataframe['colB'].map(helper)
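The identity checks in helper match only the np.nan object itself and None, so a NaN that reaches the column as a different float object (for example float('nan')) would not be replaced. A sketch of a type-based variant follows, assuming every valid cell is a Python list. The name as_list and the sample frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def as_list(x):
    """Return x if it is a list, otherwise an empty list (covers None and any NaN)."""
    return x if isinstance(x, list) else []

dataframe = pd.DataFrame({
    'colA': [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']],
    'colB': [['a', 'b', 'c'], np.nan, ['d', 'e']],
})

# Build the new column row by row; zip pairs up the cells of the two columns
combined = [as_list(a) + as_list(b)
            for a, b in zip(dataframe['colA'], dataframe['colB'])]
dataframe['total_str'] = pd.Series(combined, index=dataframe.index)

print(dataframe['total_str'].tolist())
# [['a', 'b', 'c', 'a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd', 'e']]
```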
For a faster solution, use combine_first to replace NaN with an empty list:
df['total_str'] = df['colA'] + df['colB'].combine_first(pd.Series([[]], index=df.index))
print (df)
colA colB total_str
0 [a, b, c] [a, b, c] [a, b, c, a, b, c]
1 [a, b, c] NaN [a, b, c]
2 [a, b, c] [d, e] [a, b, c, d, e]
Alternatively, using Series.add:
df['total_str'] = df['colA'].add(df['colB'].combine_first(pd.Series([[]], index=df.index)))
print (df)
colA colB total_str
0 [a, b, c] [a, b, c] [a, b, c, a, b, c]
1 [a, b, c] NaN [a, b, c]
2 [a, b, c] [d, e] [a, b, c, d, e]
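Depending on the pandas version, pd.Series([[]], index=df.index) may raise a ValueError, because a one-element list is paired with a longer index. Here is a hedged sketch of the same combine_first idea with one empty list per row. The variable name empty is an assumption, and the frame is rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colA': [['a', 'b', 'c']] * 3,
    'colB': [['a', 'b', 'c'], np.nan, ['d', 'e']],
})

# One separate empty list per row, aligned on the frame's index
empty = pd.Series([[] for _ in range(len(df))], index=df.index)

# combine_first keeps the original value where present and falls back to [] where it is NaN
df['total_str'] = df['colA'] + df['colB'].combine_first(empty)

print(df['total_str'].tolist())
# [['a', 'b', 'c', 'a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd', 'e']]
```

If colA can also contain NaN, the same combine_first(empty) call can be applied to it before adding.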
Timings:
df = pd.DataFrame({'colA': [['a','b','c']] * 3, 'colB':[['a','b','c'], np.nan, ['d','e']]})
#[30000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
In [62]: %timeit df['total_str'] = df['colA'].combine_first(pd.Series([[]], index=df.index)) + df['colB'].combine_first(pd.Series([[]], index=df.index))
100 loops, best of 3: 8.1 ms per loop
In [63]: %timeit df['total_str1'] = df['colA'].fillna(pd.Series([[]], index=df.index)) + df['colB'].fillna(pd.Series([[]], index=df.index))
100 loops, best of 3: 9.1 ms per loop
In [64]: %timeit df['total_str2'] = df.applymap(lambda x: [] if x is np.nan else x).apply(lambda x: x.colA+x.colB, axis=1)
1 loop, best of 3: 960 ms per loop
You can add columns in pandas like this:
dataframe['total_str'] = dataframe['colA'] + dataframe['colB']