Merged two dataframe columns with lists in order of lists
I'm trying to merge/concatenate two columns that hold related but separate text data delimited by "|", while also replacing certain names with "" and replacing the | with '\n'.
For example, the original data may be:
First Names Last Names
0 Jim|James|Tim Simth|Jacobs|Turner
1 Mickey|Mini Mouse|Mouse
2 Mike|Billy|Natasha Mills|McGill|Tsaka
If I want to merge/concatenate to derive Full Names and remove entries tied to "Smith" (spelled "Simth" in the sample data), the final df should look like:
First Names Last Names Full Names
0 Jim|James|Tim Simth|Jacobs|Turner James Jacobs\nTim Turner
1 Mickey|Mini Mouse|Mouse Mickey Mouse\nMini Mouse
2 Mike|Billy|Natasha Mills|McGill|Tsaka Mike Mills\nBilly McGill\nNatasha Tsaka
My current approach so far has been:
import re

import pandas as pd

def parse_merge(df, col1, col2, splitter, new_col, list_to_exclude):
    # Remember the original row order so melted rows can be traced back.
    orig_order = pd.Series(list(df.index)).rename('index')
    # Split each column on the delimiter, one piece per new column.
    col1_df = pd.concat([orig_order, df[col1], df[col1].str.split(splitter, expand=True)], axis=1)
    col2_df = pd.concat([orig_order, df[col2], df[col2].str.split(splitter, expand=True)], axis=1)
    # Melt to long form: one row per (original row, position) pair.
    col1_melt = pd.melt(col1_df, id_vars=['index', col1], var_name='count')
    col2_melt = pd.melt(col2_df, id_vars=['index', col2], var_name='count')
    col2_melt['value'] = '(' + col2_melt['value'].astype(str) + ')'
    col2_melt = col2_melt.rename(columns={'value': 'value2'})
    melted_merge = pd.concat([col1_melt, col2_melt['value2']], axis=1)
    if len(list_to_exclude) > 0:
        list_map = map(re.escape, list_to_exclude)
        # .loc is used here because the .ix indexer has been removed from pandas.
        melted_merge.loc[melted_merge['value2'].str.contains('|'.join(list_map)), ['value', 'value2']] = ''
    melted_merge[new_col] = melted_merge['value'] + " " + melted_merge['value2']
    return melted_merge
If I call:
parse_merge(names, 'First Names', 'Last Names', '|', 'Full Names', ['Smith'])
The data becomes:
Index First Names count value value2 Full Names
0 0 Jim|James|Tim 0 Jim Smith ''
1 1 Mickey|Mini 0 Mickey Mouse Mickey Mouse
2 2 Mike|Billy|Natasha 0 Mike Mills Mike Mills
I'm just not sure how to finish this without any loops, or whether there is a more efficient or totally different approach.
Thanks for all the input!
Here is a condensed solution using pd.DataFrame.apply and some of Python's nice built-in features:
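def combine_names(row):
    # Pair the i-th first name with the i-th last name.
    pairs = list(zip(row.iloc[0].split('|'), row.iloc[1].split('|')))
    # Drop pairs whose last name is 'Simth' (as spelled in the sample data).
    return '\n'.join([' '.join(p) for p in pairs if p[1] != 'Simth'])

df['Full Names'] = df.apply(combine_names, axis=1)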
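The question asks for a list of names to exclude rather than one hard-coded name. A minimal sketch of how the apply-based approach might be generalized is below; the function name `combine_names_excluding` and its `exclude` and `sep` parameters are hypothetical, and the sample frame is rebuilt from the question's data.

```python
import pandas as pd

# Sample data copied from the question ("Simth" is spelled as it appears there).
df = pd.DataFrame({
    "First Names": ["Jim|James|Tim", "Mickey|Mini", "Mike|Billy|Natasha"],
    "Last Names": ["Simth|Jacobs|Turner", "Mouse|Mouse", "Mills|McGill|Tsaka"],
})


def combine_names_excluding(row, exclude=("Simth",), sep="|"):
    """Hypothetical variant of combine_names with a configurable exclusion list."""
    firsts = row["First Names"].split(sep)
    lasts = row["Last Names"].split(sep)
    # zip pairs the i-th first name with the i-th last name.
    kept = [f"{first} {last}" for first, last in zip(firsts, lasts)
            if last not in exclude]
    return "\n".join(kept)


df["Full Names"] = df.apply(combine_names_excluding, axis=1)
print(df["Full Names"].tolist())
```

Looking columns up by label instead of position keeps the function working even if the column order changes.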
I really like @AlexG's solution - please use it.
Here is my attempt at a creative one-liner solution. It's absolutely perverse, so it should NOT be used; it's just for fun:
In [78]: df
Out[78]:
First Names Last Names
0 Jim|James|Tim Simth|Jacobs|Turner
1 Mickey|Mini Mouse|Mouse
2 Mike|Billy|Natasha Mills|McGill|Tsaka
In [79]: df['Full Names'] = \
...: (df.stack()
...: .str.split(r'\|', expand=True)
...: .unstack(level=1)
...: .groupby(level=0, axis=1)
...: .apply(lambda x: x.add(' ').sum(axis=1).str.strip())
...: .replace([r'\w+\s+Simth'], [np.nan], regex=True)
...: .apply(lambda x: x.dropna().str.cat(sep='\n'), axis=1)
...: )
...:
In [80]: df
Out[80]:
First Names Last Names Full Names
0 Jim|James|Tim Simth|Jacobs|Turner James Jacobs\nTim Turner
1 Mickey|Mini Mouse|Mouse Mickey Mouse\nMini Mouse
2 Mike|Billy|Natasha Mills|McGill|Tsaka Mike Mills\nBilly McGill\nNatasha Tsaka
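For comparison, here is a multi-step sketch of the same "reshape to long form, filter, then join back" idea. It uses `DataFrame.explode` rather than stack/unstack, so it is a different technique from the one-liner above, not a rewrite of it. It assumes pandas 1.3 or newer (needed for exploding several columns at once), and the `exclude` list is a hypothetical name.

```python
import pandas as pd

# Sample data copied from the question ("Simth" is spelled as it appears there).
df = pd.DataFrame({
    "First Names": ["Jim|James|Tim", "Mickey|Mini", "Mike|Billy|Natasha"],
    "Last Names": ["Simth|Jacobs|Turner", "Mouse|Mouse", "Mills|McGill|Tsaka"],
})
exclude = ["Simth"]  # hypothetical exclusion list
cols = ["First Names", "Last Names"]

# Turn every cell into a list of names, then explode both columns in lockstep
# so each (first, last) pair gets its own row under the original index label.
split = pd.DataFrame({c: df[c].str.split("|") for c in cols})
long = split.explode(cols)

# Drop the excluded last names, then glue the remaining pairs back together
# per original row.
long = long[~long["Last Names"].isin(exclude)]
full = (long["First Names"] + " " + long["Last Names"]).groupby(level=0).agg("\n".join)

# Assignment aligns on the index; a row whose names were all excluded gets NaN.
df["Full Names"] = full
print(df["Full Names"].tolist())
```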
I've got a lot of comprehension
l = df.values.tolist()
['|'.join(n)
 for n in [[' '.join(z)
            for z in zip(*[s.split('|') for s in r])
            if z[1] != 'Simth']
           for r in l]]
['James Jacobs|Tim Turner',
'Mickey Mouse|Mini Mouse',
'Mike Mills|Billy McGill|Natasha Tsaka']
l = df.values.tolist()
df['Full Names'] = [
    '|'.join(n)
    for n in [[' '.join(z)
               for z in zip(*[s.split('|') for s in r])
               if z[1] != 'Simth']
              for r in l]]
df
Wordplay aside, this is pretty snappy on the sample data.
Longer explanation
l
[['Jim|James|Tim', 'Simth|Jacobs|Turner'],
['Mickey|Mini', 'Mouse|Mouse'],
['Mike|Billy|Natasha', 'Mills|McGill|Tsaka']]
l is a list of lists. I will make extensive use of list comprehensions and iterables.
Each sub-list holds two strings, which I split and zip into a "list" of tuples of (first, last) names.
I'll use a condition such as if z[1] != 'Simth' (matching the spelling in the sample data) to filter out the Smiths.
This can be altered to z[1] not in list_of_names to exclude several names at once.
Then I use ' '.join (that's actually a function) to combine each tuple into first last.
Then I use '|'.join to combine the sub-list of first last entries into first1 last1|first2 last2 ... and so on.
The reason this is quicker is that comprehensions have been optimized to a great extent. The other solutions use apply, which is a generic looping structure that can only leverage fast looping under special circumstances (someone who knows more, please correct me if I'm wrong). Using lambda is definitely not one of those circumstances.
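The steps described above can be unrolled into an explicit loop, which may be easier to follow than the nested comprehension. This is only a sketch: the `rows` list stands in for `df.values.tolist()`, and `list_of_names` is a hypothetical exclusion set using the spelling from the sample data.

```python
# Stand-in for df.values.tolist() on the question's sample data.
rows = [
    ["Jim|James|Tim", "Simth|Jacobs|Turner"],
    ["Mickey|Mini", "Mouse|Mouse"],
    ["Mike|Billy|Natasha", "Mills|McGill|Tsaka"],
]
list_of_names = {"Simth"}  # hypothetical exclusion set

full_names = []
for r in rows:
    # Step 1: split both strings -> [[first names...], [last names...]]
    split_cols = [s.split("|") for s in r]
    # Step 2: zip them -> [(first, last), ...]
    pairs = list(zip(*split_cols))
    # Step 3: filter out excluded last names
    kept = [z for z in pairs if z[1] not in list_of_names]
    # Step 4: join each tuple into "first last"
    joined = [" ".join(z) for z in kept]
    # Step 5: join the row's names into "first1 last1|first2 last2"
    full_names.append("|".join(joined))

print(full_names)
```

Swapping "|" for "\n" in step 5 would produce the newline-separated form the question asks for.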