
Merged two dataframe columns with lists in order of lists

I'm trying to merge/concatenate two columns that both hold related but separate text data delimited by "|", in addition to replacing certain names with "" and replacing the | with '\n'.

For example, the original data may be:

    First Names            Last Names
0   Jim|James|Tim          Simth|Jacobs|Turner
1   Mickey|Mini            Mouse|Mouse
2   Mike|Billy|Natasha     Mills|McGill|Tsaka

If I want to merge/concatenate to derive Full Names and remove entries tied to "Smith", the final df should look like:

    First Names            Last Names            Full Names
0   Jim|James|Tim          Simth|Jacobs|Turner   James Jacobs\nTim Turner
1   Mickey|Mini            Mouse|Mouse           Mickey Mouse\nMini Mouse
2   Mike|Billy|Natasha     Mills|McGill|Tsaka    Mike Mills\nBilly McGill\nNatasha Tsaka

My approach so far has been:

import re
import pandas as pd

def parse_merge(df, col1, col2, splitter, new_col, list_to_exclude):

    # Keep the original row order so the pieces can be lined up again later
    orig_order = pd.Series(list(df.index)).rename('index')

    # Split each delimited column into one column per name
    col1_df = pd.concat([orig_order, df[col1], df[col1].str.split(splitter, expand=True)], axis = 1)
    col2_df = pd.concat([orig_order, df[col2], df[col2].str.split(splitter, expand=True)], axis = 1)

    # Reshape to long form: one row per (original row, position) pair
    col1_melt = pd.melt(col1_df, id_vars=['index', col1], var_name='count')
    col2_melt = pd.melt(col2_df, id_vars=['index', col2], var_name='count')

    col2_melt['value'] = '(' + col2_melt['value'].astype(str) + ')'
    col2_melt = col2_melt.rename(columns={'value':'value2'})

    melted_merge = pd.concat([col1_melt, col2_melt['value2']], axis = 1 )

    if len(list_to_exclude) > 0:
        # Blank out any pair whose second value matches an excluded name
        pattern = '|'.join(map(re.escape, list_to_exclude))
        mask = melted_merge['value2'].str.contains(pattern)
        melted_merge.loc[mask, ['value', 'value2']] = ''

    melted_merge[new_col] = melted_merge['value'] + " " + melted_merge['value2']

    return melted_merge

If I call:

parse_merge(names, 'First Names', 'Last Names', '|', 'Full Names', ['Smith'])

The data becomes:

    Index   First Names        count    value            value2        Full Names
0   0       Jim|James|Tim      0        Jim              Smith         ''
1   1       Mickey|Mini        0        Mickey           Mouse         Mickey Mouse
2   2       Mike|Billy|Natasha 0        Mike             Mills         Mike Mills

I'm just not sure how to finish this without any loops, or whether there is a more efficient or totally different approach.

Thanks for all the input!

Here is a condensed solution using pd.DataFrame.apply and some of Python's nice built-in features:

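def combine_names(row):
    # Pair each first name with the last name at the same position
    pairs = list(zip(row.iloc[0].split('|'), row.iloc[1].split('|')))
    # Drop the excluded surname (spelled 'Simth' in the sample data); one full name per line
    return '\n'.join([' '.join(p) for p in pairs if p[1] != 'Simth'])

df['Full Name'] = df.apply(combine_names, axis=1)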
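
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same apply-based idea on the sample data. The helper name combine_names_excluding and its excluded parameter are my own hypothetical additions (not part of the answer above) so that more than one surname can be dropped. Note that the sample data spells the surname 'Simth', so that is the value excluded here.

```python
import pandas as pd

# Sample data copied from the question (the surname is spelled 'Simth' there).
df = pd.DataFrame({
    'First Names': ['Jim|James|Tim', 'Mickey|Mini', 'Mike|Billy|Natasha'],
    'Last Names': ['Simth|Jacobs|Turner', 'Mouse|Mouse', 'Mills|McGill|Tsaka'],
})

def combine_names_excluding(row, excluded):
    # Hypothetical helper: pair first/last names by position, skip excluded surnames.
    firsts = row['First Names'].split('|')
    lasts = row['Last Names'].split('|')
    return '\n'.join(
        f'{first} {last}'
        for first, last in zip(firsts, lasts)
        if last not in excluded
    )

# Extra keyword arguments to DataFrame.apply are forwarded to the function.
df['Full Names'] = df.apply(combine_names_excluding, axis=1, excluded={'Simth'})
print(df['Full Names'].tolist())
```

Passing a set keeps the membership test cheap if the exclusion list grows; a plain list would work just as well for a handful of names.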

I really like @AlexG's solution - please use it.

Here is my attempt at a creative one-liner solution. It's absolutely perverse, so it should NOT be used; it's just for fun:

In [78]: df
Out[78]:
          First Names           Last Names
0       Jim|James|Tim  Simth|Jacobs|Turner
1         Mickey|Mini          Mouse|Mouse
2  Mike|Billy|Natasha   Mills|McGill|Tsaka

In [79]: df['Full Names'] = \
    ...: (df.stack()
    ...:    .str.split(r'\|', expand=True)
    ...:    .unstack(level=1)
    ...:    .groupby(level=0, axis=1)
    ...:    .apply(lambda x: x.add(' ').sum(axis=1).str.strip())
    ...:    .replace([r'\w+\s+Simth'], [np.nan], regex=True)
    ...:    .apply(lambda x: x.dropna().str.cat(sep='\n'), axis=1)
    ...: )
    ...:

In [80]: df
Out[80]:
          First Names           Last Names                               Full Names
0       Jim|James|Tim  Simth|Jacobs|Turner                 James Jacobs\nTim Turner
1         Mickey|Mini          Mouse|Mouse                 Mickey Mouse\nMini Mouse
2  Mike|Billy|Natasha   Mills|McGill|Tsaka  Mike Mills\nBilly McGill\nNatasha Tsaka

I've got a lot of comprehension

l = df.values.tolist()

['|'.join(n)
 for n in [[' '.join(z)
 for z in zip(*[s.split('|')
 for s in r]) if z[1] != 'Smith']
 for r in l]]

['James Jacobs|Tim Turner',
 'Mickey Mouse|Mini Mouse',
 'Mike Mills|Billy McGill|Natasha Tsaka']

l = df.values.tolist()

df['Full Names'] = [
     '|'.join(n)
     for n in [[' '.join(z)
     for z in zip(*[s.split('|')
     for s in r]) if z[1] != 'Smith']
     for r in l]]

df



Word play aside, this is pretty snappy over sample data



Longer explanation

l

[['Jim|James|Tim', 'Simth|Jacobs|Turner'],
 ['Mickey|Mini', 'Mouse|Mouse'],
 ['Mike|Billy|Natasha', 'Mills|McGill|Tsaka']]
  • l is a list of lists. I will make extensive use of list comprehensions and iterables.
  • Each sub-list consists of 2 strings that I will split and zip together.
  • The result of the split and zip will be a "list" of tuples consisting of (first, last) names. I'll use if z[1] != 'Smith' to filter out the Smiths.
    • BTW, in this line you could use z[1] not in list_of_names
  • I'll then use ' '.join (that's actually a function) to combine each tuple into first last
  • I'll then use another '|'.join to combine the sub-list of first last strings into first1 last1|first2 last2 ... and so on
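
The steps above can be unrolled for a single row, which may make the nested comprehension easier to follow. This is only an illustrative sketch: the intermediate variable names are mine, and list_of_names is a hypothetical exclusion list (it includes 'Simth' because that is how the sample data spells the surname).

```python
# One row of `l`, taken from the sample data above.
r = ['Jim|James|Tim', 'Simth|Jacobs|Turner']

# Hypothetical exclusion list; the sample data spells the surname 'Simth'.
list_of_names = ['Simth', 'Smith']

split_cols = [s.split('|') for s in r]   # two lists: first names, last names
pairs = list(zip(*split_cols))           # (first, last) tuples matched by position
kept = [z for z in pairs if z[1] not in list_of_names]  # drop excluded surnames
full = [' '.join(z) for z in kept]       # 'first last' strings
result = '|'.join(full)                  # every remaining name for this row

print(result)  # James Jacobs|Tim Turner
```

The one-line version simply nests these steps and repeats them for every r in l.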

The reason this is quicker is that comprehensions have been optimized to a great extent. The other solutions use apply, which is a generic looping structure that can only leverage fast looping under special circumstances (someone who knows more, please correct me if I'm wrong). Using a lambda is definitely not one of those circumstances.
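
If you want to check the speed claim yourself, here is a rough sketch using timeit. The function names, the 1000x replication factor, and the repeat count are arbitrary choices of mine, and the numbers will depend on your pandas version, hardware, and data size, so treat it as a starting point rather than a benchmark.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'First Names': ['Jim|James|Tim', 'Mickey|Mini', 'Mike|Billy|Natasha'],
    'Last Names': ['Simth|Jacobs|Turner', 'Mouse|Mouse', 'Mills|McGill|Tsaka'],
})

def join_row(row):
    # Row-wise version, in the spirit of the apply-based answer.
    pairs = zip(row.iloc[0].split('|'), row.iloc[1].split('|'))
    return '|'.join(' '.join(p) for p in pairs if p[1] != 'Simth')

def with_apply(frame):
    return frame.apply(join_row, axis=1).tolist()

def with_comprehension(frame):
    # Plain-Python version, in the spirit of the comprehension answer.
    return ['|'.join(' '.join(z)
                     for z in zip(*[s.split('|') for s in r])
                     if z[1] != 'Simth')
            for r in frame.values.tolist()]

# Replicate the three sample rows so the timings are not dominated by overhead.
big = pd.concat([df] * 1000, ignore_index=True)

apply_time = timeit.timeit(lambda: with_apply(big), number=3)
comp_time = timeit.timeit(lambda: with_comprehension(big), number=3)
print(f'apply: {apply_time:.3f}s, comprehension: {comp_time:.3f}s')
```

Both functions return the same list of strings, so the comparison measures only the looping strategy.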
