简体   繁体   中英

Merged two dataframe columns with lists in order of lists

I'm trying to merge/concatenate two columns where both have related, but separate text data delimited by "|" in addition to replacing certain names with "" and replace the | with '\\n'.

For example, the original data may be:

    First Names            Last Names
0   Jim|James|Tim          Simth|Jacobs|Turner
1   Mickey|Mini            Mouse|Mouse
2   Mike|Billy|Natasha     Mills|McGill|Tsaka

If I want to merge/concatenate to derive Full Names and remove entries tied to "Smith" the final df should look like:

    First Names            Last Names            Full Names
0   Jim|James|Tim          Simth|Jacobs|Turner   James Jacobs\nTim Turner
1   Mickey|Mini            Mouse|Mouse           Mickey Mouse\nMini Mouse
2   Mike|Billy|Natasha     Mills|McGill|Tsaka    Mike Mills\nBilly McGill\nNatasha Tsaka

My current approach so far has been:

def parse_merge(df, col1, col2, splitter, new_col, list_to_exclude):

    orig_order = pd.Series(list(df.index)).rename('index')

    col1_df = pd.concat([orig_order, df[col1], df[col1].str.split(splitter, expand=True)], axis = 1)
    col2_df = pd.concat([orig_order, df[col2], df[col2].str.split(splitter, expand=True)], axis = 1)

    col1_melt = pd.melt(col1_df, id_vars=['index', col1], var_name='count')
    col2_melt = pd.melt(col2_df, id_vars=['index', col2], var_name='count')

    col2_melt['value'] = '(' + col2_melt['value'].astype(str) + ')'
    col2_melt = col2_melt.rename(columns={'value':'value2'})

    melted_merge = pd.concat([col1_melt, col2_melt['value2']], axis = 1 )

    if len(list_to_exclude) > 0:
         list_map = map(re.escape, list_to_exclude)

    melted_merge.ix[melted_merge['value2'].str.contains('|'.join(list_map)), ['value', 'value2']] = ''

    melted_merge[new_col] = melted_merge['value'] + " " + melted_merge['value2']

if I call:

parse_merge(names, 'First Names', 'Last Names', 'Full Names', ['Smith'])

The data becomes:

    Index   First Names        count    value            value2        Full Names
0   0       Jim|James|Tim      0        Jim              Smith         ''
1   1       Mickey|Mini        0        Mickey           Mouse         Mickey Mouse
2   2       Mike|Billy|Natasha 0        Mike             Mills         Mike Mills

Just not sure how to finish this out without any loops or if there is a more efficient / totally different approach.

Thanks for all the input!

Here is a condensed solution using pd.DataFrame.apply and some of python's nice built-in features:

def combine_names(row):

    pairs = list(zip(row[0].split('|'), row[1].split('|')))
    return '\n'.join([' '.join(p) for p in pairs if p[1] != 'Simth'])

df['Full Name'] = df.apply(combine_names, axis=1)

I really like @AlexG's solution - please use it.

Here is my attempt to create a creative one-liner solution - it's absolutely perverse, so it should NOT be used - it's just for fun:

In [78]: df
Out[78]:
          First Names           Last Names
0       Jim|James|Tim  Simth|Jacobs|Turner
1         Mickey|Mini          Mouse|Mouse
2  Mike|Billy|Natasha   Mills|McGill|Tsaka

In [79]: df['Full Names'] = \
    ...: (df.stack()
    ...:    .str.split(r'\|', expand=True)
    ...:    .unstack(level=1)
    ...:    .groupby(level=0, axis=1)
    ...:    .apply(lambda x: x.add(' ').sum(axis=1).str.strip())
    ...:    .replace([r'\w+\s+Simth'], [np.nan], regex=True)
    ...:    .apply(lambda x: x.dropna().str.cat(sep='\n'), axis=1)
    ...: )
    ...:

In [80]: df
Out[80]:
          First Names           Last Names                               Full Names
0       Jim|James|Tim  Simth|Jacobs|Turner                 James Jacobs\nTim Turner
1         Mickey|Mini          Mouse|Mouse                 Mickey Mouse\nMini Mouse
2  Mike|Billy|Natasha   Mills|McGill|Tsaka  Mike Mills\nBilly McGill\nNatasha Tsaka

I've got a lot of comprehension

l = df.values.tolist()

['|'.join(n)
 for n in [[' '.join(z)
 for z in zip(*[s.split('|')
 for s in r]) if z[1] != 'Smith']
 for r in l]]

['James Jacobs|Tim Turner',
 'Mickey Mouse|Mini Mouse',
 'Mike Mills|Billy McGill|Natasha Tsaka']

l = df.values.tolist()

df['Full Names'] = [
     '|'.join(n)
     for n in [[' '.join(z)
     for z in zip(*[s.split('|')
     for s in r]) if z[1] != 'Smith']
     for r in l]]

df

在此输入图像描述


word play aside, this is pretty snappy over sample data

在此输入图像描述


longer explanation

l

[['Jim|James|Tim', 'Simth|Jacobs|Turner'],
 ['Mickey|Mini', 'Mouse|Mouse'],
 ['Mike|Billy|Natasha', 'Mills|McGill|Tsaka']]
  • l is a list of lists. I will make extensive use of list comprehensions and iterables.
  • Each sub-list consists of 2 strings that I will split and zip together.
  • The result of the split will be a "list" of tuples consisting of (first, last) names. I'll use if z[1] != 'Smith' to filter out the smiths.
    • BTW, in this line you could use z[1] not in list_of_names
  • I'll then use ' '.join (that's actually a function) to combine each tuple to first last
  • I'll then use another '|'.join to combine the sub-list of first last to first1 last1|first2 last2 ... so on and so forth

The reason why this is quicker is because comprehensions have been optimized to a great extent. The other solutions are using apply which is a generic looping structure that can only leverage fast looping under special circumstances (someone who knows more, please correct me if I'm wrong). Using lambda is definitely not one of those circumstnces.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM