I'm trying to merge/concatenate two columns that hold related but separate text data delimited by "|", while also removing entries tied to certain names and replacing the "|" with '\n'.
For example, the original data may be:
   First Names         Last Names
0  Jim|James|Tim       Simth|Jacobs|Turner
1  Mickey|Mini         Mouse|Mouse
2  Mike|Billy|Natasha  Mills|McGill|Tsaka
If I want to merge/concatenate to derive Full Names and remove entries tied to "Simth", the final df should look like:
   First Names         Last Names           Full Names
0  Jim|James|Tim       Simth|Jacobs|Turner  James Jacobs\nTim Turner
1  Mickey|Mini         Mouse|Mouse          Mickey Mouse\nMini Mouse
2  Mike|Billy|Natasha  Mills|McGill|Tsaka   Mike Mills\nBilly McGill\nNatasha Tsaka
My current approach so far has been:
import re
import pandas as pd

def parse_merge(df, col1, col2, splitter, new_col, list_to_exclude):
    orig_order = pd.Series(list(df.index)).rename('index')
    col1_df = pd.concat([orig_order, df[col1], df[col1].str.split(splitter, expand=True)], axis=1)
    col2_df = pd.concat([orig_order, df[col2], df[col2].str.split(splitter, expand=True)], axis=1)
    col1_melt = pd.melt(col1_df, id_vars=['index', col1], var_name='count')
    col2_melt = pd.melt(col2_df, id_vars=['index', col2], var_name='count')
    col2_melt['value'] = '(' + col2_melt['value'].astype(str) + ')'
    col2_melt = col2_melt.rename(columns={'value': 'value2'})
    melted_merge = pd.concat([col1_melt, col2_melt['value2']], axis=1)
    if len(list_to_exclude) > 0:
        list_map = map(re.escape, list_to_exclude)
        # .loc replaces the deprecated .ix indexer
        melted_merge.loc[melted_merge['value2'].str.contains('|'.join(list_map)), ['value', 'value2']] = ''
    melted_merge[new_col] = melted_merge['value'] + ' ' + melted_merge['value2']
    return melted_merge
If I call:
parse_merge(names, 'First Names', 'Last Names', '|', 'Full Names', ['Simth'])
The data becomes:
   index  First Names         count  value   value2  Full Names
0  0      Jim|James|Tim       0      Jim     Simth   ''
1  1      Mickey|Mini         0      Mickey  Mouse   Mickey Mouse
2  2      Mike|Billy|Natasha  0      Mike    Mills   Mike Mills
Just not sure how to finish this out without any loops, or whether there's a more efficient / totally different approach.
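For reference, one way to finish from a melted frame like the one above is to drop the blanked-out rows, restore the original token order, and join each group with '\n'. This is a sketch over a hypothetical stand-in frame, not the real output of parse_merge:

```python
import pandas as pd

# Hypothetical stand-in for the melted frame: 'index' is the original row,
# 'count' is the token position, and excluded pairs are already blanked to ''.
melted = pd.DataFrame({
    'index': [0, 0, 0, 1, 1],
    'count': [0, 1, 2, 0, 1],
    'Full Names': ['', 'James Jacobs', 'Tim Turner', 'Mickey Mouse', 'Mini Mouse'],
})

# Drop the blanks, keep the original token order, then join per original row.
full = (melted[melted['Full Names'] != '']
        .sort_values(['index', 'count'])
        .groupby('index')['Full Names']
        .agg('\n'.join))

print(full.tolist())  # ['James Jacobs\nTim Turner', 'Mickey Mouse\nMini Mouse']
```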
Thanks for all the input!
Here is a condensed solution using pd.DataFrame.apply and some of Python's nice built-in features:
def combine_names(row):
    pairs = list(zip(row.iloc[0].split('|'), row.iloc[1].split('|')))
    return '\n'.join(' '.join(p) for p in pairs if p[1] != 'Simth')

df['Full Name'] = df.apply(combine_names, axis=1)
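To see it end-to-end, here is the sample frame run through that approach (positional access via .iloc, since plain integer indexing on a labeled Series is deprecated; 'Simth' is the spelling in the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'First Names': ['Jim|James|Tim', 'Mickey|Mini', 'Mike|Billy|Natasha'],
    'Last Names': ['Simth|Jacobs|Turner', 'Mouse|Mouse', 'Mills|McGill|Tsaka'],
})

def combine_names(row):
    # Pair each first name with the last name at the same position,
    # then drop any pair whose last name is 'Simth'.
    pairs = zip(row.iloc[0].split('|'), row.iloc[1].split('|'))
    return '\n'.join(' '.join(p) for p in pairs if p[1] != 'Simth')

df['Full Name'] = df.apply(combine_names, axis=1)
print(df['Full Name'].tolist())
```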
I really like @AlexG's solution - please use it.
Here is my attempt at a creative one-liner solution - it's absolutely perverse, so it should NOT be used; it's just for fun:
In [78]: df
Out[78]:
          First Names           Last Names
0       Jim|James|Tim  Simth|Jacobs|Turner
1         Mickey|Mini          Mouse|Mouse
2  Mike|Billy|Natasha   Mills|McGill|Tsaka
In [79]: df['Full Names'] = \
...: (df.stack()
...: .str.split(r'\|', expand=True)
...: .unstack(level=1)
...: .groupby(level=0, axis=1)
...: .apply(lambda x: x.add(' ').sum(axis=1).str.strip())
...: .replace([r'\w+\s+Simth'], [np.nan], regex=True)
...: .apply(lambda x: x.dropna().str.cat(sep='\n'), axis=1)
...: )
...:
In [80]: df
Out[80]:
          First Names           Last Names                               Full Names
0       Jim|James|Tim  Simth|Jacobs|Turner                 James Jacobs\nTim Turner
1         Mickey|Mini          Mouse|Mouse                 Mickey Mouse\nMini Mouse
2  Mike|Billy|Natasha   Mills|McGill|Tsaka  Mike Mills\nBilly McGill\nNatasha Tsaka
I've got a lot of comprehension
l = df.values.tolist()

['|'.join(n)
 for n in [[' '.join(z)
            for z in zip(*[s.split('|') for s in r])
            if z[1] != 'Simth']
           for r in l]]

['James Jacobs|Tim Turner',
 'Mickey Mouse|Mini Mouse',
 'Mike Mills|Billy McGill|Natasha Tsaka']
l = df.values.tolist()
df['Full Names'] = [
    '|'.join(n)
    for n in [[' '.join(z)
               for z in zip(*[s.split('|') for s in r])
               if z[1] != 'Simth']
              for r in l]]
df
Word play aside, this is pretty snappy over the sample data.
longer explanation

l

[['Jim|James|Tim', 'Simth|Jacobs|Turner'],
 ['Mickey|Mini', 'Mouse|Mouse'],
 ['Mike|Billy|Natasha', 'Mills|McGill|Tsaka']]

l is a list of lists, and I make extensive use of list comprehensions and iterables. zip(*[s.split('|') for s in r]) splits each string in a row and transposes the result into (first, last) name tuples. I use if z[1] != 'Simth' to filter out the Smiths; to exclude several names at once you could use z[1] not in list_of_names instead. Then ' '.join (that's actually a function) combines each tuple into first last, and '|'.join combines the sub-list of first last strings into first1 last1|first2 last2... so on and so forth.

The reason this is quicker is that comprehensions have been optimized to a great extent. The other solutions use apply, which is a generic looping structure that can only leverage fast looping under special circumstances (someone who knows more, please correct me if I'm wrong). Using lambda is definitely not one of those circumstances.
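A rough way to check both the equivalence and the speed claim (a sketch; the row counts are arbitrary and timings will vary by machine and pandas version):

```python
import timeit
import pandas as pd

# Repeat the sample rows to get a frame worth timing (hypothetical size).
df = pd.DataFrame({
    'First Names': ['Jim|James|Tim', 'Mike|Billy|Natasha'] * 500,
    'Last Names': ['Simth|Jacobs|Turner', 'Mills|McGill|Tsaka'] * 500,
})

def with_comprehension():
    # Pure-Python comprehension over a list of row lists.
    l = df.values.tolist()
    return ['|'.join(' '.join(z)
                     for z in zip(*[s.split('|') for s in r])
                     if z[1] != 'Simth')
            for r in l]

def with_apply():
    # Same logic routed through DataFrame.apply with a lambda.
    return df.apply(
        lambda row: '|'.join(' '.join(p)
                             for p in zip(row.iloc[0].split('|'),
                                          row.iloc[1].split('|'))
                             if p[1] != 'Simth'),
        axis=1).tolist()

assert with_comprehension() == with_apply()
print('comprehension:', timeit.timeit(with_comprehension, number=10))
print('apply:        ', timeit.timeit(with_apply, number=10))
```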