
Extracting all patterns from pandas data frame column (python3)

I'm using Jupyter Notebook (Python 3). I'm trying to extract keywords from a list out of a pandas DataFrame column. The list will contain around 50 keywords.

Example:

import pandas as pd
import re

rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']

pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)

pattern2 = "("+'|'.join(rgx_words1)+")"
re_patt2 = re.compile(pattern2)

data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]

# Create the pandas DataFrame
mydf = pd.DataFrame(data, columns = ['id', 'text'])

mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt,x['text']),axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2,x['text']),axis=1)

With re_patt I extract exact words and get the correct results: for id 1 the output is algaecide, algaecid, algaecides. With re_patt2 I would also like to match keywords embedded in longer strings such as 'ssssalgaecidllll', where the wanted output is 'algaecid'. However, re_patt2 returns algaecid, algaecid, algaecid for id 1, while my wanted output is algaecide, algaecid, algaecides. I would be grateful for any advice. Thank you in advance.

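You can change pattern2 to optionally match any characters other than whitespace and commas, [^\s,]*, to the left and to the right of the keyword alternation. Each match is then the whole comma-separated token that contains a keyword.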

pattern2 = r"[^\s,]*(?:"+'|'.join(rgx_words1)+r")[^\s,]*"

The code could look like

import pandas as pd
import re

rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']

pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)

pattern2 = r"[^\s,]*(?:"+'|'.join(rgx_words1)+r")[^\s,]*"  # raw strings avoid the invalid "\s" escape warning
re_patt2 = re.compile(pattern2)

data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
mydf = pd.DataFrame(data, columns = ['id', 'text'])

mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt, x['text']), axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2, x['text']), axis=1)

print(mydf)

Output

   id                                               text                            matches                           matches2
0   1  I, will, find, algaecide, dd, algaecid, algaec...  [algaecide, algaecid, algaecides]  [algaecide, algaecid, algaecides]
1   2                       fff, algaecid, dd, algaecide              [algaecid, algaecide]              [algaecid, algaecide]
2   3                       ssssalgaecidllll, algaecides                       [algaecides]     [ssssalgaecidllll, algaecides]
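
Note that for id 3 this pattern returns the whole token 'ssssalgaecidllll', whereas the question asks for 'algaecid' there. If only the keyword itself is wanted, one possible alternative (a sketch, not part of the answer above) is to keep the plain alternation from the question but sort the keywords longest first. Regex alternation tries the alternatives from left to right and takes the first one that matches, so with 'algaecid' listed first it shadows 'algaecide' and 'algaecides'. The names ordered_words, re_patt3, texts and matches3 are hypothetical, and re.escape is added on the assumption that every keyword is meant as literal text.

```python
import re

rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']

# Try longer keywords first so 'algaecides' is not cut short by 'algaecid'.
ordered_words = sorted(rgx_words1, key=len, reverse=True)

# re.escape treats each keyword as literal text
# (assumption: no keyword is intended to be a regex itself).
re_patt3 = re.compile("(" + "|".join(re.escape(w) for w in ordered_words) + ")")

# Same texts as in the question's DataFrame.
texts = [
    'I, will, find, algaecide, dd, algaecid, algaecides',
    'fff, algaecid, dd, algaecide',
    'ssssalgaecidllll, algaecides',
]

# One list of matched keywords per text.
matches3 = [re_patt3.findall(text) for text in texts]
for found in matches3:
    print(found)
```

With the DataFrame from the question, the same compiled pattern could be applied as mydf['matches3'] = mydf['text'].apply(re_patt3.findall).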


 