phrases = ['i am good', 'going to the market', 'eating cookies']
dictionary = {'http://www.firsturl.com': 'i am going to the market and tomorrow will be eating cookies',
'http://www.secondurl.com': 'tomorrow is my birthday and i shall be',
'http://www.thirdurl.com': 'i am good and will go to sleep'}
If there is at least one match, the expected output is:
url phrasecount phrase
http://www.firsturl.com 2 going to the market, eating cookies
http://www.thirdurl.com 1 i am good
If none of the 3 URLs contains a match, return just the first URL with a zero count and a blank phrase. Expected output:
url phrasecount phrase
http://www.firsturl.com 0
Set up the initial dataframe df from the corresponding dictionary:
import pandas as pd
df = pd.DataFrame({'urls': list(dictionary.keys()), 'strings': list(dictionary.values())})
# Build one regex alternation that matches any of the phrases
pattern = '|'.join(phrases)
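The pattern above treats each phrase as a regular expression. That is fine for the sample phrases, but if a phrase could contain regex metacharacters such as `+` or `.` (an assumption; none of the phrases in the question do), escaping each one first keeps the match literal. A minimal sketch, where `'c++'` and `tricky` are made-up illustrations:

```python
import re

phrases = ['i am good', 'going to the market', 'eating cookies']

# re.escape makes each phrase match literally before joining with '|'
pattern = '|'.join(map(re.escape, phrases))

text = 'i am going to the market and tomorrow will be eating cookies'
matches = re.findall(pattern, text)
print(matches)

# Hypothetical phrase with metacharacters: unescaped, 'c++' is not a valid regex
tricky = re.findall(re.escape('c++'), 'i like c++')
print(tricky)
```

The escaped pattern can be passed to `str.findall` in the same way as the unescaped one.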
Process the dataframe:
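# Find every phrase occurrence in each string (one list of matches per row)
s = df.pop('strings').str.findall(pattern)
# Count the matches and join them into a single comma-separated string
df = df.assign(phrasecount=s.str.len(), phrase=s.map(', '.join))
# If nothing matched anywhere, keep only the first row; otherwise keep the rows with matches
df = df.drop_duplicates(subset='phrasecount') if df['phrasecount'].eq(0).all() else df[df['phrasecount'].ne(0)]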
Result:
# print(df)
urls phrasecount phrase
0 http://www.firsturl.com 2 going to the market, eating cookies
2 http://www.thirdurl.com 1 i am good
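The result above covers only the case where at least one URL matches. To check the no-match case as well, the same steps can be wrapped in a function. This is a sketch rather than part of the original answer: `match_phrases` is a hypothetical name, `df.head(1)` stands in for the `drop_duplicates` call (both keep the first row when every count is zero), and the phrases are escaped with `re.escape` as a precaution.

```python
import re
import pandas as pd

def match_phrases(dictionary, phrases):
    """Hypothetical helper: one row per URL with matches, or the first URL with a zero count."""
    df = pd.DataFrame({'urls': list(dictionary.keys()),
                       'strings': list(dictionary.values())})
    pattern = '|'.join(map(re.escape, phrases))
    s = df.pop('strings').str.findall(pattern)
    df = df.assign(phrasecount=s.str.len(), phrase=s.map(', '.join))
    if df['phrasecount'].eq(0).all():
        # No URL matched: return only the first URL, with count 0 and an empty phrase
        return df.head(1)
    # Otherwise keep only the URLs with at least one match
    return df[df['phrasecount'].ne(0)]

phrases = ['i am good', 'going to the market', 'eating cookies']
dictionary = {
    'http://www.firsturl.com': 'i am going to the market and tomorrow will be eating cookies',
    'http://www.secondurl.com': 'tomorrow is my birthday and i shall be',
    'http://www.thirdurl.com': 'i am good and will go to sleep',
}

result = match_phrases(dictionary, phrases)
no_match = match_phrases(dictionary, ['not present anywhere'])
print(result)
print(no_match)
```

With the sample data, `result` holds the first and third URLs, and `no_match` holds only the first URL with a count of 0 and an empty phrase.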