How to find phrases from a list in the values of a dictionary and create a dataframe with the phrases found and their count (duplicates should be counted)

Question

phrases = ['i am good', 'going to the market', 'eating cookies']

dictionary = {'http://www.firsturl.com': 'i am going to the market and tomorrow will be eating cookies',
             'http://www.secondurl.com': 'tomorrow is my birthday and i shall be', 
             'http://www.thirdurl.com': 'i am good and will go to sleep'}

If there is at least one match, the expected output is:

url                             phrasecount    phrase
http://www.firsturl.com         2              going to the market, eating cookies
http://www.thirdurl.com         1              i am good

If none of the 3 urls contains a match, then return just the first url with a zero count and a blank phrase. Expected output:

url                            phrasecount    phrase
http://www.firsturl.com        0

Answer 1

Set up the initial dataframe df from the corresponding dictionary:

import pandas as pd
df = pd.DataFrame({'urls': list(dictionary.keys()), 'strings': list(dictionary.values())})
# Alternation pattern that matches any one of the phrases
pattern = '|'.join(phrases)
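Joining the phrases as-is works for plain words like the ones in the question. If the phrases could contain regex metacharacters (for example `.` or `?`), a variation that escapes each phrase first may be safer. The following is only a sketch: `safe_pattern` and `text` are names made up for illustration, and sorting longest-first is an extra assumption so that a longer phrase is tried before a shorter one it starts with.

```python
import re

phrases = ['i am good', 'going to the market', 'eating cookies']

# Escape each phrase so every character is matched literally,
# and try longer phrases first (assumption: longest match is preferred).
safe_pattern = '|'.join(
    re.escape(p) for p in sorted(phrases, key=len, reverse=True)
)

# Hypothetical sample text with one phrase repeated
text = 'i am going to the market and will be eating cookies and eating cookies'

# findall returns every non-overlapping match, so repeats are counted
matches = re.findall(safe_pattern, text)
print(len(matches), matches)
```

Because `findall` returns one entry per occurrence, a phrase that appears twice in the same string contributes twice to the count.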

Process the dataframe:

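# List of all phrase occurrences per url (repeated phrases appear repeatedly)
s = df.pop('strings').str.findall(pattern)
df = df.assign(phrasecount=s.str.len(), phrase=s.map(', '.join))
# No match anywhere: keep only the first url; otherwise keep the urls with matches
df = df.drop_duplicates(subset='phrasecount') if df['phrasecount'].eq(0).all() else df[df['phrasecount'].ne(0)]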

Result:

# print(df)

                      urls  phrasecount                               phrase
0  http://www.firsturl.com            2  going to the market, eating cookies
2  http://www.thirdurl.com            1                            i am good
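To check both cases from the question (at least one match, and no match at all), the steps can be wrapped in a function. This is a minimal sketch, not the answer's exact code: the name `match_phrases` is hypothetical, `re.escape` is added as a precaution, and `head(1)` is used for the no-match case instead of `drop_duplicates`, which should give the same first row here.

```python
import re
import pandas as pd


def match_phrases(dictionary, phrases):
    """Rows for urls with matches, or the first url with a zero count."""
    # Escaped alternation pattern (escaping is an added assumption)
    pattern = '|'.join(re.escape(p) for p in phrases)
    df = pd.DataFrame({'urls': list(dictionary.keys()),
                       'strings': list(dictionary.values())})
    # One list of matches per url; repeated phrases appear repeatedly
    found = df.pop('strings').str.findall(pattern)
    df = df.assign(phrasecount=found.str.len(), phrase=found.map(', '.join))
    if df['phrasecount'].eq(0).all():
        # No match in any url: return only the first url
        return df.head(1)
    return df[df['phrasecount'].ne(0)]


phrases = ['i am good', 'going to the market', 'eating cookies']
dictionary = {
    'http://www.firsturl.com': 'i am going to the market and tomorrow will be eating cookies',
    'http://www.secondurl.com': 'tomorrow is my birthday and i shall be',
    'http://www.thirdurl.com': 'i am good and will go to sleep',
}

result = match_phrases(dictionary, phrases)
# Hypothetical phrase that occurs in none of the strings
no_match = match_phrases(dictionary, ['not present anywhere'])
print(result)
print(no_match)
```

With the data from the question, `result` should contain the first and third urls, and `no_match` should contain only the first url with a count of 0 and an empty phrase.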

solution1
0 ACCEPTED 2020-06-02 15:27:13
