I have a dataframe that contains text-cleaned ads in one column and a very basic description of the same ads in another column. I also have term frequencies stored in a dictionary in 'keyword': frequency format.
The task is to purge from the lists in the df all terms whose frequency falls below a certain cutoff.
import pandas as pd
adset = {"ID": ["(1483785165, 2009)", "(1538280431, 2010)", "(1795044103, 2010)"],
         "Body": [['price', '#', 'bedrooms', '#', 'bathrooms', '#', 'garage'],
                  ['cindy', 'lavender', 'mid', 'state', 'realty'],
                  ['upgrades', 'galore', 'perfectly', 'maintained', 'home', 'formals']]}
df = pd.DataFrame(adset)
# Count how often each word occurs across all ads
keyword_dict = {}
for row in df['Body']:
    for word in row:
        if word in keyword_dict:
            keyword_dict[word] += 1
        else:
            keyword_dict[word] = 1
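For reference, the same counts can be built more compactly with `collections.Counter`. This is only a minimal sketch on a shortened stand-in for the data above (the two-row frame is made up for illustration); the result behaves like the plain dictionary.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Shortened stand-in for the real data (illustrative values only)
df = pd.DataFrame({
    "ID": ["(1483785165, 2009)", "(1538280431, 2010)"],
    "Body": [["price", "#", "bedrooms", "#"], ["cindy", "price"]],
})

# Flatten all word lists into one stream and count every term in a single pass
keyword_dict = Counter(chain.from_iterable(df["Body"]))
```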
And here is where I got stuck:
def remove_sparse_words_from_df(df, term_freq, cutoff=1):
    for row in df['Body']:
        for word in row:
            if term_freq[word] <= cutoff:
                pass  # stuck here: how do I drop this word from the row?
    return df
My whole approach might be off. Performance is a huge issue: the df has about 350k rows, and the lists in the "Body" column contain anywhere from a few hundred to a few thousand words. The reason for storing all the data in a pandas df instead of lists is that I would like to keep the ID column, so I can later connect my data to some other analysis I've already done on the ads.
Any help is greatly appreciated :)
IIUC, try:
1. explode to split the lists into individual rows,
2. groupby and transform to get the count of each keyword in the dataframe, and keep only the rows where the count is greater than the cutoff,
3. groupby and agg to get back the original DataFrame structure.

cutoff = 1
df = df.explode("Body")
output = df.loc[df.groupby("Body")["ID"].transform("size").gt(cutoff)].groupby("ID").agg(list)
>>> output
                         Body
ID
(1483785165, 2009)  [#, #, #]
Note: In your example "#" is the only "word" that occurs more than once.
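Note also that the explode approach drops every ID whose list ends up empty. If those IDs need to stay, for example to join back to other analyses, one alternative is a plain list comprehension over precomputed frequencies. This is a minimal sketch, assuming a word is kept when its count is strictly greater than the cutoff; the helper name `filter_sparse_words` is hypothetical, and it has not been timed on 350k rows, so it is worth benchmarking against the groupby version.

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    "ID": ["(1483785165, 2009)", "(1538280431, 2010)", "(1795044103, 2010)"],
    "Body": [
        ["price", "#", "bedrooms", "#", "bathrooms", "#", "garage"],
        ["cindy", "lavender", "mid", "state", "realty"],
        ["upgrades", "galore", "perfectly", "maintained", "home", "formals"],
    ],
})


def filter_sparse_words(df, term_freq, cutoff=1):
    """Hypothetical helper: copy df, keeping only words with frequency > cutoff."""
    out = df.copy()
    # Build a new list per row; the original lists are left untouched
    out["Body"] = [[w for w in row if term_freq[w] > cutoff] for row in df["Body"]]
    return out


# Term frequencies over the whole corpus (assumption: counted from the Body column)
term_freq = Counter(chain.from_iterable(df["Body"]))
result = filter_sparse_words(df, term_freq, cutoff=1)
```

Here `result` keeps all three IDs; rows with no surviving words simply hold an empty list.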