Is there a non-looping way to perform text searching in a data frame?
I have a huge list of n-grams to search for. For each one, I want to know how often it occurs in my historical DataFrame and the mean of a numeric variable over the rows that contain it. I have a really ugly way of doing this that works, but because the list of n-grams is huge, it is really slow.
I would like to avoid the loop, since I guess it is the main cause of the speed problem, but I don't see how to do that.
Any ideas?
import pandas as pd

output = pd.DataFrame()
ngrams = ['ngram1', 'ngram2', 'ngram3', ..., 'ngram350000']
for i in ngrams:
    # Boolean mask of the rows whose text contains this n-gram
    mask = historic_df['text_variable'].str.contains(i, na=False)
    temp = pd.DataFrame(data={'ngram': [i],
                              'count': mask.sum(),
                              'mean': historic_df[mask]['numeric_variable'].mean()})
    output = pd.concat([output, temp], axis=0)
Try DataFrame.apply()
def func(row):
    # Boolean mask of the rows whose text contains this row's n-gram
    mask = historic_df['text_variable'].str.contains(row['ngram'], na=False)
    return pd.Series({'ngram': row['ngram'],
                      'count': mask.sum(),
                      'mean': historic_df.loc[mask, 'numeric_variable'].mean()})

ngrams = pd.DataFrame({'ngram': ['ngram1', 'ngram2', 'ngram3', ..., 'ngram350000']})
# axis=1 calls func once per row and collects the returned Series into a DataFrame
output = ngrams.apply(func, axis=1)
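Note that apply() still calls the function once per n-gram, so it mostly tidies the loop rather than removing it. One possible alternative, sketched below, turns the problem around: extract the n-grams from each text once, then filter and aggregate with isin() and groupby(). This is only a sketch under assumptions that are not in the question: the n-grams are whitespace-separated word n-grams, they all have the same length n, and the helper name ngram_stats and the toy data are made up for illustration. It also matches whole words, whereas str.contains() matches any substring, so results can differ.

```python
import pandas as pd


def ngram_stats(historic_df, ngrams, n):
    """Count rows and average numeric_variable per n-gram.

    Assumes word n-grams of a single length n, split on whitespace
    (an assumption for this sketch, not something stated in the question).
    """
    wanted = set(ngrams)
    tokens = historic_df['text_variable'].fillna('').str.split()
    # Distinct word n-grams per row, so a row counts at most once per n-gram
    row_ngrams = tokens.apply(
        lambda t: list({' '.join(t[k:k + n]) for k in range(len(t) - n + 1)})
    )
    long_df = (
        pd.DataFrame({'ngram': row_ngrams,
                      'numeric_variable': historic_df['numeric_variable']})
        .explode('ngram')          # one row per (text, n-gram) pair
        .reset_index(drop=True)
    )
    # Keep only the n-grams we are searching for, then aggregate in one pass
    long_df = long_df[long_df['ngram'].isin(wanted)]
    return (long_df.groupby('ngram')['numeric_variable']
                   .agg(count='size', mean='mean')
                   .reset_index())


# Toy data for illustration only
historic_df = pd.DataFrame({
    'text_variable': ['the red car', 'a red car stopped', None, 'blue car'],
    'numeric_variable': [1.0, 3.0, 5.0, 7.0],
})
result = ngram_stats(historic_df, ['red car', 'blue car', 'green car'], n=2)
print(result)
```

This still iterates over the rows once to build the n-grams, but it no longer scans the whole column once per n-gram. N-grams that never occur are simply absent from the result instead of appearing with a count of 0, and a list with mixed n-gram lengths would need one call per length followed by pd.concat().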