Is there a non-looping way to perform text searching in a data frame?

I have a huge list of ngrams to search. I want to know how often each one occurs in my historic dataframe, along with the mean of a numeric variable over the matching rows. I have a really ugly way of doing it that works, but because the list of ngrams is huge, it is really slow.

I am trying to avoid the loop, as I guess it is the main cause of the slowness, but I don't see how to do it.

Any ideas?

import pandas as pd

output = pd.DataFrame()

ngrams = ['ngram1', 'ngram2', 'ngram3', ..., 'ngram350000']

for i in ngrams:
    # Rows of historic_df whose text contains the current ngram
    mask = historic_df['text_variable'].str.contains(i, na=False)
    temp = pd.DataFrame(data={'ngram' : [i],
                              'count' : mask.sum(),
                              'mean' : historic_df.loc[mask, 'numeric_variable'].mean()})
    output = pd.concat([output, temp], axis=0)

Try DataFrame.apply()

import pandas as pd

def func(row):
    # Rows of historic_df whose text contains this row's ngram
    mask = historic_df['text_variable'].str.contains(row['ngram'], na=False)
    return pd.Series({'ngram' : row['ngram'],
                      'count' : mask.sum(),
                      'mean' : historic_df.loc[mask, 'numeric_variable'].mean()})

ngrams = pd.DataFrame({'ngram':['ngram1', 'ngram2', 'ngram3', ..., 'ngram350000']})

# axis=1 passes one row at a time; the returned Series become the rows of output
output = ngrams.apply(func, axis=1)
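
Note that apply still calls the function once per ngram, so every ngram triggers a full scan of the text column. One possible way around that, sketched below under a loud assumption, is to split each text into its word n-grams once, explode them into rows, keep only the n-grams of interest, and aggregate with groupby. The assumption is that the ngrams are space-separated word sequences and that matching on whole words is acceptable; str.contains matches arbitrary substrings (and treats the pattern as a regex by default), so the results can differ. The helper names word_ngrams and ngram_stats are hypothetical, not part of pandas.

```python
import pandas as pd

def word_ngrams(text, sizes):
    """Return the distinct word n-grams of the given sizes found in text."""
    if not isinstance(text, str):
        return []  # missing text contributes no n-grams
    words = text.split()
    found = set()  # a set, so each row counts an n-gram at most once
    for n in sizes:
        for start in range(len(words) - n + 1):
            found.add(" ".join(words[start:start + n]))
    return sorted(found)

def ngram_stats(historic_df, ngrams):
    """Count matching rows and average numeric_variable for each n-gram."""
    wanted = set(ngrams)
    # Only generate n-grams of the lengths that actually occur in the search list
    sizes = sorted({len(g.split()) for g in wanted})
    per_row = historic_df["text_variable"].map(lambda t: word_ngrams(t, sizes))
    exploded = (
        historic_df.assign(ngram=per_row)
        .explode("ngram")          # one row per (original row, n-gram) pair
        .reset_index(drop=True)
    )
    exploded = exploded[exploded["ngram"].isin(wanted)]
    stats = exploded.groupby("ngram")["numeric_variable"].agg(count="size", mean="mean")
    # Restore the order of the search list; n-grams never seen get count 0 and mean NaN
    stats = stats.reindex(pd.Index(list(ngrams), name="ngram"))
    stats = stats.fillna({"count": 0}).astype({"count": int})
    return stats.reset_index()

# Small made-up data set, for illustration only
historic_df = pd.DataFrame({
    "text_variable": ["the quick brown fox", "quick brown dog", None, "the lazy dog"],
    "numeric_variable": [1.0, 3.0, 5.0, 7.0],
})
output = ngram_stats(historic_df, ["quick brown", "dog", "cat"])
```

This version passes over the text once, so its cost grows with the amount of text rather than with the number of ngrams multiplied by the number of rows. The exploded frame can get large for long texts or many n-gram sizes, so memory use is worth checking on real data.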
