
Find the frequency of three words from the lists co-occurring in combination in a given document

I have multiple .txt files on my desktop loaded as a data frame in Python. I am working in Python on the data frame, where 'text' is the name of the column I am interested in. Column 'text' consists of multiple .txt documents.

I also have three lists of words in mind; these are:

credit = ['borrow', 'lend']
policy = ['Fed', 'fund rate', 'zero']
trade = ['deficit', 'surplus']

My goal is to construct an index that measures, for each document separately, the frequency with which words from the three lists co-occur in combination within a given sentence of the text file. For example, if 'borrow', 'fund' and 'surplus' co-occurred in a given sentence, it would be counted as 1.

I know how to do this counting with a single word, as follows:

from collections import defaultdict
from pathlib import Path
import pandas as pd

my_dir_path = 'C:/Users/desktop'
results = defaultdict(list)
for file in Path(my_dir_path).iterdir():
    with open(file, "r") as file_open:
        results["file_name"].append(file.name)    # record the file name
        results["text"].append(file_open.read())  # record the file contents
df = pd.DataFrame(results)  # build the data frame once, after the loop

To get the frequency of the word 'policy' across documents I used this code:

df['policy'] = df['text'].apply(lambda x: len([word for word in x.split() if word == 'policy']))

How can I do it in Python? Thanks in advance for any help.

I'd be tempted to use regular expressions for doing the matching of words within sentences; using lookahead/lookbehind we could use something like:

(?<!\w)borrow(?!\w)

This would find "borrow" in "can I borrow that" and in "will borrow." but not in "borrowing". I'm unsure what you actually want to do here, but I'd suggest learning how to use regular expressions, as they would allow you to express these options easily.
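As a quick sanity check of that pattern against the sample phrases above:

import re

pattern = re.compile(r'(?<!\w)borrow(?!\w)')
print(bool(pattern.search("can I borrow that")))  # True
print(bool(pattern.search("will borrow.")))       # True
print(bool(pattern.search("borrowing")))          # False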

To make the following code shorter, I define a function to compile a "word" into a regex object:

import re

def matcher(word):
    return re.compile(fr'(?<!\w){word}(?!\w)', re.IGNORECASE)

re_credit = [
    matcher('borrow'),
    matcher('fund'),
]
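The other two lists could be wrapped the same way (a sketch using the words from your policy and trade lists; a phrase like 'fund rate' is matched literally, spaces included):

re_policy = [
    matcher('Fed'),
    matcher('fund rate'),
    matcher('zero'),
]

re_trade = [
    matcher('deficit'),
    matcher('surplus'),
]

Note that the counting function below requires every regex in the list to match; to count sentences containing any one word from a list, you could instead join the words into a single alternation such as (?<!\w)(?:deficit|surplus)(?!\w).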

Next I write a function to split a string up into sentences so we can count co-occurrences of words:

from nltk.tokenize import sent_tokenize  # needs NLTK's tokenizer data, e.g. nltk.download('punkt')

def count_sentences_matching_words(text, regexes):
    # count the sentences in which every regex finds a match
    count = 0
    for sentence in sent_tokenize(text):
        if all(reg.search(sentence) for reg in regexes):
            count += 1
    return count

Next we can test it with some text:

para = "My goal is to construct the index that measures the frequency of any of the words from the three lists in combination in a given sentence in the text file by applying it for each document separately. For example if 'borrow', 'fund' and 'surplus' co-occurred in a given sentence, it will be counted as 1."

count_sentences_matching_words(para, re_credit)
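This should return 1: sent_tokenize splits para into two sentences, and only the second contains both "borrow" and "fund".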

If you wanted to use this with pandas you could do the obvious:

df['credit'] = df['text'].apply(lambda x: count_sentences_matching_words(x, re_credit))
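The same pattern extends to the other indexes (assuming the re_policy and re_trade lists sketched above):

df['policy'] = df['text'].apply(lambda x: count_sentences_matching_words(x, re_policy))
df['trade'] = df['text'].apply(lambda x: count_sentences_matching_words(x, re_trade))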

It's probably worth rearranging this code, e.g. doing the sentence tokenization just once per file (see the sketch below), but that would depend on more details than you've shared.
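A minimal sketch of that rearrangement, assuming the re_policy and re_trade lists above: each text is tokenized once and all three indexes are counted in a single pass.

def count_all_indexes(text, regex_lists):
    # regex_lists maps a column name to its list of compiled regexes
    counts = {name: 0 for name in regex_lists}
    for sentence in sent_tokenize(text):  # tokenize each document only once
        for name, regexes in regex_lists.items():
            if all(reg.search(sentence) for reg in regexes):
                counts[name] += 1
    return counts

regex_lists = {'credit': re_credit, 'policy': re_policy, 'trade': re_trade}
df[['credit', 'policy', 'trade']] = df['text'].apply(lambda x: pd.Series(count_all_indexes(x, regex_lists)))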
