
How to search for a set of words in a text file?

I'm writing a project on extracting a semantic orientation from a review stored in a text file. I have a 400*2 array; each row contains a word and its weight. I want to check which of these words appear in the text file, and calculate the weight of the whole content.

My question is:

What is the most efficient way to do it? Should I search for each word separately, for example with a for loop? Do I get any benefit from storing the content of the text file in a string object?
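For reference, the approach described in the question (read the file into one string, then loop over the word list) might look like the sketch below. The names naive_weight, weighted_words and review are made up for illustration, the sample data is invented, and the sketch assumes each listed word contributes its weight at most once.

```python
def naive_weight(text, weighted_words):
    """Sum the weights of every listed word that occurs in the text."""
    total = 0.0
    for word, weight in weighted_words:
        # Substring test: "good" would also match inside "goodness".
        if word in text:
            total += weight
    return total


# Invented sample data standing in for the 400*2 array and the review file.
weighted_words = [("good", 1.0), ("bad", -1.0), ("great", 2.0)]
review = "a good film with a great cast"
naive_total = naive_weight(review, weighted_words)
print(naive_total)
```

Note that the in operator tests for substrings, not whole words, so splitting the text into tokens first may be closer to what is wanted.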

https://docs.python.org/3.6/library/mmap.html

This may work for you. You can use the find method of the mmap object.
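A minimal sketch of that idea, assuming the review is a UTF-8 text file: the file is memory-mapped read-only and each word is looked up with find. The function name words_found_in_file and the temporary demo file are illustrative assumptions, not part of the original answer.

```python
import mmap
import os
import tempfile


def words_found_in_file(path, words):
    """Return the subset of `words` that occur anywhere in the file at `path`."""
    found = []
    with open(path, "rb") as f:
        # Length 0 maps the whole file; note that an empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for word in words:
                # find() works on bytes and returns -1 when there is no match.
                if mm.find(word.encode("utf-8")) != -1:
                    found.append(word)
    return found


# Demo on a throwaway file with invented content.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a good film with a great cast")
    demo_path = tmp.name

demo_found = words_found_in_file(demo_path, ["good", "bad", "great"])
os.remove(demo_path)
print(demo_found)
```

As with a plain string search, find matches byte sequences anywhere in the file, so a short word can match inside a longer one.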

This may be out-of-the-box thinking, but if you don't care about the semantic or grammatical connection between the words:

  • Sort all words from the text by length.
  • Sort your array by length.

Then write a for-loop:

  • Call len() (length) on each word from the text.
  • Check it only against the words in your array that have the same length.

With some tinkering, this might give you a good performance boost over the "naive" search.
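The length-based filtering described above could be sketched as follows. Instead of sorting, this version groups the word list into buckets keyed by length, which gives the same "compare only against words of equal length" behaviour. The names build_length_buckets and weight_by_length are hypothetical, the sample data is invented, and the text is assumed to be split on whitespace with every occurrence of a word counted.

```python
from collections import defaultdict


def build_length_buckets(weighted_words):
    """Group (word, weight) pairs by the length of the word."""
    buckets = defaultdict(list)
    for word, weight in weighted_words:
        buckets[len(word)].append((word, weight))
    return buckets


def weight_by_length(text, buckets):
    """Sum weights, comparing each token only with words of the same length."""
    total = 0.0
    for token in text.split():
        # .get() avoids creating empty buckets for unseen lengths.
        for word, weight in buckets.get(len(token), []):
            if token == word:
                total += weight
                break
    return total


# Invented sample data standing in for the 400*2 array and the review text.
sample_words = [("good", 1.0), ("bad", -1.0), ("great", 2.0)]
sample_buckets = build_length_buckets(sample_words)
bucket_total = weight_by_length("a good film with a great cast", sample_buckets)
print(bucket_total)
```

Whether this beats the naive search depends on the data; it is worth measuring against a plain dictionary keyed by word, which makes the length filter unnecessary.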

Also look into search algorithms if you want an additional boost. For example, find the first word (of the 400) with, e.g., 6 letters, then go "down" the list until the first word with 5 letters comes up, then stop.

Alternatively, assuming your words don't change, you could build an index array holding the positions of the first and last 5-letter words (and analogously for every other length).
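One possible reading of the index idea, as a sketch: sort the word list by length (longest first, matching the "go down the list" description), record the first and last position of each length, and scan only that slice for a given token. The names build_length_index and lookup_weight are hypothetical and the sample data is invented.

```python
def build_length_index(weighted_words):
    """Sort by word length (descending) and record each length's index range."""
    ordered = sorted(weighted_words, key=lambda pair: len(pair[0]), reverse=True)
    index = {}
    for i, (word, _weight) in enumerate(ordered):
        n = len(word)
        first, _last = index.get(n, (i, i))
        index[n] = (first, i)  # inclusive (first, last) positions
    return ordered, index


def lookup_weight(token, ordered, index):
    """Return the weight of `token`, or 0.0 if it is not in the word list."""
    span = index.get(len(token))
    if span is None:
        return 0.0
    first, last = span
    for word, weight in ordered[first:last + 1]:
        if word == token:
            return weight
    return 0.0


# Invented sample data standing in for the 400*2 array.
demo_words = [("good", 1.0), ("bad", -1.0), ("great", 2.0), ("poor", -1.5)]
demo_ordered, demo_index = build_length_index(demo_words)
print(demo_index)
print(lookup_weight("poor", demo_ordered, demo_index))
```

The index only needs rebuilding when the word list changes, which fits the assumption that the words stay fixed.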


 