
What is the Python code to read only full words in a text file (lexical analysis to only detect whole words)?

I want to capture groups of text that form whole words in spoken language (a group of characters delimited by spaces counts as a word). For example, when I search for the word is in a text file, the is inside the word s is ter gets detected even when the file does not contain the standalone word is. I know a little about lexical analysis but have not been able to apply it to my project. Could someone provide Python code for this case?

This is the code I used, but it leads to the problem described above.

 words_to_find = ("test1", "test2", "test3")
 line = 0
 # User_Input.txt is a file saved on my computer, used as the input of the system
 with open("User_Input.txt", "r") as f:
     txt = f.readline()
     line += 1
     for word in words_to_find:
         if word in txt:
             print(F"Word: '{word}' found at line {line}, "
                   F"pos: {txt.index(word)}")
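The over-matching comes from Python's `in` operator on strings, which is a substring test rather than a whole-word test; a quick sketch (the sample sentences are made up):

```python
# `in` on strings is a substring test, not a whole-word test
print("is" in "my sister left")    # True: matches inside "sister"
print("is" in "her sibling left")  # False: no adjacent "i", "s"

# splitting on whitespace gives a crude whole-word check instead
print("is" in "my sister left".split())  # False: no token equals "is"
print("is" in "is that so".split())      # True
```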

You should use spacy to tokenize your text, since natural language tends to be tricky, what with all its exceptions and edge cases:

from spacy.lang.en import English

# Create a pipeline with the default tokenizer settings for English,
# including punctuation rules and exceptions
nlp = English()

with open("User_Input.txt", "r") as f:
    for line_no, txt_line in enumerate(f, start=1):
        for token in nlp(txt_line):
            print(f'Word {token.text} found at line {line_no}; pos: {token.idx}')

Alternatively, you can use textblob in the following way:

from textblob import TextBlob

with open("User_Input.txt", "r") as f:
    for line_no, txt_line in enumerate(f, start=1):
        blob = TextBlob(txt_line)  # TextBlob expects a string, not a list of lines
        for index, word in enumerate(blob.words):
            print(f'Word {word} found in position {index} at line {line_no}')

Use nltk to tokenize your text in a robust way. Also, keep in mind that the words in the text may be mixed case; convert them to lowercase before searching.

import nltk

txt = open("User_Input.txt").read()  # txt was undefined in the original
words = nltk.word_tokenize(txt.lower())

Regular expressions in general, and the \b term (meaning "word boundary") in particular, are how I separate words from other arbitrary characters. Here is an example:

import re
 
# words with arbitrary characters in between
data = """now is;  the time for, all-good-men
to come\t to the, aid of 
their... country"""

exp = re.compile(r"\b\w+")

pos = 0
while True:
    m = exp.search(data, pos)
    if not m:
        break
    print(m.group(0))
    pos = m.end(0)

Result:

now
is
the
time
for
all
good
men
to
come
to
the
aid
of
their
country
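Applied to the question's own example, the \b anchors keep a search for is from matching inside sister; a small sketch (the sample sentence is made up):

```python
import re

text = "my sister is a tester; is that so?"

# \b matches the empty boundary between a word character and a
# non-word character, so the "is" inside "sister" cannot match;
# re.escape guards search words that contain regex metacharacters
pattern = re.compile(rf"\b{re.escape('is')}\b")

positions = [m.start() for m in pattern.finditer(text)]
print(positions)  # character offsets of the standalone "is" only
```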

You can use regular expressions:

import re

words_to_find = ["test1", "test2", "test3"]  # converted this to a list to use `in`
line = 0
with open("User_Input.txt", "r") as f:
    txt = f.readline()
    line += 1
    rx = re.findall(r'(\w+)', txt)  # rx will be a list containing all the words in `txt`

    # you can iterate over every word in a line
    for word in rx:  # for every word in the RegEx list
        if word in words_to_find:
            print(word)

    # or you can iterate through your search case only
    # note that this will find only the first occurrence of each word in `words_to_find`
    for word in words_to_find:  # `test1`, `test2`, `test3`...
        if word in rx:  # if `test1` is present in this line's list of words...
            print(word)

What the code above does is apply the (\w+) RegEx to your text string and return a list of matches. In this case the RegEx matches any run of word characters delimited by non-word characters such as spaces.
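The pattern's behaviour can be checked in isolation; note that \w+ splits on any non-word character, not only spaces (the sample string reuses the earlier example):

```python
import re

# \w+ grabs maximal runs of word characters [A-Za-z0-9_],
# so punctuation and hyphens act as separators too
words = re.findall(r"\w+", "now is; the time for, all-good-men")
print(words)
```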

Useful resources: Debuggex to test regular expressions, Python's regular expression docs, and RegExr to learn more about RegEx.

If you are trying to find the words test1, test2, or test3 in the text file, you don't need to increment the line value manually. Assuming the text file contains each word on a separate line, the following code works:

words_to_find = ("test1", "test2", "test3")
with open("User_Input.txt", "r") as f:
    # enumerate tracks the line number reliably, even for duplicate lines
    for line_no, raw_line in enumerate(f, start=1):
        txt = raw_line.strip('\n')
        for word in words_to_find:
            if word in txt:
                print(F"Word: '{word}' found at line {line_no}, "
                      F"pos: {txt.index(word)}")

I don't know what position is supposed to represent here, though.
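One way to combine line tracking with strict whole-word matching, addressing the original question; a sketch, with a made-up word list and sample lines standing in for the question's file:

```python
import re

def find_whole_words(lines, words_to_find):
    """Yield (line_no, word, column) for whole-word matches only."""
    for line_no, line in enumerate(lines, start=1):
        for word in words_to_find:
            # \b keeps "test1" from matching inside "test11";
            # re.escape guards words containing regex metacharacters
            match = re.search(rf"\b{re.escape(word)}\b", line)
            if match:
                yield line_no, word, match.start()

sample = ["test1 and test11", "nothing here", "see test2!"]
for hit in find_whole_words(sample, ("test1", "test2", "test3")):
    print(hit)  # (line number, word, character offset in the line)
```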

I think simply putting spaces around the word in the string argument will do.
