How to get the original sentence from a text file by knowing the offset of a word in Python?
I am new to Python and I wonder if there is an efficient way to find the original sentence in a text file by knowing the offset of a word. Suppose that I have a test.txt file like this:
test.txt
Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
Suppose that I know that the offset of the word "wheat" is [13, 18].
My code looks like this:
import nltk
from nltk.tokenize import word_tokenize

with open("test.txt") as f:
    list_phrase = f.readlines()
    f.seek(0)
    contents = f.read()

for index, phrase in enumerate(list_phrase):
    j = word_tokenize(phrase)
    if contents[13:18] in j:
        print(list_phrase[index])
The output of my code prints both sentences, i.e. "Ceci est une wheat phrase corn." and "This is the third wheat word."
How can I detect exactly which sentence really contains a word, knowing its offset?
Note that the offset I am considering runs continuously across sentences (two sentences in this case). For example, the offset of the word "barley" should be [61, 67].
The desired output of the print above should be:
Ceci est une wheat phrase corn.
since we know that its offset is [13, 18].
Any help with this would be much appreciated. Thank you so much!
If you are looking for raw speed then the standard library is probably the best approach to take.
# Generate a large text file with 10,000,001 lines.
with open('very-big.txt', 'w') as file:
    for _ in range(10000000):
        file.write("All work and no play makes Jack a dull boy.\n")
    file.write("Finally we get to the line containing the word 'wheat'.\n")
Given the search_word and its offset in the line we're looking for, we can calculate the limit for the string comparison.
search_word = 'wheat'
offset = 48
limit = offset + len(search_word)
The simplest approach is to iterate over the enumerated lines of text and perform a string comparison on each line.
with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')
The runtime for this solution is 155 ms on a 2012 Mac mini (2.3 GHz i7 CPU). That seems pretty fast for processing 10,000,001 lines, but it can be improved upon by checking the length of the text before attempting the string comparison.
with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (len(text) >= limit) and (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')
The runtime for the improved solution is 71 ms on the same computer. It's a significant improvement, but of course mileage will vary depending on the text file.
Generated output:
Line 10000001: "Finally we get to the line containing the word 'wheat'."
EDIT: Including file offset information
with open('very-big.txt', 'r') as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if line_length >= limit and (text[offset:limit] == search_word):
            print(f'[{file_offset + offset}, {file_offset + limit}] Line {line}: "{text.strip()}"')
        file_offset += line_length
Sample output:
[430000048, 430000053] Line 10000001: "Finally we get to the line containing the word 'wheat'."
Encore une fois
This code checks whether the known offset of the text lies between the offset of the start of the current line and the offset of the end of that line. The text found at the offset is also verified.
import io

long_string = """Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
"""

search_word = 'barley'
known_offset = 61
limit = known_offset + len(search_word)

# Use the multi-line string defined above as file input
with io.StringIO(long_string) as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if file_offset < known_offset < (file_offset + line_length) \
                and (text[(known_offset-file_offset):(limit-file_offset)] == search_word):
            print(f'[{known_offset},{limit}]\nLine: {line}\n{text}')
        file_offset += line_length
Output:
[61,67]
Line: 2
Ceci est une deuxième phrase barley.
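A variation on the same idea, not part of the answer above: if many word offsets must be resolved against the same text, the starting offset of every line can be precomputed once and then binary-searched with the standard-library bisect module. This is a minimal sketch; line_starts and line_for_offset are illustrative names.

```python
import bisect

long_string = """Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
"""

# Offset at which each line starts, computed once up front.
lines = long_string.splitlines(keepends=True)
line_starts = [0]
for text in lines:
    line_starts.append(line_starts[-1] + len(text))

def line_for_offset(offset):
    # bisect_right finds the first line start greater than the offset;
    # the line containing the offset is the one just before it.
    return bisect.bisect_right(line_starts, offset) - 1

print(lines[line_for_offset(61)].strip())  # the line containing "barley"
```

Each lookup is then O(log n) in the number of lines instead of a full scan, at the cost of one initial pass over the text.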
If you already know the position of the word, tokenizing is not what you want to do. By tokenizing, you change the sequence (for which you know the position) into a list of words, where you don't know which element is your word.
Therefore, you should leave the phrase as it is and just compare the relevant part of the phrase with your word:
with open("test.txt") as f:
    list_phrase = f.readlines()
    f.seek(0)
    contents = f.read()

for index, phrase in enumerate(list_phrase):
    if phrase[13:18].lower() == "wheat":  ## .lower() is only necessary if the word might be in upper case.
        print(list_phrase[index])
This would only return the sentences where wheat is at the position [13:18]. All other occurrences of wheat would not be recognized.
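For completeness, when the whole file has already been read into one string (like contents in the question), the sentence containing a known file-wide offset can also be recovered directly with str.rfind and str.find, with no per-line loop at all. This is a minimal sketch under that assumption; sentence_at is a hypothetical helper name, not from any answer above.

```python
contents = ("Ceci est une wheat phrase corn.\n"
            "Ceci est une deuxième phrase barley.\n"
            "This is the third wheat word.\n")

def sentence_at(contents, offset, word):
    # Verify the word really sits at the claimed offset.
    if contents[offset:offset + len(word)] != word:
        return None
    # Nearest newline before the offset; rfind returns -1 for the
    # first line, so +1 conveniently yields start index 0.
    start = contents.rfind('\n', 0, offset) + 1
    # Nearest newline at or after the offset, or end of string.
    end = contents.find('\n', offset)
    if end == -1:
        end = len(contents)
    return contents[start:end]

print(sentence_at(contents, 13, 'wheat'))   # Ceci est une wheat phrase corn.
print(sentence_at(contents, 61, 'barley'))  # Ceci est une deuxième phrase barley.
```

Because the offsets in the question are file-wide, this avoids tracking a running file_offset by hand: the enclosing line boundaries are looked up directly around the given offset.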