
How to get the original sentence from a text file by knowing an offset of a word in python?

I am new to python and I wonder if there is an efficient way to find the original sentence from a text file by knowing an offset of a word. Suppose that I have a test.txt file like this:

test.txt

Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.

Suppose that I know the offset of the word "wheat", which is [13,18].

My code looks like this:

import nltk
from nltk.tokenize import word_tokenize

with open("test.txt") as f:
    list_phrase = f.readlines()
    f.seek(0)
    contents = f.read()
    for index, phrase in enumerate(list_phrase):
        j = word_tokenize(phrase)
        if contents[13:18] in j:
            print(list_phrase[index])

The output of my code prints both sentences, i.e. "Ceci est une wheat phrase corn." and "This is the third wheat word."

How can I detect exactly which phrase a word belongs to by knowing its offset?

Note that the offset is counted continuously across the whole file, spanning multiple sentences (2 sentences in this case). For example, the offset of the word "barley" should be [61,67].

The desired output of the print above should be:

Ceci est une wheat phrase corn.

As we know, its offset is [13,18].
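For reference, these global offsets can be reproduced with Python's str.find over the whole file contents (note that the accented è counts as a single character, so "barley" does land at 61):

```python
# The file contents of test.txt above, as one string.
contents = ("Ceci est une wheat phrase corn.\n"
            "Ceci est une deuxième phrase barley.\n"
            "This is the third wheat word.\n")

# str.find returns the global character offset of the first occurrence.
for word in ("wheat", "barley"):
    start = contents.find(word)
    print(word, [start, start + len(word)])
# wheat [13, 18]
# barley [61, 67]
```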

Any help with this would be much appreciated. Thank you so much!

If you are looking for raw speed then the standard library is probably the best approach to take.

# Generate a large text file with 10,000,001 lines.
with open('very-big.txt', 'w') as file:
    for _ in range(10000000):
        file.write("All work and no play makes Jack a dull boy.\n")
    file.write("Finally we get to the line containing the word 'wheat'.\n")

Given the search_word and its offset in the line we're looking for, we can calculate the limit for the string comparison.

search_word = 'wheat'
offset = 48
limit = offset + len(search_word)

The simplest approach is to iterate over the enumerated lines of text and perform a string comparison on each line.

with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')

The runtime for this solution is 155 ms on a 2012 Mac mini (2.3 GHz i7 CPU). That seems pretty fast for processing 10,000,001 lines, but it can be improved upon by checking the length of the text before attempting the string comparison.

with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (len(text) >= limit) and (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')

The runtime for the improved solution is 71 ms on the same computer. That is a significant improvement, but of course mileage will vary depending on the text file.

Generated output:

Line 10000001: "Finally we get to the line containing the word 'wheat'."

EDIT: Including file offset information

with open('very-big.txt', 'r') as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if line_length >= limit and (text[offset:limit] == search_word):
            print(f'[{file_offset + offset}, {file_offset + limit}] Line {line}: "{text.strip()}"')
        file_offset += line_length

Sample output:

[430000048, 430000053] Line 10000001: "Finally we get to the line containing the word 'wheat'."

Encore une fois (once again)

This code checks whether the known offset of the text falls between the offset of the start of the current line and the end of the line. The text found at the offset is also verified.

long_string = """Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
"""

import io

search_word = 'barley'
known_offset = 61
limit = known_offset + len(search_word)

# Use the multi-line string defined above as file input
with io.StringIO(long_string) as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if file_offset <= known_offset < (file_offset + line_length) \
        and (text[(known_offset-file_offset):(limit-file_offset)] == search_word):
            print(f'[{known_offset},{limit}]\nLine: {line}\n{text}')
        file_offset += line_length

Output:

[61,67]
Line: 2
Ceci est une deuxième phrase barley.
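If many offset lookups are needed against the same text, an alternative to scanning line by line (an addition, not part of the answer above) is to precompute each line's starting offset once and then locate the right line with the standard-library bisect module:

```python
import bisect

long_string = ("Ceci est une wheat phrase corn.\n"
               "Ceci est une deuxième phrase barley.\n"
               "This is the third wheat word.\n")

# Starting offset of every line, computed once up front.
lines = long_string.splitlines(keepends=True)
starts = [0]
for line in lines[:-1]:
    starts.append(starts[-1] + len(line))

def line_for_offset(offset):
    """Return (line_number, line_text) for the line containing this offset."""
    i = bisect.bisect_right(starts, offset) - 1
    return i + 1, lines[i]

line_no, text = line_for_offset(61)
print(line_no, text.strip())  # 2 Ceci est une deuxième phrase barley.
```

Each lookup is then a binary search (O(log n)) instead of a full pass over the file.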

If you already know the position of the word, tokenizing is not what you want to do. By tokenizing, you turn the sequence (whose positions you know) into a list of words, in which you no longer know which element is your word.

Therefore, you should leave the phrase intact and just compare the relevant slice of the phrase with your word:

with open("test.txt") as f:
    list_phrase = f.readlines()
    f.seek(0)
    contents = f.read()
    for index, phrase in enumerate(list_phrase):
        if phrase[13:18].lower() == "wheat": ## .lower() is only necessary if the word might be in upper case.
            print(list_phrase[index])

This only returns the sentences where wheat is at the position [13:18]. All other occurrences of wheat would not be recognized.
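Note that the hard-coded [13:18] slice only works when the word falls on the first line, because the question's offsets count characters across the whole file. A sketch that converts a global offset into a per-line slice (using io.StringIO here so the example is self-contained) might look like this:

```python
import io

text = ("Ceci est une wheat phrase corn.\n"
        "Ceci est une deuxième phrase barley.\n"
        "This is the third wheat word.\n")

search_word = "wheat"
offset = 13                       # global character offset in the file
limit = offset + len(search_word)

matches = []
with io.StringIO(text) as f:
    file_offset = 0               # offset of the start of the current line
    for phrase in f:
        local = offset - file_offset
        if 0 <= local <= len(phrase) - len(search_word) \
                and phrase[local:limit - file_offset] == search_word:
            matches.append(phrase.rstrip("\n"))
        file_offset += len(phrase)

print(matches)  # ['Ceci est une wheat phrase corn.']
```

Only the sentence whose line actually contains the global offset is reported, so the "wheat" on the third line no longer produces a false match.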
