Extracting the sentences from one text file from another text file

I have two txt files. One is very large (txt file 1), with 15000 sentences, all broken down in a set format (sentence index, word, tag) per line. The other (txt file 2) has about 500 sentences broken down into the format (sentence index, word). I want to find the sentences from "txt file 2" that are in "txt file 1", but I also need to extract the tags.

format for txt file 1:

1   Flurazepam  O
2   thus    O
3   appears O
4   to  O
5   be  O
6   an  O
7   effective   O
8   hypnotic    O
9   drug    O
10  with    O

format for txt file 2:

1   More
2   importantly
3   ,
4   this
5   fusion
6   converted
7   a
8   less
9   effective
10  vaccine

Initially, I just tried something silly:

txtfile1=open("/Users/Desktop/Final.txt").read().split('\n')


with open ('/Users/Desktop/sentenceineed.txt','r') as txtfile2:

   whatineed=[]
   for line in txtfile2:
       for part in txtfile1:
           if line == part: 
               whatineed.append(part)

I'm getting nothing with this attempt, literally an empty list. Any suggestions would be great.

Since your first file is much larger than your second, you want to avoid putting the first file in memory all at once. Putting the second file in memory is no problem. A dictionary would be an ideal data structure for this, since you can quickly check whether a word exists in it and quickly retrieve its index.

So think of your problem this way: find all the words in your first text file that are also in your second text file. Here is an algorithm in pseudo-code. You do not specify how the "output" is to be done, so I just generically called it "storage." You also do not state whether either "index" of the word should be in the output, so I put both there; they would be trivial to remove if you want.

Initialize a dictionary to empty
for each line in text_file_2:
    parse the index and the word
    Add the word as the key and the index as the value to the dictionary
Initialize the storage for the final result
for each line in text_file_1:
    parse the index, word, and tag
    if the word exists in the dictionary:
        retrieve the index from the dictionary
        store the word, tag, and both indices

Here is code for that algorithm. I left it "expanded" rather than using comprehensions, for ease of understanding and debugging.

dictfile2 = dict()
with open('txtfile2.txt') as txtfile2:
    for line2 in txtfile2:
        index2, word2 = line2.strip().split()
        dictfile2[word2] = index2  # map each word of file 2 to its index
listresult = list()
with open('txtfile1.txt') as txtfile1:
    for line1 in txtfile1:
        index1, word1, tag1 = line1.strip().split()
        if word1 in dictfile2:  # the word also appears in file 2
            index2 = dictfile2[word1]
            listresult.append((word1, tag1, int(index1), int(index2)))

Here is the result of that code, given your example data, from print(listresult). You may want a different format for the result.

[('effective', 'O', 7, 9)]
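
For instance, if you would rather have the matches in a file than in a list, a minimal sketch could look like the following (the file name matches.tsv and the tab-separated layout are just placeholders):

# write one tab-separated line per match: file-1 index, file-2 index, word, tag
with open('matches.tsv', 'w') as outfile:
    for word, tag, index1, index2 in listresult:
        outfile.write("{}\t{}\t{}\t{}\n".format(index1, index2, word, tag))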

@Rory Daulton pointed it out correctly: since your first file may be too large to load completely into memory, you should iterate over it instead.

Here is my solution to the problem. You can make any necessary or desired changes for your implementation.

Program

dict_one = {} # Creating empty dictionary for Second File
textfile2 = open('textfile2', 'r') 

# Reading textfile2 line by line and adding index and word to dictionary
for line in textfile2:
    values = line.split(' ')
    dict_one[values[0].strip()] = values[1].strip()

textfile2.close()

outfile = open('output', 'w') # Opening file for output
textfile1 = open('textfile1', 'r') # Opening first file

# Reading first file line by line
for line in textfile1:
    values = line.split(' ') 
    word = values[1].strip() # Extracting word from the line

    # Matching if the word exists among the words of the second file (the dictionary's values)
    if word in dict_one.values():
        # If word exists then writing index, word and tag to the output file
        outfile.write("{} {} {}\n".format(values[0].strip(), values[1].strip(), values[2].strip()))

outfile.close()
textfile1.close()

Text File 1

1 Flurazepam O
2 thus O
3 appears I
4 to O
5 be O
6 an O
7 effective B
8 hypnotic B
9 drug O
10 less O
11 converted I
12 maxis O
13 fusion I
14 grave O
15 public O
16 mob I
17 havoc I
18 boss O
19 less B
20 diggy I

Text File 2

1 More
2 importantly
3 ,
4 this
5 fusion
6 converted
7 a
8 less
9 effective
10 vaccine

Output File

7 effective B
10 less O
11 converted I
13 fusion I
19 less B

Here, less appears twice with different tags, just as it does in the data file. Hope this is what you were looking for.
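
One design note: the test if word in dict_one.values() scans every stored word for each line of the first file. Since only the words of the second file matter for the match, a set gives the same output with constant-time lookups. A rough, self-contained variant of the same program, under the same single-space layout and file names:

# build a set of the words that occur in the second file
words_in_file2 = set()
with open('textfile2', 'r') as infile:
    for line in infile:
        parts = line.split(' ')
        words_in_file2.add(parts[1].strip())

# write index, word and tag for every line of the first file whose word is in the set
with open('textfile1', 'r') as infile, open('output', 'w') as outfile:
    for line in infile:
        parts = line.split(' ')
        if parts[1].strip() in words_in_file2:
            outfile.write("{} {} {}\n".format(parts[0].strip(), parts[1].strip(), parts[2].strip()))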

Assuming that the spacing in your text files remains consistent:

import re

#open your files
text_file1 = open('txt file 1.txt', 'r')
text_file2 = open('txt file 2.txt', 'r')
#save each line content in a list like l = [[id, word, tag]]
text_file_1_list = [l.strip('\n') for l in text_file1.readlines()]
text_file_1_list = [" ".join(re.split(r"\s+", l, flags=re.UNICODE)).split(' ') for l in text_file_1_list]
#similarly save all the words of text file 2 in a list
text_file_2_list = [l.strip('\n') for l in text_file2.readlines()]
text_file_2_list = [" ".join(re.split(r"\s+", l, flags=re.UNICODE)).split(' ')[1] for l in text_file_2_list]
print(text_file_2_list)  
# Now just simple search algo btw these two list
words_found = [[l[1], l[2]] for l in text_file_1_list if l[1] in text_file_2_list]
print(words_found)
# [['effective', 'O']]

I think this should work.
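
Side note: since str.split() with no argument already splits on any run of whitespace, the regex normalization above can be replaced by a plain split. A minimal equivalent, assuming the same file names:

#equivalent without re: split() with no argument handles runs of spaces/tabs
with open('txt file 1.txt', 'r') as f1, open('txt file 2.txt', 'r') as f2:
    file1_rows = [l.split() for l in f1 if l.strip()]      # [id, word, tag]
    file2_words = {l.split()[1] for l in f2 if l.strip()}  # just the words
words_found = [[row[1], row[2]] for row in file1_rows if row[1] in file2_words]
print(words_found)  # [['effective', 'O']]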

You can't find the occurrences of the sentences you are looking for because you are also using the index of the sentence when comparing. Thus a sentence in the second file is considered present in the first only when it happens to have the same index, like so:

#file1
3 make tag
7 split tag

#file2
4 make 
6 split

You are comparing them in the following way: if line == part. But obviously 4 make is not equal to 3 make tag, because the index is 3 instead of 4 and, in addition, the tag part makes the condition fail.
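
A quick, made-up illustration of the failing comparison (note also that lines read from the open file keep their trailing newline, while the entries produced by split('\n') do not, which on its own already prevents equality):

line = "9 effective\n"     # as yielded when iterating the second file
part = "7 effective O"     # as produced by splitting the first file on '\n'
print(line == part)                                      # False: index, tag and newline all differ
print(line.split(" ")[1].strip() == part.split(" ")[1])  # True: comparing only the words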

So by simply changing the condition you can retrieve the right sentences.

def selectSentence(string):
  """Based on the strings that you have in the example. 
  I assume that the elements are separated by one space char
  and that the words themselves contain no spaces."""
  elements = string.split(" ")
  return elements[1].strip()

txtfile1 = open("file1.txt").read().split('\n')
with open('file2.txt', 'r') as txtfile2:
    whatineed = []
    for line in txtfile2:
        for part in txtfile1:
            if part and selectSentence(line) == selectSentence(part):
                whatineed.append(part)

print(whatineed)

My approach我的方法

As @Rory Daulton pointed out, your file is very big, so it is a bad idea to load it all into memory. A better idea is to iterate over it, while storing the needed data from the little file (the second one).

txtfile2 = open("file2.txt").read().split('\n')
sentences_inf2 = {selectSentence(line) for line in txtfile2 if line} # set to remove duplicates
with open('file1.txt', 'r') as txtfile1:
    whatineed = []
    for line in txtfile1:
        if selectSentence(line) in sentences_inf2:
            whatineed.append(line.strip())

print(whatineed) #['7 effective O']
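
If you also need to know where each matched word sits in the second file (as the dictionary-based answer above reports), a dict keyed by word can replace the set; a rough sketch, reusing selectSentence and the same file names:

txtfile2 = open("file2.txt").read().split('\n')
# map each word of file 2 to its index there (later duplicates overwrite earlier ones)
index_in_f2 = {selectSentence(line): line.split(" ")[0] for line in txtfile2 if line}

whatineed = []
with open('file1.txt', 'r') as txtfile1:
    for line in txtfile1:
        word = selectSentence(line)
        if word in index_in_f2:
            whatineed.append("{} (index {} in file 2)".format(line.strip(), index_in_f2[word]))

print(whatineed)  # e.g. ['7 effective O (index 9 in file 2)']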
