
Randomly Select Sentences from Text File, Find Corresponding ID Number

I am helping a professor of mine with a research project that involves pulling one thousand sentences randomly from a set of 20 text files. This is all data from the Corpus of Contemporary American English, if anyone is familiar with working with that. In these text files, the data is arranged like so:

##4000348 I must begin by saying this : In preparation for this lecture , I read ( or in some cases reread ) a number of the writings of Sidney Hook . I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook . But instead I found myself infused with a set of ideas that were relevant to a different setting , a different occasion .

##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College . That was the reason news of my appointment appeared in the Wall Street Journal and the National Review , which does n't usually happen to deans of Yale College , and does n't help them much when it does .


So, there are hundreds of paragraphs, each starting with a six digit number preceded by "##". That number corresponds to the source where the sentences were drawn from. I need to pull random sentences from these files, and also get the six digit number identifying their source with them. So ideally, I would get something like:

##4000348 I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook

##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College .

I have succeeded in getting random sentences from the files (with some help from the kind souls here at Stack Overflow), but I don't know how to get the number attached to them (for example, if I pull a sentence from the middle of a paragraph, how would I be able to get the number from the start of the paragraph?). Can anyone help me think of a way to do this? This is the code I have so far, which successfully extracts sentences.

# -*- coding: utf-8 -*-

import re
from random import sample

sentences = []
for i in range(1990,2013):
    with open('w_acad_{}.txt'.format(i)) as f:
        sentences += re.findall(r".*?[\.\!\?]+", f.read())

selected = sample(sentences, 2000)
with open('out.txt', 'w') as f:
    f.write('\n'.join(selected))

Perhaps you could use a regex to extract each paragraph along with its source id, and then extract sentences from the paragraph, similarly to how you're doing it at the moment. This should help you catch the paragraph:

# with open... etc.
for source_id, paragraph in re.findall(r"(##\d+)([^#]+)", f.read()):
    sentences += [(source_id, sentence) for sentence in re.findall(r".*?[\.\!\?]+", paragraph)]

Now, sentences should be a list of tuples like ('##123', 'A sentence.'), from which you can sample the same way as before.
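Putting the pieces together, a minimal sketch might look like the following. The helper name `labeled_sentences` and the inline sample text are illustrative assumptions, not part of the original code; the two regular expressions are the ones shown above. Note that `[^#]+` ends a paragraph at any `#` character, so a paragraph whose text itself contains `#` would be cut short.

```python
import re
from random import sample

def labeled_sentences(text):
    """Return (source_id, sentence) tuples for every sentence in text."""
    result = []
    # Each match is a '##<digits>' id followed by everything up to the next '#'.
    for source_id, paragraph in re.findall(r"(##\d+)([^#]+)", text):
        for sentence in re.findall(r".*?[\.\!\?]+", paragraph):
            result.append((source_id, sentence.strip()))
    return result

# Illustrative text laid out like the corpus files (not real corpus data).
text = (
    "##4000348 I must begin by saying this . I read them solely .\n"
    "\n"
    "##4000349 I would like to think . That was the reason .\n"
)
pairs = labeled_sentences(text)
selected = sample(pairs, 2)  # sample as before, now with ids attached
# Tuples cannot be joined directly, so format each one as a line first.
lines = ['{} {}'.format(source_id, sentence) for source_id, sentence in selected]
print('\n'.join(lines))
```

Because the list now holds tuples, the original `'\n'.join(selected)` would raise a `TypeError`; formatting each pair into a string first avoids that.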

In general, to avoid loading (potentially large) files into memory all at once, you could use a reservoir sampling algorithm: just pass it an iterator that yields sentences labeled with their ##-numbers:

#!/usr/bin/env python
import re
import nltk  # $ pip install nltk

def labeled(lines):
    """Return (number, paragraph) for the lines of one paragraph."""
    paragraph = ''.join(lines)
    numbers = re.findall(r'##([0-9]+)', paragraph)  # only ASCII-digits
    assert len(numbers) == 1  # only one ##-number per paragraph
    return int(numbers[0]), paragraph

def paragraphs(file):
    """Yield blank-line separated paragraphs labeled with ##-numbers."""
    lines = []
    for line in file:
        if line.strip():
            lines.append(line)
        elif lines:  # blank line, the end of a non-empty paragraph
            yield labeled(lines)
            lines = []
    if lines:  # last paragraph, if the file does not end with a blank line
        yield labeled(lines)

def sentences(filenames):
    for filename in filenames:
        with open(filename) as file:
            for number, paragraph in paragraphs(file):
                for sentence in nltk.sent_tokenize(paragraph):
                    yield number, sentence

filenames = ('w_acad_%d.txt' % n for n in range(1990, 2013))
print(reservoir_sample(sentences(filenames), 2000))

where reservoir_sample() is defined here.
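The linked definition is not reproduced on this page. As a stand-in, a minimal sketch of reservoir sampling (the classic "Algorithm R") could look like this; the signature `reservoir_sample(iterable, k)` is assumed from the call above and may differ from the linked version.

```python
import random

def reservoir_sample(iterable, k):
    """Return k items chosen uniformly at random from iterable in one pass.

    If the iterable has fewer than k items, all of them are returned.
    """
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability k / (index + 1).
            j = random.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Works on any iterator, e.g. a generator of (number, sentence) pairs.
pairs = ((n, 'sentence %d' % n) for n in range(10000))
print(reservoir_sample(pairs, 5))
```

Only `k` items are held in memory at any time, which is the point of using it with a generator rather than building the full list of sentences first.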

nltk.sent_tokenize() may be a more robust solution than the r".*?[\.\!\?]+" regular expression.
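To see why the regular expression is fragile, consider abbreviations: it treats every period as a sentence boundary. The sample sentence below is made up for illustration and is not from the corpus.

```python
import re

# Two real sentences, but four periods that are not sentence boundaries.
text = "Dr. Smith arrived at 5 p.m. yesterday. He left early."
pieces = [s.strip() for s in re.findall(r".*?[\.\!\?]+", text)]
for piece in pieces:
    print(repr(piece))
```

The regex splits this text into five fragments rather than two sentences. A trained tokenizer such as nltk.sent_tokenize() typically handles common abbreviations like these better, though it needs its tokenizer data downloaded before first use.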
