Creating a (random) sample of n words from text files with Python

Question

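For my PhD project I am evaluating all existing Named Entity Recognition taggers for Dutch. To check the precision and recall of those taggers, I want to manually annotate all named entities in a random sample from my corpus. That manually annotated sample will serve as the "gold standard" against which I will compare the results of the different taggers.

My corpus consists of 170 Dutch novels. I am writing a Python script to generate a random sample of a specific number of words for each novel (which I will annotate afterwards). All novels will be stored in the same directory. The following script is meant to generate a random sample of n lines for each novel in that directory: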

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)  

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
             # number of lines from txt file
             random_sample_input = random.sample(f.readlines(),100) 

    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR: 
            raise 


# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8') 
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()

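There are two problems with this code:

1. At the moment, I have put two novels (.txt files) in the directory, but the code only outputs a random sample for one of them.
2. At the moment, the code samples a random number of LINES from each .txt file, but I would prefer to sample a number of WORDS from each .txt file. Ideally, I would like to generate a sample of, say, the first or last 100 words of each of the 170 .txt files. In that case the sample would not be random at all, but so far I could not find a way to create a sample without using the random library.

Could anyone suggest how to solve both problems? I am still new to Python and to programming in general (I am a literary scholar), so I would be happy to learn about different approaches. Many thanks in advance!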
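
For the non-random variant mentioned in point 2, no call to `random` is needed, since list slicing is enough. The following is only a minimal sketch, assuming that whitespace-separated tokens count as words; the helper names `first_words` and `last_words` are illustrative and not part of the original script.

```python
def first_words(text, n=100):
    """Return the first n whitespace-separated words of text."""
    return text.split()[:n]


def last_words(text, n=100):
    """Return the last n whitespace-separated words of text."""
    words = text.split()
    # Guard n == 0: words[-0:] would return the whole list.
    return words[-n:] if n > 0 else []


sample_text = "Dit is de eerste zin. Dit is de tweede zin."
print(first_words(sample_text, 3))  # ['Dit', 'is', 'de']
print(last_words(sample_text, 3))   # ['de', 'tweede', 'zin.']
```

With a real novel, `text` would be the result of `f.read()` on the opened file.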

Answer 1

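A few suggestions:

Take random sentences, not words or lines. NE taggers work much better if the input consists of grammatical sentences, so you need to use a sentence splitter.

When you iterate over the files, random_sample_input contains lines from only the last file. You should move the block of code that writes the selected content to a file inside the for loop. You can then write the selected sentences either to one file or to separate files. For example: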

out = open("selected-sentences.txt", "w", encoding="utf-8")

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
             # tokenize() is the method the Punkt splitter provides (see the NLTK snippet below)
             sentences = sentence_splitter.tokenize(f.read())
             for sentence in random.sample(sentences, 100):
                 print(sentence, file=out)

    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR: 
            raise 

out.close()
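
One caveat with the loop above: `random.sample` raises a `ValueError` when a text has fewer items than the requested sample size. A small guard avoids that. This is a sketch only; `safe_sample` is a hypothetical helper name, and it assumes `items` is a list.

```python
import random


def safe_sample(items, k, rng=random):
    """Sample up to k items without replacement; short input yields fewer items."""
    return rng.sample(items, min(k, len(items)))


sentences = ["Dit is de eerste zin.", "Dit is de tweede zin."]
picked = safe_sample(sentences, 100)
print(len(picked))  # 2
```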

[Edit] Here is how you should use the NLTK sentence splitter:

import nltk.data
sentence_splitter = nltk.data.load("tokenizers/punkt/dutch.pickle")
text = "Dit is de eerste zin. Dit is de tweede zin."
print(sentence_splitter.tokenize(text))

This prints:

['Dit is de eerste zin.', 'Dit is de tweede zin.']

Note that you first need to download the Dutch tokenizer, using nltk.download() in an interactive console.

Answer 2

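You just have to split your lines into words, store them somewhere, and then, once you have read all of your files and stored their words, pick 100 with random.sample. That is what I did in the code below. However, I am not sure it can cope with 170 novels, since it will likely result in a lot of memory usage.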

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)
words = []

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
             # collect every whitespace-separated word of the file
             for line in f:
                 for word in line.split():
                     words.append(word)

    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR: 
            raise 

random_sample_input = random.sample(words, 100)

# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8') 
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()
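
If memory does turn out to be a problem, the full word list does not have to be kept at all. The sketch below uses reservoir sampling (Algorithm R), which is a different technique from the code above: it holds only k words at any time while still giving every word in the stream the same chance of being selected. The names `reservoir_sample` and `iter_words` are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


def iter_words(lines):
    """Yield whitespace-separated words from an iterable of lines."""
    for line in lines:
        yield from line.split()


lines = ["Dit is de eerste zin.\n", "Dit is de tweede zin.\n"]
print(reservoir_sample(iter_words(lines), 3, random.Random(0)))
```

An open file object is itself an iterable of lines, so `iter_words(f)` can be used directly on each novel.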

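In the code above, the more words a novel has, the more likely it is to be represented in the output sample. That may or may not be the desired behaviour. If you want each novel to carry the same weight, you can select, say, 100 words from each novel to add to the words variable, and then pick 100 words from there. This also has the side effect of using a lot less memory, since only one novel is stored at a time.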

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)
words = []

for text in files:
    try:
        novel = []
        with open(text, 'rt', encoding='utf-8') as f:
             # collect every whitespace-separated word of this novel
             for line in f:
                 for word in line.split():
                     novel.append(word)
             # extend (not append) so that words stays a flat list of strings
             words.extend(random.sample(novel, 100))


    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR: 
            raise 


random_sample_input = random.sample(words, 100)

# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8') 
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()
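
The same two-stage idea can be wrapped in a function that works on strings, which makes it easy to try out without touching the corpus. This is a sketch under assumptions: `two_stage_sample` is a hypothetical name, and the `seed` parameter is an addition that makes the draw repeatable, which may be useful when the sample is meant to serve as a gold standard.

```python
import random


def two_stage_sample(novels, per_novel, total, seed=None):
    """Draw per_novel words from each text, then total words from that pool."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    pool = []
    for text in novels:
        words = text.split()
        # min() guards against texts shorter than per_novel words
        pool.extend(rng.sample(words, min(per_novel, len(words))))
    return rng.sample(pool, min(total, len(pool)))


novels = [
    "Dit is de eerste zin. Dit is de tweede zin.",
    "Een korte tekst.",
]
print(two_stage_sample(novels, per_novel=3, total=4, seed=42))
```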

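A third version: this one works on sentences instead of words and keeps the punctuation. In addition, every book has the same "weight" in the final set of sentences, regardless of its size. Keep in mind that sentence detection is done by an algorithm that is very clever but not infallible.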

import random
import os
import glob
import sys
import errno
import nltk.data

path = '/home/clement/Documents/randomPythonScripts/data/*.txt'
files = glob.glob(path)

sentence_detector = nltk.data.load('tokenizers/punkt/dutch.pickle')
listOfSentences = []

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
            fullText = f.read()
        listOfSentences += [x.replace("\n", " ").replace("  "," ").strip() for x in random.sample(sentence_detector.tokenize(fullText), 30)]

    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR:
            raise

random_sample_input = random.sample(listOfSentences, 15)
print(random_sample_input)

# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8')
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()

Answer 3

This solves both problems:

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)

with open("randomsample", "w", encoding='utf-8') as random_sample_output:
    for text in files:
        try:
            with open(text, 'rt', encoding='utf-8') as f:
                # sample 10 words from the file
                random_sample_input = random.sample(f.read().split(), 10)

        except IOError as exc:
            # Do not fail if a directory is found, just ignore it.
            if exc.errno != errno.EISDIR:
                raise
            # Nothing was read for this entry, so skip the write step.
            continue

        # This block of code writes the result of the previous to a new file
        random_sample_input = map(lambda x: x + "\n", random_sample_input)
        random_sample_output.writelines(random_sample_input)
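
None of the answers shows it, but a middle ground between "first or last 100 words" and fully scattered words is a run of consecutive words starting at a random position, which keeps the context around each name. This is a sketch only; `random_window` is a hypothetical helper, and words are assumed to be whitespace-separated tokens.

```python
import random


def random_window(words, n, rng=None):
    """Return n consecutive words starting at a random offset."""
    rng = rng or random.Random()
    if len(words) <= n:
        # The text is too short: return all of it.
        return list(words)
    start = rng.randrange(len(words) - n + 1)
    return words[start:start + n]


words = "Dit is de eerste zin. Dit is de tweede zin.".split()
print(" ".join(random_window(words, 4, random.Random(1))))
```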

Solution 1
3 2016-10-14 10:19:26

Solution 2
2 Accepted 2016-10-14 10:16:40

Solution 3
1 2016-10-14 10:28:43
