How to find unique words for each text file in a bundle of text files using python?

How do I find only the words that are unique to each text file? If a word is used frequently in the other files, it should be dropped.

This is with reference to http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html

I need a script that loops through all the text files in a folder and outputs the results in JSON format.

My code so far:

from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from pprint import pprint as pp
from glob import glob
from nltk import word_tokenize
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import CountVectorizer
import codecs
import jinja2
import json
import os


def get_raw_data():
    texts = []
    for x in range(1,95):
        file_name = str(x+1)+".txt"

        with codecs.open(file_name,"rU","utf-8") as myfile:
            data = myfile.read()

    texts.append(data)
    yield file_name, '\n'.join(texts)


class StemTokenizer(object):
    def __init__(self):
        self.ignore_set = {'footnote'}

    def __call__(self, doc):
        words = []
        for word in word_tokenize(doc):
            word = word.lower()
            w = wn.morphy(word)
            if w and len(w) > 1 and w not in self.ignore_set:
                words.append(w)
        return words


def process_text(counts, vectorizer, text, file_name, index):
    result = {w: counts[index][vectorizer.vocabulary_.get(w)]
              for w in vectorizer.get_feature_names()}

    result = {w: c for w, c in result.iteritems() if c > 4}
    normalizing_factor = max(c for c in result.itervalues())

    result = {w: c / normalizing_factor
              for w, c in result.iteritems()}

    return result


def main():
    data = list(get_raw_data())
    print('Data loaded')
    n = len(data)

    vectorizer = CountVectorizer(stop_words='english', min_df=(n-1) / n,tokenizer=StemTokenizer())

    counts = vectorizer.fit_transform(text for p, text in data).toarray()

    print('Vectorization done.')
    print (counts)

    for x in range(95):
        file_name = str(x+1)+".txt"

            # print (text)
        for i, (text) in enumerate(data):
            print (file_name)
            # print (text)
            with codecs.open(file_name,"rU","utf-8") as myfile:
                text = myfile.read()
            result = process_text(counts, vectorizer, text, file_name, i)
            print (result)  

if __name__ == '__main__':
    main()

It looks like you have a bunch of files named 1.txt, 2.txt, ... 95.txt, and you only want to find the words that appear in a single file. I would just collect all the words, count how many files each one appears in, and print out the singletons.

from collections import Counter
import re

fileids = [ str(n+1)+".txt" for n in range(95) ]
filecounts = Counter()

for fname in fileids:
    with open(fname) as fp:    # Add encoding if really needed
        text = fp.read().lower()
        words = re.split(r"\W+", text)  # Keep letters, drop the rest
        filecounts.update(set(words))

singletons = [ word for word in filecounts if filecounts[word] == 1 ]
print(" ".join(singletons))

Done. You don't need scikit, you don't need nltk, and you don't need a pile of IR algorithms. You could use the list of singletons in an IR algorithm, but that's another story.
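The question also asks for the result as JSON, grouped per file. Here is a minimal sketch that builds on the same counting idea as the code above; the unique_words.json output name is just an assumption.

import json
import re
from collections import Counter

fileids = [ str(n+1)+".txt" for n in range(95) ]
words_per_file = {}      # file name -> set of words seen in that file
filecounts = Counter()   # word -> number of files it appears in

for fname in fileids:
    with open(fname) as fp:              # add an encoding if really needed
        words = set(re.split(r"\W+", fp.read().lower()))
    words.discard("")                    # re.split can leave an empty string
    words_per_file[fname] = words
    filecounts.update(words)

# For each file, keep only the words that occur in no other file.
unique_per_file = { fname: sorted(w for w in words if filecounts[w] == 1)
                    for fname, words in words_per_file.items() }

with open("unique_words.json", "w") as out:   # assumed output file name
    json.dump(unique_per_file, out, indent=2)

Because each file contributes a set of words, filecounts counts a word at most once per file, so a count of 1 really does mean the word appears in exactly one file.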

def parseText( oFile, myWord ):

    # oFile: open text file to test
    # myWord: word we are looking for

    # List to store all instances of the word that we may find
    aWords = []

    # Get all lines into a list
    aLines = oFile.readlines()

    # Loop over the lines to test whether the word is found
    for sLine in aLines:

        # Split the line on whitespace, returns a list of words
        aLine = sLine.split()

        # Iterate over the words and test whether they match our word
        for sWord in aLine:
            # If it matches, append it to our list
            if sWord == myWord: aWords.append( sWord )

    return aWords



# Prompt user to know what word to search
myWord = raw_input( 'what word to search: ' )

# Open one of the text files (e.g. 1.txt from the question) and call the function
with open( '1.txt' ) as oFile:
    aWords = parseText( oFile, myWord )

# Check if the list has at least one element
if len( aWords ) < 1: print 'Word not found in file'
else: print str( len( aWords ) ) + ' instances of our word found in file'
