How to find unique words for each text file in a bundle of text files using python?

How do I find only the words that are unique to each text file? If a word is used frequently in the other files, it should be dropped.

This is with reference to http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html

I need a script that loops through all the text files in a folder and outputs the results in JSON format.

My code so far:

from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from pprint import pprint as pp
from glob import glob
from nltk import word_tokenize
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import CountVectorizer
import codecs
import jinja2
import json
import os


def get_raw_data():
    texts = []
    for x in range(1,95):
        file_name = str(x+1)+".txt"

        with codecs.open(file_name,"rU","utf-8") as myfile:
            data = myfile.read()

    texts.append(data)
    yield file_name, '\n'.join(texts)


class StemTokenizer(object):
    def __init__(self):
        self.ignore_set = {'footnote'}

    def __call__(self, doc):
        words = []
        for word in word_tokenize(doc):
            word = word.lower()
            w = wn.morphy(word)
            if w and len(w) > 1 and w not in self.ignore_set:
                words.append(w)
        return words


def process_text(counts, vectorizer, text, file_name, index):
    result = {w: counts[index][vectorizer.vocabulary_.get(w)]
              for w in vectorizer.get_feature_names()}

    result = {w: c for w, c in result.iteritems() if c > 4}
    normalizing_factor = max(c for c in result.itervalues())

    result = {w: c / normalizing_factor
              for w, c in result.iteritems()}

    return result


def main():
    data = list(get_raw_data())
    print('Data loaded')
    n = len(data)

    vectorizer = CountVectorizer(stop_words='english', min_df=(n-1) / n,tokenizer=StemTokenizer())

    counts = vectorizer.fit_transform(text for p, text in data).toarray()

    print('Vectorization done.')
    print (counts)

    for x in range(95):
        file_name = str(x+1)+".txt"

            # print (text)
        for i, (text) in enumerate(data):
            print (file_name)
            # print (text)
            with codecs.open(file_name,"rU","utf-8") as myfile:
                text = myfile.read()
            result = process_text(counts, vectorizer, text, file_name, i)
            print (result)  

if __name__ == '__main__':
    main()

It looks like you have a bunch of files named 1.txt, 2.txt, ... 95.txt, and you only want to find the words that appear in a single file. I would just collect all the words, count how many files each one appears in, and print out the singletons.

from collections import Counter
import re

fileids = [ str(n+1)+".txt" for n in range(95) ]
filecounts = Counter()

for fname in fileids:
    with open(fname) as fp:    # Add encoding if really needed
        text = fp.read().lower()
        words = re.split(r"\W+", text)  # Keep letters, drop the rest
        filecounts.update(set(words))

singletons = [ word for word in filecounts if filecounts[word] == 1 ]
print(" ".join(singletons))

Done. You don't need scikit, you don't need nltk, and you don't need a pile of IR algorithms. You could use the list of singletons in an IR algorithm, but that's another story.
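The question also asks for the result as JSON, grouped per file. Here is a minimal sketch that builds on the same counting idea as the code above; the unique_words.json output name is just an assumption.

import json
import re
from collections import Counter

fileids = [ str(n+1)+".txt" for n in range(95) ]
words_per_file = {}      # file name -> set of words seen in that file
filecounts = Counter()   # word -> number of files it appears in

for fname in fileids:
    with open(fname) as fp:              # add an encoding if really needed
        words = set(re.split(r"\W+", fp.read().lower()))
    words.discard("")                    # re.split can leave an empty string
    words_per_file[fname] = words
    filecounts.update(words)

# For each file, keep only the words that occur in no other file.
unique_per_file = { fname: sorted(w for w in words if filecounts[w] == 1)
                    for fname, words in words_per_file.items() }

with open("unique_words.json", "w") as out:   # assumed output file name
    json.dump(unique_per_file, out, indent=2)

Because each file contributes a set of words, filecounts counts a word at most once per file, so a count of 1 really does mean the word appears in exactly one file.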

def parseText( oFile, myWord ):

    # oFile: open text file to test
    # myWord: word we are looking for

    # List to store all instances of the word that we may find
    aWords = []

    # Get all lines into a list
    aLines = oFile.readlines()

    # Loop over the lines to test whether the word is found
    for sLine in aLines:

        # Split the line on whitespace, returns a list of words
        aLine = sLine.split()

        # Iterate over the words and test whether they match our word
        for sWord in aLine:
            # If it matches, append it to our list
            if sWord == myWord: aWords.append( sWord )

    return aWords



# Prompt user to know what word to search
myWord = raw_input( 'what word to search: ' )

# Open one of the text files (e.g. 1.txt from the question) and call the function
with open( '1.txt' ) as oFile:
    aWords = parseText( oFile, myWord )

# Check if the list has at least one element
if len( aWords ) < 1: print 'Word not found in file'
else: print str( len( aWords ) ) + ' instances of our word found in file'
