Using a directory as input for tf-idf with Python `textblob`

Question

I am trying to modify this code (source found here) to loop through a directory of files instead of hard-coding the input.

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)


document1 = tb("""Today, the weather is 30 degrees in Celcius. It is really hot""")

document2 = tb("""I can't believe the traffic headed to the beach. It is really a circus out there.'""")

document3 = tb("""There are so many tolls on this road. I recommend taking the interstate.""")

bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100 
        print("\t{}, {}".format(word, round(score_weight, 5)))

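Instead of hard-coding each document, I would like to use the txt files in a directory as input.

For example, suppose I have a directory foo that contains three files: file1, file2 and file3.

File 1 contains what document1 contains, i.e.

File 1:

Today, the weather is 30 degrees in Celcius. It is really hot

File 2 contains what document2 contains, i.e.

I can't believe the traffic headed to the beach. It is really a circus out there.

File 3 contains what document3 contains, i.e.

There are so many tolls on this road. I recommend taking the interstate.

I had to use glob to get close to the result I want, and I came up with the following modification. It identifies the files correctly, but it does not process them individually the way the original code does: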

file_names = glob.glob("/path/to/foo/*")
files =  map(open,file_names)
documents = [file.read() for file in files]
[file.close() for file in files]


bloblist = [documents]
for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100 
        print("\t{}, {}".format(word, round(score_weight, 5)))

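How can I keep a separate set of scores for each file when using glob?

With the files in the directory as input, the desired result is the same as the output of the original code (truncated here to the top 3 words per document):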

Document 1
    Celcius, 3.37888
    30, 3.37888
    hot, 3.37888
Document 2
    there, 2.38509
    out, 2.38509
    headed, 2.38509
Document 3
    on, 3.11896
    this, 3.11896
    many, 3.11896
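
These figures can be checked by hand. Each listed word occurs once in its own document and in no other, so tf is 1 divided by the document's word count and idf is log(3 / (1 + 1)). The sketch below assumes word counts of 12, 17 and 13 for the three documents. The count of 17 for document 2 is inferred from the listed score (it implies that "can't" is counted as two tokens) and is not taken from the TextBlob documentation. expected_score is a hypothetical helper name.

```python
import math

def expected_score(word_count, n_docs=3, docs_containing=1):
    """tf-idf * 100 for a word that occurs once in a document of word_count words."""
    tf = 1 / word_count
    idf = math.log(n_docs / (1 + docs_containing))
    return round(tf * idf * 100, 5)

# Assumed word counts for documents 1, 2 and 3.
for number, word_count in enumerate([12, 17, 13], start=1):
    print("Document {}: {}".format(number, expected_score(word_count)))
```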

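A similar question here did not fully resolve the problem. How can I read all the files in to calculate the idf, while keeping them separate to calculate the full tf-idf?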

Answer 1

@AnnaBonazzi provided a code snippet here, https://gist.github.com/sloria/6407257:

import os, glob
folder = "/path/to/folder/"
os.chdir(folder)
files = glob.glob("*.txt") # Makes a list of all files in folder
bloblist = []
for file1 in files:
    with open (file1, 'r') as f:
        data = f.read() # Reads document content into a string
        document = tb(data.decode("utf-8")) # Makes TextBlob object
        bloblist.append(document)

I modified it (Python 3):

import os, glob
from textblob import TextBlob as tb

bloblist = []

def make_corpus(input_dir):
    """ Based on code snippet from https://gist.github.com/sloria/6407257 """

    global doc                              ## used outside this method
    os.chdir(input_dir)                     ## use the argument instead of a hard-coded folder
    files = glob.glob("*.*")                ## or "*.txt", etc.
    for doc in files:
        # print('doc:', doc)                ## prints filename (doc)
        with open (doc, 'r') as f:
            data = f.read()                 ## read document content into a string
            document = tb(data)             ## make TextBlob object
            bloblist.append(document)
    # print('bloblist:\n', bloblist)        ## copious output ...
    print('len(bloblist):', len(bloblist))


make_corpus('input')                        ## input directory 'input'

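Update 1:

Personally, I have had nothing but difficulty with the Python glob module, because I often (i) have filenames without extensions (e.g. 01), and (ii) want to recurse into nested directories.

At first glance, the "glob" approach looks like a simple solution. However, when iterating over the files returned by glob, I often ran into errors such as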

IsADirectoryError: [Errno 21] Is a directory: ...

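when the loop encounters a directory name (rather than a file name) returned by glob.

In my opinion, with a little extra effort, the following approach is more robust: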
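
One way to avoid that error while still using glob is to filter the matches with os.path.isfile. The following is a minimal sketch, assuming Python 3.5 or later (needed for the recursive `**` pattern). list_files is a hypothetical helper name, not part of the glob module.

```python
import glob
import os

def list_files(input_dir):
    """Return every regular file under input_dir, sorted, skipping directories."""
    # With recursive=True, "**" matches zero or more nested directories.
    pattern = os.path.join(input_dir, "**", "*")
    matches = glob.glob(pattern, recursive=True)
    # glob returns directory names too; keep only regular files.
    return sorted(path for path in matches if os.path.isfile(path))
```

The `*` pattern also matches filenames without an extension (such as 01), but it skips names that start with a dot.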

import os
bloblist = []

def make_corpus(input_dir):
    for root, subdirs, files in os.walk(input_dir):
        for filename in files:
            path = os.path.join(root, filename)   ## keep the path separate from the file object
            print('file:', path)
            with open(path) as f:
                for line in f:
                    # print(line, end='')
                    bloblist.append(line)
    # print('bloblist:\n', bloblist)
    print('len(bloblist):', len(bloblist), '\n')

make_corpus('input')       ## 'input' = input dir
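
Note that the loop above appends every line separately, so bloblist ends up with one entry per line rather than one per file. For the tf-idf in the question, each file has to remain a single document. Below is a minimal sketch of that variant, using plain strings so that it runs without textblob. read_documents is a hypothetical name, and the UTF-8 encoding is an assumption about the input files.

```python
import os

def read_documents(input_dir):
    """Walk input_dir and return a sorted list of (path, text) pairs, one per file."""
    documents = []
    for root, subdirs, files in os.walk(input_dir):
        for filename in files:
            path = os.path.join(root, filename)
            with open(path, encoding="utf-8") as f:   # assumes UTF-8 text files
                documents.append((path, f.read()))    # whole file = one document
    return sorted(documents)                          # same order on every run
```

Each text could then be wrapped for the tf-idf code, for example with bloblist = [tb(text) for path, text in read_documents('input')].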

Update 2:

One last approach (the Linux shell find command, adapted for Python 3):

import sh     ## pip install sh
from textblob import TextBlob as tb

def make_corpus(input_dir):
    '''find (here) matches filenames, excludes directory names'''

    corpus = []
    file_list = []
    #FILES = sh.find(input_dir, '-type', 'f', '-iname', '*.txt')    ## find all .txt files
    FILES = sh.find(input_dir, '-type', 'f', '-iname', '*')         ## find any file
    print('FILES:', FILES)                                          ## caveat: files in FILES are '\n'-terminated ...
    for filename in FILES:
        #print(filename, end='')
        # file_list.append(filename)                                ## when printed, each filename ends with '\n'
        filename = filename.rstrip('\n')                            ## ... this addresses that issue
        file_list.append(filename)
        with open(filename) as f:
            #print('file:', filename)
            # ----------------------------------------
            # for general use:
            #for line in f:
                #print(line)
                #corpus.append(line)
            # ----------------------------------------
            # for this particular example (Question, above):
            data = f.read()
            document = tb(data)
            corpus.append(document)
    print('file_list:', file_list)
    print('corpus length (documents):', len(corpus))

    with open('output/corpus', 'w') as f:                           ## write to file
        for document in corpus:
            f.write(str(document))                                  ## f.write() needs a str, not a TextBlob

Answer 2

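In the first code example you fill bloblist with the results of tb(); in the second you fill it with the inputs to tb() (plain strings).

Try replacing bloblist = [documents] with bloblist = map(tb, documents). In Python 3, use bloblist = list(map(tb, documents)) instead, because map returns a one-shot iterator there and bloblist is both iterated repeatedly and passed to len().

You can also sort the list of file names, like this: file_names = sorted(glob.glob("/path/to/foo/*")), so that the output of the two versions matches.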
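
Putting both suggestions together, a minimal end-to-end sketch might look like the following. To stay runnable without textblob, it uses a regular-expression tokenizer as a stand-in for TextBlob's .words, and it tests membership in the token list rather than in the raw text, so some scores can differ from the original code. tokenize and score_directory are hypothetical names, and the UTF-8 encoding is an assumption.

```python
import glob
import math
import os
import re

def tokenize(text):
    """Stand-in for TextBlob(text).words: runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)

def tfidf(word, doc, docs):
    """tf-idf of word in doc, where doc and docs hold token lists."""
    tf = doc.count(word) / len(doc)
    n_containing = sum(1 for other in docs if word in other)
    return tf * math.log(len(docs) / (1 + n_containing))

def score_directory(directory):
    """Return one {word: score} dict per file, in sorted file-name order."""
    file_names = sorted(p for p in glob.glob(os.path.join(directory, "*"))
                        if os.path.isfile(p))       # skip sub-directories
    documents = []
    for name in file_names:
        with open(name, encoding="utf-8") as f:     # assumes UTF-8 text
            documents.append(f.read())
    # One tokenized document per file: the counterpart of [tb(d) for d in documents].
    docs = [tokenize(text) for text in documents]
    return [{word: tfidf(word, doc, docs) for word in doc} for doc in docs]
```

The idf is computed over all files together, while each returned dict holds the scores for a single file, which is the separation the question asks for.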

Answer 3

I am not sure exactly what you are trying to achieve. You could keep an array and append the results to it:

scores = []
bloblist = [documents]
for i, blob in enumerate(bloblist):
    # ... do your evaluation ...
    scores.append(score_weight)

print(scores)
