
python: searching words in different files

I'm trying to write a script that collects the names of all files in a directory and then searches each of them for given words. Each time a word is found, the name of the file and the full line containing that word should be printed. Additionally, I want to print the number of times each word was found to a new file.

This is what I have so far:

import os

print(os.listdir('./texts'), '\n\n\n')

suchwort ={"computational":0,"linguistics":0,"processing":0,"chunking":0,"coreference":0,"html":0,"machine":0}
hitlist = './hits.txt'


with open(hitlist, 'a+') as hits:
   for elem in os.listdir('./texts'):
      with open(os.path.join("./texts",elem)) as fh:
         for line in fh:
            words = line.split(' ')
            print(elem, " : ",line)
            for n in words:
               if n in suchwort:
                  if n in suchwort.keys():
                     suchwort[n]+=1
                  else:
                     suchwort[n]=1
   for k in suchwort:
      print(k,":",suchwort[k],file=hits)

The result in the new file (hits.txt) is:

chunking : 0
machine : 9
html : 0
processing : 4
linguistics : 12
coreference : 1
computational : 12

However, the values seem to be wrong, because the word "html" appears in one of the files.

import multiprocessing as mp
import glob

def filesearcher(qIn, qOut):
    for fpath in iter(qIn.get, None):
        keywords = {"computational": {'count': 0, 'lines': []},
                    "linguistics": {'count': 0, 'lines': []},
                    "processing": {'count': 0, 'lines': []},
                    "chunking": {'count': 0, 'lines': []},
                    "coreference": {'count': 0, 'lines': []},
                    "html": {'count': 0, 'lines': []},
                    "machine": {'count': 0, 'lines': []}}

        with open(fpath) as infile:
            for line in infile:
                for word in line.split():
                    word = word.lower()
                    if word not in keywords: continue
                    keywords[word]['count'] += 1
                    keywords[word]['lines'].append(line)
        qOut.put((fpath, keywords))
    qOut.put(None)


def main():
    numProcs = 4  # fiddle to taste
    qIn, qOut = [mp.Queue() for _ in range(2)]
    procs = [mp.Process(target=filesearcher, args=(qIn, qOut)) for _ in range(numProcs)]
    for p in procs: p.start()
    for fpath in glob.glob('./texts/*'): qIn.put(fpath)
    for _ in procs: qIn.put(None)  # one sentinel per worker

    done = 0
    while done < numProcs:
        d = qOut.get()
        if d is None:
            done += 1
            continue
        fpath, stats = d
        print("showing results for", fpath)
        for word, info in stats.items():
            print(word, ":", info['count'])
            for line in info['lines']:
                print('\t', line)

    for p in procs: p.join()  # workers have already exited after their sentinel

if __name__ == '__main__':
    main()
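If multiprocessing is more than the task needs, the same per-file counting can be sketched in a single process with collections.Counter (the function name count_keywords is illustrative, not from the original code):

```python
import glob
from collections import Counter

KEYWORDS = {"computational", "linguistics", "processing",
            "chunking", "coreference", "html", "machine"}

def count_keywords(fpath):
    """Return a Counter of keyword occurrences in one file."""
    counts = Counter()
    with open(fpath) as infile:
        for line in infile:
            # split() with no argument splits on any whitespace,
            # so the trailing newline never sticks to a word
            for word in line.split():
                word = word.lower()
                if word in KEYWORDS:
                    counts[word] += 1
    return counts

for fpath in glob.glob('./texts/*'):
    print(fpath, dict(count_keywords(fpath)))
```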

The problem is caused by the way the file is iterated line by line. In the snippet below, each "line" keeps its trailing newline, so splitting on ' ' leaves the last word of every line with a newline character attached.

  with open(os.path.join("./texts",elem)) as fh:
     for line in fh:
        words = line.split(' ')

If you print the "repr" of the words:

print(repr(words))

You'd see that the last word contains the trailing newline,

['other', 'word\n']

instead of the expected:

['other', 'word']

To solve this problem you could use "strip" before processing each line:

line = line.strip() 

to remove trailing and leading whitespace from the string.
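Applied to the loop from the question, the fix could look like this (a sketch; count_in_file is an illustrative helper, not part of the original code):

```python
import os

# counts per keyword, as in the question
suchwort = {"computational": 0, "linguistics": 0, "processing": 0,
            "chunking": 0, "coreference": 0, "html": 0, "machine": 0}

def count_in_file(path, counts):
    """Update counts in place, stripping each line before splitting."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()          # remove the trailing newline
            for n in line.split(' '):
                if n in counts:
                    counts[n] += 1

if os.path.isdir('./texts'):
    for elem in os.listdir('./texts'):
        count_in_file(os.path.join('./texts', elem), suchwort)
```

Note that line.split() with no argument would work as well, since it splits on any whitespace and discards the newline by itself.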
