繁体   English   中英

Python-计算与特定单词之间的距离不超过三(3)个单词的单词

[英]Python - count words not in a distance of three (3) words from specific words

我正在使用以下Python代码对count words in text (.txt) files进行count words in text (.txt) files ,检查文本文件中的任何单词是否属于two lists of words that I am considering (the word lists are .csv files, imagine these as "dictionaries")two lists of words that I am considering (the word lists are .csv files, imagine these as "dictionaries")中的任何一个two lists of words that I am considering (the word lists are .csv files, imagine these as "dictionaries")

import re
import collections
from collections import Counter
import csv
import sys

find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)').findall
wanted1 = set(find_words(open('word_list1.csv').read().lower()))
wanted2 = set(find_words(open('word_list2.csv').read().lower()))
   for f in sys.argv[1:]:
    cnt1 = cnt2 = cntWords = 0
    WANTED = 20
    with open(f) as inputfile:
        for line in inputfile:
            for word in find_words(line.lower()):
                myfile.write(word+ "\n")
                cntWords += 1
                if word in wanted1:
                    file1.write(word+ "\n")
                    cnt1 += 1
                if word in wanted2:
                    file2.write(word+ "\n")
                    cnt2 += 1   

At the moment, I am counting every word in the .txt file中的At the moment, I am counting every word in the .txt file它们恰好belong in the word lists wanted1 and wanted2.

我想要的是only when there is no negator in a distance of three words from these words.这些单词进行计数only when there is no negator in a distance of three words from these words.

否定词是any one of the following three words: no, not, never.

在这种情况下, if a negator is in the distance [-3,+3] words from the word I am examining, the word should not be counted even if it belongs in one of the word lists I am examining.

任何想法如何在我的代码中实现这一点? 谢谢。

范例1:

Word-2 Word-1 Word0 Word1 Word2 not Word3 Word4 Word5 Word6 Word7 > Word0到Word5的任何一个都不应该计数器,Word-2,Word-1,Word6,Word7应该计数(如果它们属于csv单词列表) )。 代替“不是”,它可以是“从不”或“否”。

never Word-2 Word-1 Word0 Word1 Word2never Word-2 Word-1 Word0 Word1 Word2 > Word-2 Word-1 Word0不计,Word1 Word2应计(如果它们属于csv单词列表)。 代替“从不”,它可以是“不是”或“否”。

我整理了一个小脚本来完成与您所要求的类似的操作。 我将.txt文件的内容实现为多行字符串,并对单词列表进行了硬编码,只是为了简化本示例。 您可以将这些位替换为文件打开/读取代码。 这可能是一个效率很低的解决方案,但这是在脑海中组织起来的最清晰的方法。 随时根据自己的喜好进行优化。

# -*- coding: utf-8 -*-
import pprint
from collections import defaultdict
from string import punctuation

# Get a word count for each word in a pair of wordlists that appear in a block of text.
# Exclude the appearance of a word from the count if any of the 3 words before or after the word
# in question are a member of the negator set (no, not, never).

def main():
    wordlist1 = ['mickey', 'pluto', 'goofy', 'minnie', 'donald']
    wordlist2 = ['bugs', 'daffy', 'elmer', 'foghorn', 'porky']

    # Whether to ensure the words in the wordlists are lowercase depends on your use-case
    wordlist1 = [element.lower() for element in wordlist1]
    wordlist2 = [element.lower() for element in wordlist2]
    mergedwordset = set(wordlist1 + wordlist2)
    negatorset = set(['no', 'not', 'never'])

    # Using collections.defaultdict here so that we can add a key with the value of 1 
    # if it doesn't already exist and increment the value of the key if it does exist.
    countincludingneg = defaultdict(int)
    countexcludingneg = defaultdict(int)

    # Using a multi-line string here just to simplify this example.
    # This will be parsed for the word count.
    # Adapt it to your own uses.
    # Text excerpts from wikipedia:
    # http://en.wikipedia.org/wiki/Pluto_(Disney)
    textblock = '''
Pluto, also called Pluto the Pup, is a cartoon character created in 1930 by Walt Disney Productions. He is a red-colored, medium-sized, short-haired dog with black ears. Unlike most Disney characters, Pluto is not anthropomorphic beyond some characteristics such as facial expression, though he did speak for a short portion of his history. He is Mickey Mouse's pet. Officially a mixed-breed dog, he made his debut as a bloodhound in the Mickey Mouse cartoon The Chain Gang. Together with Mickey Mouse, Minnie Mouse, Donald Duck, Daisy Duck, and Goofy, Pluto is one of the "Sensational Six"—the biggest stars in the Disney universe. Though all six are non-human animals, Pluto alone is not dressed as a human.
Pluto debuted in animated cartoons and appeared in 24 Mickey Mouse films before receiving his own series in 1937. All together Pluto appeared in 89 short films between 1930 and 1953. Several of these were nominated for an Academy Award, including The Pointer (1939), Squatter's Rights (1946), Pluto's Blue Note (1947), and Mickey and the Seal (1948). One of his films, Lend a Paw (1941), won the award in 1942. Because Pluto does not speak, his films generally rely on physical humor. This made Pluto a pioneering figure in character animation, which is expressing personality through animation rather than dialogue.
Like all of Pluto's co-stars, the dog has appeared extensively in comics over the years, first making an appearance in 1931. He returned to theatrical animation in 1990 with The Prince and the Pauper and has also appeared in several direct-to-video films. Pluto also appears in the television series Mickey Mouse Works (1999–2000), House of Mouse (2001–2003), and Mickey Mouse Clubhouse (2006–2013).
In 1998, Disney's copyright on Pluto, set to expire in several years, was extended by the passage of the Sonny Bono Copyright Term Extension Act. Disney, along with other studios, lobbied for passage of the act to preserve their copyrights on characters such as Pluto for 20 additional years.
Pluto first and most often appears in the Mickey Mouse series of cartoons. On rare occasions he is paired with Donald Duck ("Donald and Pluto", "Beach Picnic", "Window Cleaners", "The Eyes Have It", "Donald's Dog Laundry", & "Put Put Troubles").
The first cartoons to feature Pluto as a solo star were two Silly Symphonies, Just Dogs (1932) and Mother Pluto (1936). In 1937, Pluto appeared in Pluto's Quin-Puplets which was the first instalment of his own film series, then headlined Pluto the Pup. However, they were not produced on a regular basis until 1940, by which time the name of the series was shortened to Pluto.
His first comics appearance was in the Mickey Mouse daily strips in 1931 two months after the release of The Moose Hunt. Pluto Saves the Ship, a comic book published in 1942, was one of the first Disney comics prepared for publication outside newspaper strips. However, not counting a few cereal give-away mini-comics in 1947 and 1951, he did not have his own comics title until 1952.
In 1936 Pluto got an early title feature in a picture book under title "Mickey Mouse and Pluto the Pup" by Whitman Publishing.
Pluto runs his own neighborhood in Disney's Toontown Online. It's called the Brrrgh and it's always snowing there except during Halloween. During April Toons Week, a weekly event that is very silly, Pluto switches playgrounds with Minnie (all other characters do this as well). Pluto actually talks in Minnie's Melodyland.
Pluto has also appeared in the television series Mickey Mouse Works (1999–2000), Disney's House of Mouse (2001–2003) and Mickey Mouse Clubhouse (2006–present). Curiously enough, however, Pluto was the only standard Disney character not included when the whole gang was reunited for the 1983 featurette Mickey's Christmas Carol, although he did return in The Prince and the Pauper (1990) and Runaway Brain (1995). He also had a cameo in Who Framed Roger Rabbit (1988). In 1996, he made a cameo in the Quack Pack episode "The Really Mighty Ducks".
'''
    # Removing leading and trailing whitespace.
    # Removing new-lines so we can extend the look-aheads / look-behinds across lines.
    # Removing punctuation.
    # Setting all text to lowercase
    # Adjust to your use-cases
    textblock = textblock.strip().replace('\n', ' ').translate(None, punctuation).lower()
    textblockwords = textblock.split()

    # Construct a list of 7-gram (or less) word windows.
    # The window will center on each individual word of the textblock
    # and include the 3 words before and after its appearance.
    windows = n_gram_word_windows(textblockwords, 3)

    # Un-comment the following line if you'd like to see a representation of the n-gram word windows
    #pprint.pprint(windows)


    for windowdict in windows:
        for key, ngramlist in windowdict.iteritems():
            # Is the word a member of the wordlists?
            if key in mergedwordset:
                countincludingneg[key] += 1
                # Do the words preceeding or following appear in the set of negators?
                if len(negatorset.intersection(set(ngramlist))) == 0:
                    countexcludingneg[key] += 1
    print "Count including negators"
    pprint.pprint(countincludingneg)
    print "Count excluding negators"
    pprint.pprint(countexcludingneg)



# The idea here is to examine each word in the textblock and
# create a list containing the 3 words before the word, the word itself, and the 3 words following the word.
# This method will return a list of dictionaries.
# The dictionary will be comprised of the examined word as the key, and its n-gram word window as the value.
def n_gram_word_windows(textlist, lookaheadbehind=3):
    wordwindows = []
    for index, item in enumerate(textlist):
        intermediatelist = []
        if index < lookaheadbehind:
            for preceedingword in textlist[:index]:
                intermediatelist.append(preceedingword)
        else:
            for preceedingword in textlist[index-lookaheadbehind:index]:
                intermediatelist.append(preceedingword)
        if index < len(textlist):
            for lookaheadword in textlist[index:index+lookaheadbehind+1]:
                intermediatelist.append(lookaheadword)
        wordwindows.append({item: intermediatelist})
    return wordwindows


if __name__ == '__main__':
    main()

结果如下所示:

macbook:stackoverflow joeyoung$ python negatorparser.py 
Count including negators
defaultdict(<type 'int'>, {'mickey': 12, 'donald': 3, 'goofy': 1, 'minnie': 2, 'pluto': 27})
Count excluding negators
defaultdict(<type 'int'>, {'mickey': 12, 'donald': 3, 'goofy': 1, 'minnie': 2, 'pluto': 24})

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM