How to integrate a function into the code presented in the book "Web Scraping with Python"

Question

I am reading "Web Scraping with Python". In Chapter 8, the author presents an n-gram example with the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
    input = re.sub('\n+', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(
        urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
        'utf-8')
ngrams = ngrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1),
                      reverse=True)
print(sortedNGrams)

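It works well, but the results contain a lot of n-grams made of common words that carry no meaning. To improve this, the author says a new function can be used: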

def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it",
                   "i", "that", "for", "you", "he", "with", "on", "do", "say",
                   "this", "they", "is", "an", "at", "but", "we", "his",
                   "from", "that", "not", "by", "she", "or", "as", "what",
                   "go", "their", "can", "who", "get", "if", "would", "her",
                   "all", "my", "make", "about", "know", "will", "as", "up",
                   "one", "time", "has", "been", "there", "year", "so",
                   "think", "when", "which", "them", "some", "me", "people",
                   "take", "out", "into", "just", "see", "him", "your", "come",
                   "could", "now", "than", "like", "other", "how", "then",
                   "its", "our", "two", "more", "these", "want", "way", "look",
                   "first", "also", "new", "because", "day", "more", "use",
                   "no", "man", "find", "here", "thing", "give", "many",
                   "well"]
    for word in ngram:
        if word in commonWords:
            return True
    return False
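
One detail about how isCommon is meant to be called: it loops over its argument, so it appears to expect the n-gram as a list of words rather than as the joined string. If it receives a string, the loop walks it character by character, and the single letters "a" and "i" in the word list will match almost any phrase. The sketch below illustrates this with a shortened word list; the names COMMON_WORDS and is_common are hypothetical and not from the book.

```python
# Shortened, hypothetical stand-in for the book's commonWords list.
# A set is used here only because membership tests on it are fast.
COMMON_WORDS = {"the", "of", "a", "i", "and", "to", "in"}

def is_common(ngram_words):
    """Return True if any element of ngram_words is a common word."""
    return any(word in COMMON_WORDS for word in ngram_words)

words = ["united", "states"]
print(is_common(words))            # False: neither word is in the set
print(is_common(" ".join(words)))  # True: the character "i" matches
```

So the check has to run on the slice of words before they are joined with " ".join(...).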

What the author does not say, however, is how to apply this function to get the results shown in the book:

('united states', 10), ('executive department', 4), ('general government', 4),
('called upon', 3), ('government should', 3), ('whole country', 3),
('mr jefferson', 3), ('chief magistrate', 3), ('same causes', 3),
('legislative body', 3)

Any ideas on how to do this?

Thanks in advance.

Answer 1

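This seems to produce your output, assuming the isCommon function from the question is defined in the same script: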

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
    input = re.sub('\n+', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):

        words = input[i:i+n]
        #check if any of the words forming the n-gram is "common"
        if isCommon(words): continue

        ngramTemp = " ".join(words)
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(
        urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
        'utf-8')
ngrams = ngrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1),
                      reverse=True)

for ngram, cnt in sortedNGrams:
    if cnt >= 3:
        print(ngram, cnt)

This gives:

united states 10
executive department 4
general government 4
same causes 3
legislative body 3
chief magistrate 3
called upon 3
whole country 3
government should 3
mr jefferson 3

Solution 1
0 Accepted 2017-03-04 17:19:24
