How to integrate a function into a piece of code proposed in the book “Web Scraping with Python”

I am reading “Web Scraping with Python”. In Chapter 8, the author presents an n-gram example with the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
    input = re.sub('\n+', " ", input).lower()
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(
        urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
        'utf-8')
ngrams = ngrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1),
                      reverse=True)
print(sortedNGrams)

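It works just fine, but the result contains a bunch of words that carry no significance. To improve it, the author says that a new function can be used: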

def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it",
                   "i", "that", "for", "you", "he", "with", "on", "do", "say",
                   "this", "they", "is", "an", "at", "but", "we", "his",
                   "from", "that", "not", "by", "she", "or", "as", "what",
                   "go", "their", "can", "who", "get", "if", "would", "her",
                   "all", "my", "make", "about", "know", "will", "as", "up",
                   "one", "time", "has", "been", "there", "year", "so",
                   "think", "when", "which", "them", "some", "me", "people",
                   "take", "out", "into", "just", "see", "him", "your", "come",
                   "could", "now", "than", "like", "other", "how", "then",
                   "its", "our", "two", "more", "these", "want", "way", "look",
                   "first", "also", "new", "because", "day", "more", "use",
                   "no", "man", "find", "here", "thing", "give", "many",
                   "well"]
    for word in ngram:
        if word in commonWords:
            return True
    return False

What the author doesn't say is how to apply this function to get the result shown in the book:

('united states', 10), ('executive department', 4), ('general government', 4),
('called upon', 3), ('government should', 3), ('whole country', 3),
('mr jefferson', 3), ('chief magistrate', 3), ('same causes', 3),
('legislative body', 3)

Any ideas on how to do this?

Thanks in advance.

This seems to produce your output:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
    input = re.sub('\n+', " ", input).lower()
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):

        words = input[i:i+n]
        # Skip the n-gram if any of its words is "common".
        # isCommon() is the function from the question; define it above ngrams().
        if isCommon(words):
            continue

        ngramTemp = " ".join(words)
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(
        urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
        'utf-8')
ngrams = ngrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1),
                      reverse=True)

for ngram, cnt in sortedNGrams:
    if cnt >= 3:
        print(ngram, cnt)

This gives:

united states 10
executive department 4
general government 4
same causes 3
legislative body 3
chief magistrate 3
called upon 3
whole country 3
government should 3
mr jefferson 3
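
If you want to try the filtering step without downloading the speech, here is a minimal offline sketch of the same idea. It is not the book's code: the names `clean_input`, `is_common`, `count_ngrams`, `COMMON_WORDS` and the `sample` text are my own, the stop-word set is deliberately abbreviated, and I swapped the plain dict for `collections.Counter` and the list for a `frozenset` (membership tests on a set are faster than on a list).

```python
import re
import string
from collections import Counter

# Abbreviated stop-word set, for illustration only (an assumption);
# the book's full commonWords list can be substituted here.
COMMON_WORDS = frozenset([
    "the", "be", "and", "of", "a", "in", "to", "have", "it",
    "i", "that", "for", "is", "we", "by", "our",
])

def clean_input(text):
    """Lower-case the text, drop [n] markers and non-ASCII, strip punctuation."""
    text = re.sub(r'\n+', ' ', text).lower()
    text = re.sub(r'\[[0-9]*\]', '', text)
    text = re.sub(r' +', ' ', text)
    text = text.encode('utf-8').decode('ascii', 'ignore')
    words = []
    for item in text.split(' '):
        item = item.strip(string.punctuation)
        # Keep multi-letter words plus the one-letter words "a" and "i".
        if len(item) > 1 or item in ('a', 'i'):
            words.append(item)
    return words

def is_common(words):
    """Return True if any word of the n-gram is in the stop-word set."""
    return any(word in COMMON_WORDS for word in words)

def count_ngrams(text, n):
    """Count n-grams, skipping those that contain a common word."""
    words = clean_input(text)
    counts = Counter()
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        if is_common(gram):
            continue
        counts[' '.join(gram)] += 1
    return counts

# Hypothetical sample text standing in for the downloaded speech.
sample = ("The United States is a union. The United States "
          "government serves the people of the United States.")
print(count_ngrams(sample, 2).most_common(1))  # [('united states', 3)]
```

The filter is applied to the list of words before they are joined into a string, same as in the code above, so `is_common` sees individual words rather than the characters of the joined n-gram.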
