Find the most common words in a website

Question

I am new to Python. I have a simple program that counts how many times each word is used on a web page.

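import re
import urllib2
from collections import Counter
from bs4 import BeautifulSoup  # assumed: the question does not say which BeautifulSoup version is installed

opener = urllib2.build_opener()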
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = 'http://en.wikipedia.org/wiki/Albert_Einstein'
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p') #find paragraphs
for i in dem:    # loop for each para
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    print word_counts

The problem is that this gives me the word count paragraph by paragraph instead of the total word count for the whole page. What change is required? Also, if I want to filter out common articles like "a", "an", and "the", what code do I need to include?

Answer 1

Assuming you really want to find only words contained in paragraphs, and are happy with your regexp, this is the minimal change to get the total word count of the retrieved document:

soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p') #find paragraphs
word_counts = Counter()
for i in dem:    # loop for each para
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words]
    word_counts.update(cap_words)

print word_counts
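
The change works because Counter.update() adds to the counts already stored instead of replacing them, so a single Counter created before the loop accumulates across paragraphs. Here is a minimal sketch of that behaviour without fetching a page. It is written for Python 3 (print as a function), and the `paragraphs` list is made-up sample text standing in for the `i.text` values:

```python
import re
from collections import Counter

# Hypothetical stand-ins for the text of each <p> element.
paragraphs = [
    "Einstein developed the theory of relativity.",
    "The theory changed physics.",
]

word_counts = Counter()  # created once, outside the loop
for text in paragraphs:
    words = re.findall(r'\w+', text)
    # update() adds these counts to the totals already stored
    word_counts.update(word.upper() for word in words)

print(word_counts['THEORY'])  # 2: one occurrence from each paragraph
print(word_counts['THE'])     # 2
```

Creating the Counter inside the loop, as in the question, would reset the totals on every paragraph.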

To ignore common words, one method would be to define a frozenset of ignorable words:

word_counts = Counter()
stopwords = frozenset(('A', 'AN', 'THE'))
for i in dem:    # loop for each para
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words if word.upper() not in stopwords]
    word_counts.update(cap_words)
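
Since the title asks for the most common words, Counter.most_common(n) can return the n highest counts once the stopwords have been filtered out. The following is only a sketch: it assumes Python 3, uses made-up sample strings instead of a fetched page, and the helper name `count_words` is hypothetical rather than part of any library.

```python
import re
from collections import Counter

STOPWORDS = frozenset(('A', 'AN', 'THE'))

def count_words(paragraphs, stopwords=STOPWORDS):
    """Return total upper-cased word counts across all paragraphs, skipping stopwords."""
    counts = Counter()
    for text in paragraphs:
        upper_words = (w.upper() for w in re.findall(r'\w+', text))
        counts.update(w for w in upper_words if w not in stopwords)
    return counts

# Hypothetical sample text standing in for the paragraph contents.
sample = [
    "The cat saw a dog.",
    "The dog saw the cat and the dog ran.",
]
counts = count_words(sample)
print(counts.most_common(1))  # [('DOG', 3)]
```

With the real page, the list passed in would be the text of each paragraph, for example [i.text for i in dem].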

solution1
1 2013-07-28 02:19:59