Python: Fetching and parsing text from html files

Question

I'm trying to work on a project about page ranking.

I want to make an index (dictionary) which looks like this:
file1.html -> [[cat, ate, food, drank, milk], [file2.html, file3.html]]
file2.html -> [[dog, barked, ran, away], [file1.html, file4.html]]

Fetching links is easy - look for anchor tags.

My question is: how do I fetch the text? The text in the HTML files is not enclosed in any tags such as <p>.

Thanks in advance for all the help

Answer 1

Use an HTML parser, something like BeautifulSoup.
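To make that concrete, here is a minimal sketch of pulling out text that no <p> tag encloses. It assumes the bs4 package (the current BeautifulSoup release) is installed; the sample page and the word-splitting regex are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page: the body text is not wrapped in <p> or any other tag.
html = """
<html><body>
The cat ate food and drank milk.
<a href="file2.html">second page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() collects every text node in the document, tagged or not.
# separator=" " keeps text from adjacent nodes from running together.
text = soup.get_text(separator=" ")

# Assumption: a "word" is a run of ASCII letters or digits.
words = re.findall(r"[a-z0-9]+", text.lower())
print(words)
```

Note that the link text ("second page") ends up in the word list too, since it is also a text node.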

Answer 2

If the text isn't enclosed in tags, is it really HTML?
As Amber says, you'll have an easier job of this using an HTML parser like BeautifulSoup.

The example below demonstrates a simple way to return the text inside tags.
As far as I know, this approach works for any tag.

>>> from bs4 import BeautifulSoup
>>> html = '''
... <div><a href="/link1">link1 contents</a></div>
... <div><a href="/link2">link2 contents</a></div>
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> # Print the text inside every anchor tag
>>> for anchor_tag in soup.find_all('a'):
...     print(anchor_tag.get_text())
... 
link1 contents
link2 contents

Apart from that, I imagine you'd want a dictionary counting how many times each term appears in an HTML document. defaultdict is good for that kind of thing:

>>> from collections import defaultdict
>>> d = defaultdict(int)  # missing keys start at 0
>>> for anchor_tag in soup.find_all('a'):
...     d[anchor_tag.get_text()] += 1
... 
>>> d
defaultdict(<class 'int'>, {'link1 contents': 1, 'link2 contents': 1})

Hopefully that gives you some ideas to run with. Come back and open another question if you run into other issues.
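For the index described in the question (file name -> [words, links]), one possible sketch follows. It uses only the standard library's html.parser instead of BeautifulSoup, so it runs without installing anything. The class name PageParser, the function build_index, and the in-memory sample pages are all hypothetical; real code would read the files from disk.

```python
import re
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects the words and outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Called for every run of text, whether or not a tag encloses it.
        # Assumption: a "word" is a run of ASCII letters or digits.
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def build_index(pages):
    """Map each file name to [words, links]; `pages` maps name -> HTML source."""
    index = {}
    for name, html in pages.items():
        parser = PageParser()
        parser.feed(html)
        parser.close()  # flush any text still buffered in the parser
        index[name] = [parser.words, parser.links]
    return index


# Hypothetical in-memory pages standing in for files read from disk.
pages = {
    "file1.html": 'cat ate food <a href="file2.html">next</a>',
    "file2.html": 'dog barked <a href="file1.html">back</a>',
}
index = build_index(pages)
print(index["file1.html"])
```

One caveat with this sketch: handle_data also receives the contents of <script> and <style> elements, so pages containing those would need extra filtering.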

solution1
1 2010-10-16 21:09:12

solution2
0 ACCEPTED 2010-10-16 22:31:21
