
Python: Fetching and parsing text from html files

I'm trying to work on a project about page ranking.

I want to make an index (dictionary) which looks like this:
file1.html -> [[cat, ate, food, drank, milk], [file2.html, file3.html]]
file2.html -> [[dog, barked, ran, away], [file1.html, file4.html]]

Fetching links is easy - look for anchor tags.

My question is: how do I fetch the text? The text in the HTML files is not enclosed within any tags like <p>.

Thanks in advance for all the help

Use an HTML parser, something like BeautifulSoup.

If the text isn't enclosed in tags, is it really HTML?
As Amber says, you'll have an easier time with an HTML parser such as BeautifulSoup.
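
Even when text is not wrapped in <p> or similar tags, a parser still reports it as text nodes. Here is a minimal sketch of that idea, assuming Python 3 and using the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The names TextAndLinkParser, extract_text_and_links and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class TextAndLinkParser(HTMLParser):
    """Collect text nodes and <a href> targets from one HTML document."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # stripped text chunks, in document order
        self.links = []        # href values of anchor tags
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called for every text node, whether or not it sits inside <p>.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_text_and_links(html):
    """Return (visible text, list of hrefs) for an HTML string."""
    parser = TextAndLinkParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical page whose text sits directly inside <body>.
sample = '<html><body>The cat ate food. <a href="file2.html">next</a> It drank milk.</body></html>'
text, links = extract_text_and_links(sample)
print(text)
print(links)
```

The anchor text ("next") ends up in the extracted text as well. Whether link labels should count as page text is a choice to make for your own index.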

The example below demonstrates a simple method for returning text within tags.
This method works for any tag AFAIK.

>>> from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)
>>> html = '''
... <div><a href="/link1">link1 contents</a></div>
... <div><a href="/link2">link2 contents</a></div>
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> for anchor_tag in soup.find_all('a'):
...   print(anchor_tag.contents[0])
... 
link1 contents
link2 contents

Beyond that, you will probably want a dictionary that counts how many times each term appears in an HTML document. defaultdict is good for that kind of thing:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for anchor_tag in soup.find_all('a'):
...   d[anchor_tag.contents[0]] += 1
... 
>>> d
defaultdict(<class 'int'>, {'link1 contents': 1, 'link2 contents': 1})
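
The same counting idea can be applied to individual words to build the index described in the question. The following is only a sketch, assuming Python 3 and assuming the text and links have already been extracted for each file. The pages data, the tokenize helper and the regular expression are illustrative choices, not part of the original answer.

```python
import re
from collections import Counter

# Hypothetical pre-extracted data: file name -> (page text, outgoing links).
pages = {
    "file1.html": ("The cat ate food. The cat drank milk.", ["file2.html", "file3.html"]),
    "file2.html": ("The dog barked and ran away.", ["file1.html", "file4.html"]),
}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


index = {}        # file name -> [[unique words], [links]]
term_counts = {}  # file name -> Counter mapping word -> occurrences
for name, (text, links) in pages.items():
    words = tokenize(text)
    # dict.fromkeys keeps the first occurrence of each word, in order.
    unique_words = list(dict.fromkeys(words))
    index[name] = [unique_words, links]
    term_counts[name] = Counter(words)

print(index["file1.html"])
print(term_counts["file1.html"]["cat"])
```

No stop-word filtering is done here, so common words such as "the" stay in the word list. Dropping them would take one extra filtering step inside tokenize.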

Hopefully that gives you some ideas to run with. Come back and open another question if you run into other issues.

