Python Html: Extract Parts of Text from html file

I'm currently working on a project where I downloaded a set of related HTML files and gather data from them. I noticed that even though the overall format of the files is similar, different files sometimes use different tags to store the same kind of information.

For example, in one file it could be:

<html>
<head>
<p> Title: The GodFather </p>
<p> Author: Mario Puzo </p>
</head>
</html>

And in another example it could be:

<html>
<head>
<p> Heading </p>
<pre> Ebook from xyz site: Please donate to our foundation at www.abc.com
Title: The GodFather
Author: Mario Puzo
</pre>
</head>
</html>

I can say for sure that "Title:" and "Author:" appear in all the HTML files. I want to extract the text that follows "Title:" and "Author:". I assume I would use Beautiful Soup to parse each HTML file, but for extracting the title and author, would regular expressions be the best approach?

Don't even bother with Beautiful Soup; just use regular expressions:

>>> import re
>>> # html holds the raw text of the second example file
>>> re.findall(r'(?<=Author:).*?(?=<)', html.replace('\n', ''))
[' Mario Puzo']

>>> re.findall(r'(?<=Title:).*?(?=<)', html.replace('\n', ''))
[' The GodFatherAuthor: Mario Puzo']

The Author pattern works as is. The Title pattern can over-match: when "Author" follows the title before the next tag (as in the output above), the match runs on into the author text. In that case, apply title.split('Author')[0] to every extracted title. If "Author" is not in the string, the split leaves it unchanged.
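
If you would rather not strip newlines and then split the result, one possible alternative is to keep the newlines and stop each match at the next tag or the end of the line, whichever comes first. The following is only a minimal sketch, assuming each value sits on the same line as its label; the names FIELD_RE and extract_fields are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical helper names; the pattern assumes each value is on the
# same line as its "Title:" / "Author:" label.
# [^<\n]* stops at the next tag or at the end of the line.
FIELD_RE = re.compile(r'(Title|Author):[ \t]*([^<\n]*)')

def extract_fields(html):
    """Return a dict mapping 'Title' and 'Author' to their stripped values."""
    fields = {}
    for name, value in FIELD_RE.findall(html):
        # Keep only the first occurrence of each label.
        fields.setdefault(name, value.strip())
    return fields

# The second example file from the question.
sample = """<html>
<head>
<p> Heading </p>
<pre> Ebook from xyz site: Please donate to our foundation at www.abc.com
Title: The GodFather
Author: Mario Puzo
</pre>
</head>
</html>"""

print(extract_fields(sample))
# {'Title': 'The GodFather', 'Author': 'Mario Puzo'}
```

Because the newlines are kept, the title can no longer run into the author line, so the split step should not be needed for files shaped like the two examples.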
