If I have this division:
<div class="wikicontent" id="wikicontentid">
How can I use Python to print just that tag and its contents?
You can use BeautifulSoup:
import bs4
# html_content is assumed to hold the page's HTML source as a string
soup = bs4.BeautifulSoup(html_content, "html.parser")
result = soup.find("div", {"class": "wikicontent", "id": "wikicontentid"})
print(result)
Use the Beautiful Soup module.
>>> import bs4
Suppose we have a document that contains a number of divs, some of which match the class, some of which match the id, and one that matches both:
>>> html = '<div class="wikicontent">blah1</div><div class="wikicontent" id="wikicontentid">blah2</div><div id="wikicontentid">blah3</div>'
We can parse it with Beautiful Soup (naming the parser explicitly avoids a warning about the parser being guessed):
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
To find all the divs:
>>> soup.find_all('div')
[<div class="wikicontent">blah1</div>, <div class="wikicontent" id="wikicontentid">blah2</div>, <div id="wikicontentid">blah3</div>]
This is a bs4.element.ResultSet that contains three bs4.element.Tag objects, which you can extract via the [] operator.
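As a minimal sketch of that indexing, assuming the same sample markup and the built-in 'html.parser' (the parser choice is an assumption made here, not something the answer specifies):

```python
import bs4

# Same sample markup as in the answer above.
html = ('<div class="wikicontent">blah1</div>'
        '<div class="wikicontent" id="wikicontentid">blah2</div>'
        '<div id="wikicontentid">blah3</div>')

# Assumption: the standard-library parser is good enough for this fragment.
soup = bs4.BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div')

# A ResultSet behaves like a list, so it supports len() and indexing.
second = divs[1]           # a bs4.element.Tag
print(len(divs))           # number of matching divs
print(second['id'])        # attribute lookup on the Tag
print(second.get_text())   # text inside the Tag
```

Each element you pull out this way is a Tag, so the attribute and text accessors shown here should work on any of the results.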
To find everything matching a given id, use the id keyword argument:
>>> soup.find_all(id='wikicontentid')
[<div class="wikicontent" id="wikicontentid">blah2</div>, <div id="wikicontentid">blah3</div>]
To match a class, use the class_ keyword argument (note the trailing underscore, needed because class is a reserved word in Python):
>>> soup.find_all(class_='wikicontent')
[<div class="wikicontent">blah1</div>, <div class="wikicontent" id="wikicontentid">blah2</div>]
You can combine these selectors in a single call:
>>> soup.find_all('div', class_='wikicontent', id='wikicontentid')
[<div class="wikicontent" id="wikicontentid">blah2</div>]
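If you prefer CSS selector syntax, the same combined match can probably be written with Beautiful Soup's select and select_one methods. This is only a sketch of an alternative, not part of the answer above, and it assumes a bs4 version recent enough to handle selectors that combine a tag name, a class, and an id:

```python
import bs4

# Same sample markup as in the answer above.
html = ('<div class="wikicontent">blah1</div>'
        '<div class="wikicontent" id="wikicontentid">blah2</div>'
        '<div id="wikicontentid">blah3</div>')

# Assumption: the standard-library parser is used.
soup = bs4.BeautifulSoup(html, 'html.parser')

# "div.wikicontent#wikicontentid" means: a div with that class AND that id.
matches = soup.select('div.wikicontent#wikicontentid')    # all matches
first = soup.select_one('div.wikicontent#wikicontentid')  # first match or None

print(matches)
print(first)
```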
If you know there is only one match, or if you are only interested in the first match, use soup.find:
>>> soup.find(class_='wikicontent', id='wikicontentid')
<div class="wikicontent" id="wikicontentid">blah2</div>
As before, this is not a string,
>>> type(soup.find('div', class_='wikicontent', id='wikicontentid'))
<class 'bs4.element.Tag'>
but you can turn it into one:
>>> str(soup.find('div', class_='wikicontent', id='wikicontentid'))
'<div class="wikicontent" id="wikicontentid">blah2</div>'
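Putting this together for the original question, here is a rough sketch that returns the tag with its contents, the contents alone, and the plain text. The helper name extract_div and the sample markup are hypothetical, introduced only for illustration, and soup.find returns None when nothing matches, so that case is handled explicitly:

```python
import bs4

def extract_div(html):
    """Return (outer_html, inner_html, text) for the target div, or None."""
    # Assumption: the standard-library parser is used.
    soup = bs4.BeautifulSoup(html, 'html.parser')
    tag = soup.find('div', class_='wikicontent', id='wikicontentid')
    if tag is None:
        return None                    # no div with that class and id
    outer = str(tag)                   # the tag itself plus its contents
    inner = tag.decode_contents()      # only the markup inside the tag
    text = tag.get_text()              # only the text, with markup stripped
    return outer, inner, text

# Hypothetical sample input for illustration.
sample = '<p>intro</p><div class="wikicontent" id="wikicontentid">blah2 <b>bold</b></div>'
result = extract_div(sample)
if result is not None:
    for part in result:
        print(part)
```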
To download the page source, use Requests ( http://docs.python-requests.org/en/latest/ ); to parse the HTML, use lxml ( http://lxml.de/ ).
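import requests
import lxml.html
# Replace the placeholder URL with the page you are scraping.
dom = lxml.html.fromstring(requests.get('http://theurlyourscraping.com').content)
# text() returns the text nodes directly inside each matching div
wikicontent = dom.xpath('//div[@class="wikicontent"]/text()')
print(wikicontent)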