I'm using Scrapy for a project and I get the following HTML:
<h3><span class="my_class">First title</span></h3>
<ul>
<li>Text for the first title... li #1</li>
</ul>
<ul>
<li>Text for the first title... li #2</li>
</ul>
<h3><span class="my_class">Second title</span></h3>
<ul>
<li>Text for the second title... li #1</li>
</ul>
<ul>
<li>Text for the second title... li #2</li>
</ul>
Now, when I use response.xpath(".//ul/li/text()").extract() it works and gives me ["Text for the first title... li #1", "Text for the first title... li #2", "Text for the second title... li #1", "Text for the second title... li #2"]
But this is only part of what I want.
I want two lists, one for First title and another one for Second title. That way the outcome will be:
first_title = ["Text for the first title... li #1", "Text for the first title... li #2"]
second_title = ["Text for the second title... li #1", "Text for the second title... li #2"]
I still don't have a clue how to achieve this. I'm currently using Scrapy to get the HTML; a solution using XPath with pure Python would be ideal for me. But somehow I believe BeautifulSoup would be useful for this kind of task.
Do you have any ideas how to perform this in Python?
A way to do this with Beautiful Soup would be the following. (I've stored the results in a dict rather than separately named lists, in case you don't know in advance how many you'll have.)
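from bs4 import BeautifulSoup
# html is assumed to hold the page source as a string (e.g. response.text), not a URL
soup = BeautifulSoup(html, 'html.parser')
results = {}
for group in soup.find_all('ul'):
    # Key each <ul> by the nearest preceding <h3>, so several lists can share one title
    title = group.find_previous_sibling('h3').text
    results.setdefault(title, []).extend(e.text for e in group.find_all('li'))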
If you want to use BeautifulSoup you can use the findNext method (spelled find_next in current bs4 releases):
h3s = soup.find_all("h3")
for h3 in h3s:
    print(h3.text)
    # findNext returns only the first <ul> after this <h3>
    print(h3.findNext("ul").text)
In this case BS is a bit easier to use because it can find the siblings of an element more easily.
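Since every li in the question sits in its own ul, taking only the first ul after each h3 would miss the rest. A minimal sketch of one way around that, assuming the h3 and ul tags are siblings under the same parent: walk the siblings after each h3 and stop at the next h3. The function name group_items_by_heading and the HTML constant are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (each <li> in its own <ul>).
HTML = """
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #1</li></ul>
<ul><li>Text for the second title... li #2</li></ul>
"""


def group_items_by_heading(html):
    """Map each <h3> title to the <li> texts found before the next <h3>.

    Hypothetical helper; assumes <h3> and <ul> share the same parent.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for h3 in soup.find_all("h3"):
        items = []
        # Only look at following <h3>/<ul> siblings, in document order.
        for sibling in h3.find_next_siblings(["h3", "ul"]):
            if sibling.name == "h3":
                break  # the next title starts here
            items.extend(li.get_text(strip=True) for li in sibling.find_all("li"))
        results[h3.get_text(strip=True)] = items
    return results


grouped = group_items_by_heading(HTML)
```

Here grouped["First title"] and grouped["Second title"] play the role of the first_title and second_title lists from the question.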
With simple XPath you could do something like this:
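# data is assumed here to be a parsed lxml tree, e.g. lxml.html.fromstring(html)
h3s = data.xpath('//h3')
for h3 in h3s:
    print(h3.xpath('.//text()'))
    # [0] keeps only the first <ul> that follows this <h3>
    print(h3.xpath('./following-sibling::ul')[0].xpath('.//text()'))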
This is tailored to your example above. If you need a more general approach, I would say BS is the right tool because of the methods it provides.
You can use XPath and CSS selectors in Scrapy.
Here's an example solution (in an ipython session; I only changed #1 and #2 in the 2nd block to #3 and #4 to make it more obvious):
In [1]: import scrapy
In [2]: selector = scrapy.Selector(text="""<h3><span class="my_class">First title</span></h3>
...: <ul>
...: <li>Text for the first title... li #1</li>
...: <li>Text for the first title... li #2</li>
...: </ul>
...: <h3><span class="my_class">Second title</span></h3>
...: <ul>
...: <li>Text for the second title... li #3</li>
...: <li>Text for the second title... li #4</li>
...: </ul>""")
In [3]: for title_list in selector.css('h3 + ul'):
...:     print(title_list.xpath('./li/text()').extract())
...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']
In [4]: for title_list in selector.css('h3 + ul'):
...:     print(title_list.css('li::text').extract())
...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']
In [5]:
Edit, after OP's question in comment:
"Every <li> tag is enclosed in its own <ul> (...) Is there any way to extend that to make it look for all the ul tags below the h3 tag?"
If h3 and ul are all siblings, one way to select the ul elements that come before the next h3 is to count preceding h3 siblings.
Consider this input HTML snippet:
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
The first <ul><li> line has 1 preceding h3 sibling, the 3rd <ul><li> line has 2 preceding h3 siblings.
So for each h3, you want the following ul siblings that have exactly as many preceding h3 siblings as the number of h3 you've seen so far.
First:
following-sibling::ul[count(preceding-sibling::h3)=1]
then,
following-sibling::ul[count(preceding-sibling::h3)=2]
and so on.
Here is this idea in action with the help of enumerate() on the h3 selection (remember that XPath positions start at 1, not 0):
In [1]: import scrapy
In [2]: selector = scrapy.Selector(text="""
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
""")
In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']