
HTML in between h3/h2 tags with Xpath/BeautifulSoup

I'm using Scrapy for a project and I get the following html:

<h3><span class="my_class">First title</span></h3>
<ul>
    <li>Text for the first title... li #1</li>
</ul>
<ul>
    <li>Text for the first title... li #2</li>
</ul>
<h3><span class="my_class">Second title</span></h3>
<ul>
    <li>Text for the second title... li #1</li>
</ul>
<ul>
    <li>Text for the second title... li #2</li>
</ul>

Using response.xpath(".//ul/li/text()").extract() works and gives me ["Text for the first title... li #1", "Text for the first title... li #2", "Text for the second title... li #1", "Text for the second title... li #2"], but that is only part of what I want.

I want two lists, one for First title and another one for Second title, so that the outcome is:

first_title = ["Text for the first title... li #1", "Text for the first title... li #2"]
second_title = ["Text for the second title... li #1", "Text for the second title... li #2"]

I still don't have a clue how to achieve this. I'm currently using Scrapy to get the HTML, so a solution using XPath with pure Python would be ideal, but I suspect BeautifulSoup could be useful for this kind of task.

Do you have any ideas how to perform this in Python?

A way to do this with Beautiful Soup would be the following. (I've stored the results in a dict rather than separately named lists, in case you don't know in advance how many you'll have.)

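from bs4 import BeautifulSoup

# html holds the page markup as a string, not a URL (e.g. response.text in recent Scrapy versions)
soup = BeautifulSoup(html, 'html.parser')
results = {}
for group in soup.find_all('ul'):
    # find_previous_sibling('h3') skips any <ul> sitting between this list and its title
    title = group.find_previous_sibling('h3').text
    results.setdefault(title, []).extend(e.text for e in group.find_all('li'))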

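If you want to use BeautifulSoup you can use the find_next method (findNext is its older alias):

# soup is the BeautifulSoup object built from the page HTML
h3s = soup.find_all("h3")
for h3 in h3s:
    print(h3.text)
    # find_next returns only the first <ul> after the heading
    print(h3.find_next("ul").text)

In this case BeautifulSoup is a bit easier to use because it has dedicated methods for finding an element's siblings.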
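Looking up the next <ul> returns only the first list after each heading, while the HTML in the question has several lists per heading. Below is a minimal sketch of one way to collect them all, assuming the headings and lists are siblings under the same parent. The function name group_items_by_heading and the html variable are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #1</li></ul>
<ul><li>Text for the second title... li #2</li></ul>
"""


def group_items_by_heading(markup):
    """Map each <h3> title to the <li> texts found before the next <h3>."""
    soup = BeautifulSoup(markup, "html.parser")
    grouped = {}
    for h3 in soup.find_all("h3"):
        items = []
        # Walk the following <h3>/<ul> siblings in document order.
        for sibling in h3.find_next_siblings(["h3", "ul"]):
            if sibling.name == "h3":
                break  # reached the next section
            items.extend(li.get_text(strip=True) for li in sibling.find_all("li"))
        grouped[h3.get_text(strip=True)] = items
    return grouped


grouped = group_items_by_heading(html)
print(grouped["First title"])
print(grouped["Second title"])
```

A dict keyed by title is used instead of separately named lists, so the number of headings does not need to be known in advance.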

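With plain XPath you could do something like this:

# data is the parsed document, e.g. a Scrapy selector or an lxml tree
h3s = data.xpath('//h3')
for h3 in h3s:
    print(h3.xpath('.//text()'))
    # [0] keeps only the first <ul> that follows the heading
    print(h3.xpath('./following-sibling::ul')[0].xpath('.//text()'))

This is tailored to your example above. If you need a more general approach, I would say BeautifulSoup is the right tool because of the navigation methods it offers.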
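For comparison, the same grouping can be sketched with only the Python standard library, for cases where installing an extra parser is not an option. This is an illustrative sketch, not part of the answers here: the class name HeadingListParser is hypothetical, and it assumes that <h3> and <li> elements are never nested inside each other.

```python
from html.parser import HTMLParser

html = """
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #1</li></ul>
<ul><li>Text for the second title... li #2</li></ul>
"""


class HeadingListParser(HTMLParser):
    """Collect <li> text under the most recently seen <h3> title."""

    def __init__(self):
        super().__init__()
        self.groups = {}     # title -> list of <li> texts
        self._title = None   # title of the section currently being read
        self._target = None  # "h3" or "li" while inside that tag
        self._buffer = []    # text pieces of the current h3/li

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "li"):
            self._target = tag
            self._buffer = []

    def handle_data(self, data):
        if self._target is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._target:
            return  # e.g. </span> inside the heading
        text = "".join(self._buffer).strip()
        if tag == "h3":
            self._title = text
            self.groups[text] = []
        elif self._title is not None:
            # <li> items seen before any <h3> are ignored
            self.groups[self._title].append(text)
        self._target = None


parser = HeadingListParser()
parser.feed(html)
print(parser.groups)
```

This relies on document order rather than sibling relationships, so it also works when the lists are wrapped in extra containers, at the cost of writing the state tracking by hand.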

You can use XPath and CSS selectors in Scrapy.

Here's an example solution (in an IPython session; I only changed #1 and #2 in the 2nd block to #3 and #4 to make it more obvious):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""<h3><span class="my_class">First title</span></h3>
   ...: <ul>
   ...:     <li>Text for the first title... li #1</li>
   ...:     <li>Text for the first title... li #2</li>
   ...: </ul>
   ...: <h3><span class="my_class">Second title</span></h3>
   ...: <ul>
   ...:     <li>Text for the second title... li #3</li>
   ...:     <li>Text for the second title... li #4</li>
   ...: </ul>""")

In [3]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.xpath('./li/text()').extract())
   ...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']

In [4]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.css('li::text').extract())
   ...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']

Edit, after OP's question in comment:

Every <li> tag is enclosed in its own <ul> (...) Is there any way to extend that to make it look for all the ul tags below the h3 tag?

If h3 and ul are all siblings, one way to select the ul elements that come before the next h3 is to count preceding h3 siblings.

Consider this input HTML snippet:

<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>

<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>

The first <ul><li> line has 1 preceding h3 sibling, the 3rd <ul><li> line has 2 preceding h3 siblings.

So for each h3, you want the following ul siblings whose number of preceding h3 siblings equals the number of h3 elements you've seen so far.

First:

following-sibling::ul[count(preceding-sibling::h3)=1]

then,

following-sibling::ul[count(preceding-sibling::h3)=2]

and so on.

Here is this idea in action with the help of enumerate() on the h3 selection (remember that XPath positions start at 1, not 0):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>

<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
""")

In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
   ...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
   ...:
[u'Text for the first title... li #1', u'Text for the first title... li #2']
[u'Text for the second title... li #3', u'Text for the second title... li #4']
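To get the named lists the question asks for, the counting idea can be wrapped in a small helper that returns a dict keyed by title. The sketch below uses lxml instead of a Scrapy selector so that it runs outside a Scrapy project; that substitution, the function name lists_by_title and the html variable are assumptions for illustration. It also asks XPath for each h3's own position rather than using enumerate(), which is one way to cope with h3 elements that do not all share a parent.

```python
import lxml.html

html = """
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>

<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
"""


def lists_by_title(markup):
    """Return {h3 title: [li texts of the <ul> siblings before the next h3]}."""
    tree = lxml.html.document_fromstring(markup)
    grouped = {}
    for h3 in tree.xpath("//h3"):
        title = str(h3.xpath("normalize-space(string(.))"))
        # Number of h3 siblings up to and including this one.
        position = int(h3.xpath("count(preceding-sibling::h3)")) + 1
        # Keep the <ul> siblings that have exactly `position` h3 siblings before them.
        texts = h3.xpath(
            "following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()" % position
        )
        grouped[title] = [str(text) for text in texts]
    return grouped


grouped = lists_by_title(html)
first_title = grouped["First title"]
second_title = grouped["Second title"]
print(first_title)
print(second_title)
```

The XPath expression is the same one used in the session above, so it should carry over to a Scrapy selector with .extract() in place of the str() conversions.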
