
How to extract nested text in Scrapy?

I'm trying to extract a paragraph of brand description on this website using Scrapy: http://us.asos.com/hope-and-ivy/hope-ivy-dotty-mesh-midi-dress-with-ruffle-detail/prd/8663409?clr=black&cid=2623&pgesize=36&pge=0&totalstyles=627&gridsize=3&gridrow=1&gridcolumn=1

The HTML element looks like this:

<div class="brand-description">
  <h4>Brand</h4>
  <span>"Prom queens and wedding guests, claim the best-dressed title in "
    <a href="/Women/A-To-Z-Of-Brands/Hope-And-Ivy/Cat/pgecategory.aspx?cid=21368">
      <strong>"Hope and Ivy's"</strong>
    </a> 
    "occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."
  </span>
</div>

My desired result is:

"Prom queens and wedding guests, claim the best-dressed title in Hope and Ivy's occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."

I tried this method:

response.css("div.brand-description span::text").extract()

However, the list I got is missing the text inside the "strong" tag, which is "Hope and Ivy's":

['Prom queens and wedding guests, claim the best-dressed title in ',  ' occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses.']

My question is: can I get the full plain text, ignoring the nested "a" (href) and "strong" tags?

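You may still need some post-processing, but this is probably the closest you can get with a single expression. normalize-space() works on the string value of the span, which concatenates the text of all its descendants (including the nested "a" and "strong" elements), and it trims and collapses runs of whitespace: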

response.xpath('normalize-space(//div[@class="brand-description"]/span)').extract_first()

which will give you

u'"Prom queens and wedding guests, claim the best-dressed title in " "Hope and Ivy\'s" "occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."'
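The post-processing mentioned above could look something like the following. This is only a minimal sketch: it assumes the double quotes really are part of the page text (as the output above suggests), and the helper name clean_description is made up for illustration, not part of Scrapy.

```python
def clean_description(raw: str) -> str:
    """Drop literal double quotes, then collapse runs of whitespace."""
    without_quotes = raw.replace('"', "")
    # split() with no arguments splits on any whitespace run and discards
    # empty pieces, so joining with one space normalizes the spacing.
    return " ".join(without_quotes.split())


# The string returned by the normalize-space() query above.
raw = (
    '"Prom queens and wedding guests, claim the best-dressed title in " '
    '"Hope and Ivy\'s" '
    '"occasion-ready collection. Shop its notice-me styles for hand-painted '
    'florals, Bardot necklines and figure-flattering pencil dresses."'
)

print(clean_description(raw))
```

Note that this removes every double quote, so it would also strip any quotation marks that legitimately belong to a description.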
