I'm trying to extract the brand description paragraph from this page using Scrapy: http://us.asos.com/hope-and-ivy/hope-ivy-dotty-mesh-midi-dress-with-ruffle-detail/prd/8663409?clr=black&cid=2623&pgesize=36&pge=0&totalstyles=627&gridsize=3&gridrow=1&gridcolumn=1
The HTML element looks like this:
<div class="brand-description">
<h4>Brand</h4>
<span>"Prom queens and wedding guests, claim the best-dressed title in "
<a href="/Women/A-To-Z-Of-Brands/Hope-And-Ivy/Cat/pgecategory.aspx?cid=21368">
<strong>"Hope and Ivy's"</strong>
</a>
"occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."
</span>
</div>
My desired result is:
"Prom queens and wedding guests, claim the best-dressed title in Hope and Ivy's occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."
I tried this method:
response.css("div.brand-description span::text").extract()
However, the list I got back is missing the text inside the "strong" tag, which is "Hope and Ivy's":
['Prom queens and wedding guests, claim the best-dressed title in ', ' occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses.']
My question is: can I get the full plain text of the span, including the text nested inside the "a" and "strong" tags?
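The "span::text" selector returns only the text nodes that are direct children of the span, so the text nested inside "a" and "strong" is skipped. Taking the string value of the whole span with XPath includes every descendant text node. You might still have to do some post-processing, but this is probably the best you can do: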
response.xpath('normalize-space(//div[@class="brand-description"]/span)').extract_first()
which will give you
u'"Prom queens and wedding guests, claim the best-dressed title in " "Hope and Ivy\'s" "occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."'
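The post-processing mentioned above could look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector (an assumption made only so the example runs without Scrapy installed), and the helper names normalize_space and clean_description are hypothetical. It approximates what normalize-space() does to the span's string value, then strips the literal double quotes that appear in the snippet as posted.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, including the literal double quotes it shows.
HTML = """<div class="brand-description">
<h4>Brand</h4>
<span>"Prom queens and wedding guests, claim the best-dressed title in "
<a href="/Women/A-To-Z-Of-Brands/Hope-And-Ivy/Cat/pgecategory.aspx?cid=21368">
<strong>"Hope and Ivy's"</strong>
</a>
"occasion-ready collection. Shop its notice-me styles for hand-painted florals, Bardot necklines and figure-flattering pencil dresses."
</span>
</div>"""


def normalize_space(text):
    # Roughly what XPath normalize-space() does: trim the ends and
    # collapse every run of whitespace into a single space.
    return " ".join(text.split())


def clean_description(text):
    # Hypothetical post-processing step: drop the literal double quotes,
    # then normalize again to remove the doubled spaces they leave behind.
    return normalize_space(text.replace('"', ""))


span = ET.fromstring(HTML).find("span")
# itertext() yields the text of the span and of all its descendants,
# in document order, much like the XPath string value of the element.
raw = normalize_space("".join(span.itertext()))
description = clean_description(raw)
print(description)
```

In a spider, the string returned by extract_first() could be passed to clean_description in the same way. Note that removing every double quote would also remove quotation marks that genuinely belong to the description, so this is only suitable if the quotes are known to be artifacts.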