
scrapy: Remove some elements from a div with XPath

I'm doing some scraping and there are some elements I want to exclude. For example, from the main div id="Introduction" I want to scrape only the h2 and the two paragraphs, and exclude the span class="section_edit_link" and the div class="photo_container". I can of course extract the elements I want and join them, but since every section contains these two unwanted elements, is there a way to exclude them in the XPath itself?

<div id="Introduction"><span class="section_edit_link"><a href="/wiki_edit.cfm?title=Seoul&amp;section=Introduction" title="Edit section: Introduction" rel="nofollow">edit</a> </span>
<h2>Introduction</h2>
<div class="photo_container">
    <a href="https://www.travellerspoint.com/photos/stream/photoID/80/features/countries/South Korea/"><img src="https://photos.travellerspoint.com/8818/thumb_dhessel_seoul.jpg" width="200" height="146" alt="Night time traffic in Seoul" class="photo"></a>
    <h4>Night time traffic in Seoul</h4>
    <p>© All Rights Reserved <a href="https://www.travellerspoint.com/users/Hessell/">Hessell</a></p>
</div>
<p><strong>Seoul</strong> (서울) is the heart of <a href="http://www.travellerspoint.com/guide/South_Korea/">South Korea</a>, hosting about a quarter of the country's population of nearly 50 million. Seoul was also the historic capital of Korea from the 14th century until the nation's partition into <a href="http://www.travellerspoint.com/guide/North_Korea/">North</a> and <a href="http://www.travellerspoint.com/guide/South_Korea/">South</a> in 1948. Located just 50 kilometres south of the North Korean border, Seoul symbolises the division of North and South Korea. </p>
<p>Seoul enjoys a lively nightlife, which has earned it comparisons with <a href="http://www.travellerspoint.com/guide/Tokyo/">Tokyo</a>. Thankfully though, Seoul is much cheaper than the <a href="http://www.travellerspoint.com/guide/Japan/">Japanese</a> capital.</p>
</div>

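If your Introduction div contains only the elements shown in the question, the following should give you the desired result. The `>` child combinator restricts each match to direct children of the div, so the paragraph nested inside photo_container is not picked up: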

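     yield {
          # '>' matches direct children only, so the span and the photo_container contents are skipped
          'heading': response.css('#Introduction > h2').extract_first(),
          # extract_first() returns the first matching <p>, i.e. the first paragraph
          'para 1': response.css('#Introduction > p').extract_first(),
          # :last-child matches only if the final child of the div is a <p>
          'para 2': response.css('#Introduction > p:last-child').extract_first(),
          }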


 