Remove text from Scrapy Output

Below is a sample piece of HTML that I want to scrape with Scrapy.

<body>
<h2 class="post-title entry-title">Sample Header</h2>
    <div class="entry clearfix">
        <div class="sample1">
            <p>Hello</p>
        </div>
        <!--start comment-->
        <div class="sample2">
            <p>World</p>
        </div>
        <!--end comment-->
    </div>
<ul class="post-categories">
<li><a href="123.html">Category1</a></li>
<li><a href="456.html">Category2</a></li>
<li><a href="789.html">Category3</a></li>
</ul>
</body>

Right now I am using the following working Scrapy code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem

class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True), 
        Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//h2[@class="post-title entry-title"]/text()').extract()[0]
        item['tag'] = hxs.select('//ul[@class="post-categories"]/li[1]/a/text()').extract()[0]
        item['article_html'] = hxs.select("//div[@class='entry clearfix']").extract()[0]
        return item

It gives me the following XML output:

<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>

        <article_html>
        <div class="entry clearfix">
        <div class="sample1">
            <p>Hello</p>
        </div>
        <!--start comment-->
        <div class="sample2">
            <p>World</p>
        </div>
        <!--end comment-->
        </div>      
        </article_html>

        <tag>
        Category1
        </tag>

        <title>
        Sample Header
        </title>

    </item>
</items>

I want to know how to achieve the following output:

<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>

        <article_html>
        <div class="entry clearfix">
        <div class="sample1">
            <p>Hello</p>
        </div>
        <!--start comment-->
        <!--end comment-->
        </div>      
        </article_html>

        <tag>
        Category1,Category2,Category3
        </tag>

        <title>
        Sample Header
        </title>

    </item>
</items>

Note: The number of categories depends on the post. In the above example there are 3 categories; there could be more or fewer.

Help would be much appreciated. Cheers.

Use Scrapy Item Loaders. There you can specify how to treat multiple inputs for one field. You can use the TakeFirst processor to take only the first value, or the Join processor to combine all of the values into a single string. Or you can write your own.
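As a rough illustration of the "write your own" route, here is a minimal sketch of two processor-style callables in plain Python. An Item Loader processor is just a callable that receives the values collected for a field, so functions shaped like these could be attached to the `tag` and `article_html` fields. The names `strip_between_comments`, `clean_article` and `join_categories` are made up for this example, and the regex assumes the marker comments appear exactly as in the sample HTML. For `tag` to receive every category, the XPath would also need to select all `li` elements rather than `li[1]`.

```python
import re

# Matches the start marker, everything after it, and the end marker.
# The whitespace in front of the end marker is captured with it so the
# original indentation survives. DOTALL lets "." span line breaks.
# Assumption: the comments are spelled exactly as in the sample HTML.
_BETWEEN_MARKERS = re.compile(
    r"(<!--start comment-->).*?(\s*<!--end comment-->)", re.DOTALL
)


def strip_between_comments(html):
    """Drop the markup between the two marker comments, keeping the comments."""
    return _BETWEEN_MARKERS.sub(r"\1\2", html)


def clean_article(values):
    """Processor-style callable: clean each extracted HTML fragment."""
    return [strip_between_comments(value) for value in values]


def join_categories(values):
    """Processor-style callable: build one comma-separated string."""
    return ",".join(value.strip() for value in values)
```

If no stripping of the individual values is needed, the built-in Join processor with `','` as its separator should cover the category case without a custom function.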
