Scrapy dmoz tutorial, no data for desc in csv file

I followed the dmoz tutorial on Scrapy's official website to scrape the titles, links, and descriptions of Python books and resources. I used exactly the same spider as in the tutorial, which reads:

import scrapy 
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

It runs fine and prints the data on the console if I replace yield with print.

But the problem arises when I try to store the scraped data in a csv file using the command: scrapy crawl dmoz -o items.csv -t csv. The newly created csv file only has data for title and link, while the desc column is empty. Can somebody tell me why?

Multiple issues here.

First of all, the //ul/li locator is not the best one in this case, since it also matches the top menus and submenus, which don't have descriptions.
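To illustrate the locator problem, here is a minimal sketch using only the standard library. The HTML snippet is made up for illustration (it is not the real dmoz markup, apart from the directory-url class name used in the code further down), and xml.etree.ElementTree stands in for Scrapy's selectors, so the path syntax is ElementTree's limited XPath subset rather than full XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one menu list and one listing list.
SAMPLE = """
<html><body>
  <ul class="menu">
    <li><a href="/about">About</a></li>
    <li><a href="/help">Help</a></li>
  </ul>
  <ul class="directory-url">
    <li><a href="http://example.com/book">Example Book</a>
      - By A. Author; a made-up description.
    </li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# The broad locator picks up menu entries as well as listings.
all_items = root.findall(".//ul/li")

# Restricting by class keeps only the listing entries.
listing_items = root.findall(".//ul[@class='directory-url']/li")

print(len(all_items), "items matched by //ul/li")
print(len(listing_items), "items matched by the class-restricted locator")

# In ElementTree, the text following the <a> element is its "tail".
for li in listing_items:
    link = li.find("a")
    print(link.text, "|", link.get("href"), "|", (link.tail or "").strip())
```

The menu entries have a title and a link but no text after the anchor, which is why rows produced from them carry nothing in desc.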

Also, the descriptions are retrieved with all of the extra whitespace and newline characters, which you need to trim to get clean results. The most "scrapic" (idiomatic Scrapy) approach would be to use Item Loaders with input and output processors.
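As a rough sketch of what the processors do to a description, the plain-Python function below imitates stripping each extracted text node and then joining the pieces with a space. It is an approximation for illustration, not Scrapy's implementation; the function names and the sample text nodes are hypothetical.

```python
def strip_and_join(values, separator=" "):
    """Strip every extracted string, then join them into one string."""
    stripped = [value.strip() for value in values]
    return separator.join(stripped)


def strip_and_join_nonempty(values, separator=" "):
    """Same as above, but drop pieces that are empty after stripping."""
    stripped = [value.strip() for value in values]
    return separator.join(piece for piece in stripped if piece)


# Hypothetical text nodes of one <li>: whitespace before the link,
# the description after it, and a trailing newline.
text_nodes = ["\n    ", " - By A. Author; a made-up description.\n", "\n"]

print(repr(strip_and_join(text_nodes)))
print(repr(strip_and_join_nonempty(text_nodes)))
```

Empty pieces that survive the stripping step still take part in the join, which would explain a leading and trailing space around each description. Filtering them out, as in the second function, is one way to avoid that.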

Complete code:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)  # use unicode.strip on Python 2
    default_output_processor = Join()

    default_item_class = DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="directory-url"]/li'):
            loader = DmozItemLoader(selector=sel)

            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')

            yield loader.load_item()

After executing

$ scrapy runspider myspider.py -o items.csv -t csv

here is what I get in items.csv:

title,link,desc
Core Python Programming,"http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html"," - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall] "
Data Structures and Algorithms with Object-Oriented Design Patterns in Python,http://www.brpreiss.com/books/opus7/html/book.html," - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. "
...
Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython,http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1," - By Richard Hightower; Addison-Wesley, 2002, 0201616165. Begins with Python basics, many exercises, interactive sessions. Shows programming novices concepts and practical methods. Shows programming experts Python's abilities and ways to interface with Java APIs. [publisher website] "
Python: Visual QuickStart Guide,"http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html"," - By Chris Fehily; Peachpit Press, 2002, ISBN 0201748843. Task-based, step-by-step visual reference guide, many screen shots, for courses in digital graphics; Web design, scripting, development; multimedia, page layout, office tools, operating systems. [Prentice Hall] "
Sams Teach Yourself Python in 24 Hours,http://www.informit.com/store/product.aspx?isbn=0672317354," - By Ivan Van Laningham; Sams Publishing, 2000, ISBN 0672317354. Split into 24 hands-on, 1 hour lessons; steps needed to learn topic: syntax, language features, OO design and programming, GUIs (Tkinter), system administration, CGI. [Sams Publishing] "
Text Processing in Python,http://gnosis.cx/TPiP/," - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.] "
XML Processing with Python,http://www.informit.com/store/product.aspx?isbn=0130211192," - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR] "
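Some descriptions in this output span more than one line inside a quoted field, so the file should be read with a CSV parser rather than split line by line. Below is a minimal sketch with the standard csv module. The in-memory sample stands in for items.csv: its first row is abbreviated from the output above, its second row is made up, and read_items is a hypothetical helper name.

```python
import csv
import io

# Stand-in for items.csv; the second row has an embedded newline in desc.
SAMPLE_CSV = (
    "title,link,desc\n"
    'Text Processing in Python,http://gnosis.cx/TPiP/," - By David Mertz; Addison Wesley. "\n'
    'Hypothetical Book,http://example.com/," - First line.\nSecond line. "\n'
)


def read_items(csv_file):
    """Return one dict per row, with surrounding whitespace trimmed."""
    return [
        {key: value.strip() for key, value in row.items()}
        for row in csv.DictReader(csv_file)
    ]


rows = read_items(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["title"], "->", repr(row["desc"]))
```

With a real file, open it with newline="" (for example open("items.csv", newline="")) so that the csv module handles the embedded line breaks itself.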
