簡體   English   中英

Scrapy dmoz教程,csv文件中沒有用於desc的數據

[英]Scrapy dmoz tutorial, no data for desc in csv file

我遵循了Scrapy官方網站上的dmoz教程,以刮取Python書籍和資源的標題,鏈接和描述。 我在教程上使用了完全相同的蜘蛛,內容為:

import scrapy 
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

如果我將print替換為yield,它運行良好並且可以在控制台上打印數據。

但是,當我嘗試使用以下命令將scrapy dmoz -o items.csv -t csv數據存儲在csv文件中時,就會出現問題: scrapy dmoz -o items.csv -t csv 新創建的csv文件僅包含標題和鏈接的數據,而desc的列為空。 有人可以告訴我為什么嗎?

這里有多個問題。

首先,在這種情況下//ul/li定位器不是最好的定位器,因為它還會匹配沒有說明的頂部菜單和子菜單。

同樣,使用所有多余的空格和換行符來檢索描述,您需要對其進行修剪以獲得清晰的結果。 最“草率”的方法是將Item Loader與輸入和輸出處理器一起使用。

完整的代碼:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = Join()

    default_item_class = DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="directory-url"]/li'):
            loader = DmozItemLoader(selector=sel)

            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')

            yield loader.load_item()

執行后

$ scrapy runspider myspider.py -o items.csv -t csv

這是我在items.csv得到的:

title,link,desc
Core Python Programming,"http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html"," - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall] "
Data Structures and Algorithms with Object-Oriented Design Patterns in Python,http://www.brpreiss.com/books/opus7/html/book.html," - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. "
...
Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython,http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1," - By Richard Hightower; Addison-Wesley, 2002, 0201616165. Begins with Python basics, many exercises, interactive sessions. Shows programming novices concepts and practical methods. Shows programming experts Python's abilities and ways to interface with Java APIs. [publisher website] "
Python: Visual QuickStart Guide,"http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html"," - By Chris Fehily; Peachpit Press, 2002, ISBN 0201748843. Task-based, step-by-step visual reference guide, many screen shots, for courses in digital graphics; Web design, scripting, development; multimedia, page layout, office tools, operating systems. [Prentice Hall] "
Sams Teach Yourself Python in 24 Hours,http://www.informit.com/store/product.aspx?isbn=0672317354," - By Ivan Van Laningham; Sams Publishing, 2000, ISBN 0672317354. Split into 24 hands-on, 1 hour lessons; steps needed to learn topic: syntax, language features, OO design and programming, GUIs (Tkinter), system administration, CGI. [Sams Publishing] "
Text Processing in Python,http://gnosis.cx/TPiP/," - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.] "
XML Processing with Python,http://www.informit.com/store/product.aspx?isbn=0130211192," - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR] "

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM