Scrapy dmoz tutorial, no data for desc in csv file
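I followed the dmoz tutorial on the official Scrapy website to scrape the titles, links and descriptions of Python books and resources. I used exactly the same spider as in the tutorial: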
import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
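If I replace yield with print, it runs fine and prints the data on the console.
However, a problem shows up when I try to store the data in a csv file with this command:
scrapy crawl dmoz -o items.csv -t csv
The newly created csv file contains data only for title and link, while the desc column is empty. Can anyone tell me why?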
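There are multiple issues here.
First of all, the //ul/li locator is not the best choice in this case, because it also matches the top menu and the submenus, which have no descriptions.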
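The effect of the broad locator can be illustrated with a small sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and the markup below is made up for illustration (only the directory-url class name comes from the answer's code), so treat it as an approximation of the real page rather than a copy of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: one menu list and one listing list.
page = ET.fromstring("""
<html><body>
  <ul class="menu"><li><a href="/about">About</a></li></ul>
  <ul class="directory-url">
    <li><a href="http://example.com/book">Example Book</a> - A description.</li>
  </ul>
</body></html>
""")

# The broad path matches every list item, menu entries included.
broad = page.findall(".//ul/li")
# Restricting the path to the listing's class skips the menu.
narrow = page.findall(".//ul[@class='directory-url']/li")

for li in broad:
    link = li.find("a")
    # In ElementTree the text that follows </a> is stored in link.tail;
    # it is None for the menu entry, which has no description.
    print(link.text, "->", repr(link.tail))

print(len(broad), "items with //ul/li,", len(narrow), "with the narrower path")
```

With the broad path, the menu entry produces an item whose description is missing, which is the kind of row the narrower locator avoids.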
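Also, the descriptions are retrieved with all of their extra spaces and newlines, so you need to trim them to get a clean result. The most "Scrapy-like" way to do this is to use an Item Loader with input and output processors.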
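To make the trimming step concrete, here is a minimal plain-Python sketch of what a strip input processor followed by a join output processor does to the extracted text nodes. The helper names strip_each and join_values are hypothetical stand-ins, not Scrapy functions, and raw_desc is an assumed example of what text() might return for one list item.

```python
# Assumed shape of the text nodes extracted from one <li> element:
# whitespace-only nodes around the actual description.
raw_desc = ["\n    ", " - By Wesley J. Chun; Prentice Hall PTR, 2001. ", "\n  "]


def strip_each(values):
    """Strip every value, roughly like an input processor built on str.strip."""
    return [value.strip() for value in values]


def join_values(values, separator=" "):
    """Concatenate the values with a separator, roughly like a join output processor."""
    return separator.join(values)


cleaned = join_values(strip_each(raw_desc))
print(repr(cleaned))

# Dropping the empty strings before joining also removes the outer spaces.
tidy = " ".join(value.strip() for value in raw_desc if value.strip())
print(repr(tidy))
```

The whitespace-only nodes become empty strings that are still joined, which would explain the single leading and trailing space around each description in the exported file.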
The complete code:
import scrapy
from scrapy.loader import ItemLoader
# Newer Scrapy releases provide these processors in the separate
# itemloaders package (itemloaders.processors).
from scrapy.loader.processors import MapCompose, Join


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class DmozItemLoader(ItemLoader):
    # Strip every extracted value (the original Python 2 answer used unicode.strip).
    default_input_processor = MapCompose(str.strip)
    # Join the stripped values into a single string.
    default_output_processor = Join()
    default_item_class = DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Only the listing entries, not the menus.
        for sel in response.xpath('//ul[@class="directory-url"]/li'):
            loader = DmozItemLoader(selector=sel)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()
After running
$ scrapy runspider myspider.py -o items.csv -t csv
this is what I got in items.csv:
title,link,desc
Core Python Programming,"http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html"," - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall] "
Data Structures and Algorithms with Object-Oriented Design Patterns in Python,http://www.brpreiss.com/books/opus7/html/book.html," - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. "
...
Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython,http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1," - By Richard Hightower; Addison-Wesley, 2002, 0201616165. Begins with Python basics, many exercises, interactive sessions. Shows programming novices concepts and practical methods. Shows programming experts Python's abilities and ways to interface with Java APIs. [publisher website] "
Python: Visual QuickStart Guide,"http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html"," - By Chris Fehily; Peachpit Press, 2002, ISBN 0201748843. Task-based, step-by-step visual reference guide, many screen shots, for courses in digital graphics; Web design, scripting, development; multimedia, page layout, office tools, operating systems. [Prentice Hall] "
Sams Teach Yourself Python in 24 Hours,http://www.informit.com/store/product.aspx?isbn=0672317354," - By Ivan Van Laningham; Sams Publishing, 2000, ISBN 0672317354. Split into 24 hands-on, 1 hour lessons; steps needed to learn topic: syntax, language features, OO design and programming, GUIs (Tkinter), system administration, CGI. [Sams Publishing] "
Text Processing in Python,http://gnosis.cx/TPiP/," - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.] "
XML Processing with Python,http://www.informit.com/store/product.aspx?isbn=0130211192," - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR] "
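As a quick check of the exported file, the desc column can be inspected with the standard csv module. The sketch below reads from an in-memory sample instead of items.csv: the first data row is abbreviated from the output above, and the second row is invented to show a quoted field that spans two lines, as one of the entries above does.

```python
import csv
import io

# In-memory stand-in for items.csv (abbreviated and partly invented data).
sample = io.StringIO(
    'title,link,desc\n'
    'Text Processing in Python,http://gnosis.cx/TPiP/," - By David Mertz; Addison Wesley. "\n'
    'Example Title,http://example.com/,"first line\nsecond line"\n'
)

# DictReader keeps a quoted multi-line field together as one value.
rows = list(csv.DictReader(sample))

# Collect the titles whose description is empty or whitespace only.
missing = [row["title"] for row in rows if not row["desc"].strip()]
print(len(rows), "rows;", len(missing), "with an empty desc")
```

To run the same check against the real export, open the file with open("items.csv", newline="") and pass that file object to csv.DictReader.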