Hi I started scrapy recently,and wrote a crawler. But when outputting the data to csv,they are all printed in a single row. How can print each data to its own row?
I my case am printing links from a website. It works well when printed in json format.
Here's the code.
The items.py file.
import scrapy
from scrapy.item import Item ,Field
class ErcessassignmentItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
link = Field()
#pass
The mycrawler.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector # deprecated
from scrapy.selector import Selector
from ercessAssignment.items import ErcessassignmentItem
class MySpider(BaseSpider):
name ="ercessSpider"
allowed_domains =["site_url"]
start_urls = ["site_url"]
def parse(self, response):
hxs = Selector(response)
links = hxs.xpath("//p")
items = []
for linkk in links:
item = ErcessassignmentItem()
item["link"] = linkk.xpath("//a/@href").extract()
items.append(item)
return items`
You should have proper indentation in code
import scrapy
from scrapy.item import Item ,Field
class ErcessassignmentItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
link = Field()
Then in your spider, do not use return
, your for loop will run only once and you will only have 1 row printed in CSV, instead use yield
Second, where is your code to put items into CSV? I guess you are using scrapy's default way of storing items, in case you already do not know, please run your scraper like
scrapy crawl ercessSpider -o my_output.csv
Your spider code should be like this, notice changes I made
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector # deprecated
from scrapy.selector import Selector
from ercessAssignment.items import ErcessassignmentItem
class MySpider(BaseSpider):
name ="ercessSpider"
allowed_domains =["site_url"]
start_urls = ["site_url"]
def parse(self, response):
hxs = Selector(response)
links = hxs.xpath("//p")
for linkk in links:
item = ErcessassignmentItem()
item["link"] = linkk.xpath("//a/@href").extract()
yield item
for linkk in links:
item = ErcessassignmentItem()
item["link"] = xpath("//a/@href").extract()[linkk]
yield item
this works good in css selector but if above two solutions are not working then you can try this.
Your code above does not print
anything. Moreover, I don't see any .csv
part. Also, your items
list created in parse()
will never be longer than 1 due to something that looks like an indentation error to me (ie you return
after the first iteration of the for-loop
. For better readability, you could use the for/else construct here:
def parse(self, response):
hxs = Selector(response)
links = hxs.xpath("//p")
items = []
for linkk in links:
item = ErcessassignmentItem()
item["link"] = linkk.xpath("//a/@href").extract()
items.append(item)
else: # after for loop is finished
# either return items
# or print link in items here without returning
for link in items: # take one link after another
print link # and print it in one line each
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.