Ignoring blank values when extracting values with Scrapy

Question

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from sample3.items import taamaaItem

class taamaaSpider(BaseSpider):
   name = "taamaa"
   allowed_domains = ["taamaa.com"]
   start_urls = [
       "http://www.taamaa.com/store-directory/"]

   def parse(self, response):
       sel = Selector(response)
       sites = sel.xpath('//div/div[@class="section clearfix col-md-12"]')
       items = []
       list1 = []
       list2 = []
       for site in sites:
           list1 = sites[0].xpath('//div[@class="pull-left col-md-3 merchant"]/div[@class="name"]/a/text()').extract()
           list2 = sites[0].xpath('//div[@class="pull-left col-md-3 merchant"]/div[@class="url"]/a/text()').extract()
       for index in range(len(list2)):
           td = taamaaItem()
           td['name'] = list1[index] 
           td['link'] = list2[index] 
           items.append(td)
       return items

While extracting data, it skips the blank value and fetches the next link value instead, which misaligns my data.

For example, if A = a, B = , C = c, D = d, E = e

it produces the output A = a, B = c, C = d, D = e, E = a

and I want the output to be like this:

A = a, B = , C = c, D = d, E = e

How can I achieve this?

Answer 1

I see two strange things:

you are using absolute XPath expressions in your loop
and you are applying them to sites[0] on every iteration of the loop

For your problem of pairing two lists when some text elements are empty, you can keep the same structure with a loop over sites, but extract name and link inside each iteration, so you don't need the intermediate lists:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from sample3.items import taamaaItem

class taamaaSpider(BaseSpider):
   name = "taamaa"
   allowed_domains = ["taamaa.com"]
   start_urls = [
       "http://www.taamaa.com/store-directory/"]

   def parse(self, response):
       sel = Selector(response)
       sites = sel.xpath('//div/div[@class="section clearfix col-md-12"]')
       items = []
       for site in sites:
           td = taamaaItem()
           # The leading "." makes each XPath relative to the current site node,
           # so name and link are read from the same part of the page.
           td['name'] = site.xpath("""
                .//div[@class="pull-left col-md-3 merchant"]
                    /div[@class="name"]/a/text()""").extract()
           td['link'] = site.xpath("""
                .//div[@class="pull-left col-md-3 merchant"]
                    /div[@class="url"]/a/text()""").extract()
           items.append(td)
       return items

See how I use relative XPath expressions ( .//div...... )
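
To show why extracting per node keeps blank values in place, here is a minimal sketch. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree as a stand-in so it runs without Scrapy installed, the markup is made up (simplified class names, one merchant with an empty link), and the helper names first_text and extract_rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics the page layout; merchant B has an empty link.
SAMPLE = """
<div class="section">
  <div class="merchant"><div class="name"><a>A</a></div><div class="url"><a>a</a></div></div>
  <div class="merchant"><div class="name"><a>B</a></div><div class="url"><a></a></div></div>
  <div class="merchant"><div class="name"><a>C</a></div><div class="url"><a>c</a></div></div>
</div>
"""


def first_text(node, path):
    """Return the stripped text of the first match of a relative path, or ''."""
    found = node.find(path)
    if found is None or found.text is None:
        return ""  # keep a blank placeholder instead of dropping the value
    return found.text.strip()


def extract_rows(markup):
    """Build one dict per merchant node, so name and link always stay paired."""
    root = ET.fromstring(markup.strip())
    rows = []
    for merchant in root.findall(".//div[@class='merchant']"):
        rows.append({
            "name": first_text(merchant, "./div[@class='name']/a"),
            "link": first_text(merchant, "./div[@class='url']/a"),
        })
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row["name"], "=", row["link"])
```

The idea carries over to Scrapy by looping over the merchant nodes and using relative XPath on each one. On recent Scrapy versions, something like merchant.xpath('./div[@class="url"]/a/text()').get(default='') should give the same blank placeholder, but check that against the version you have installed.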
