
Scrapy ignores blank values when extracting values

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from sample3.items import taamaaItem

class taamaaSpider(BaseSpider):
   name = "taamaa"
   allowed_domains = ["taamaa.com"]
   start_urls = [
       "http://www.taamaa.com/store-directory/"]

   def parse(self, response):
       sel = Selector(response)
       sites = sel.xpath('//div/div[@class="section clearfix col-md-12"]')
       items = []
       list1 = []
       list2 = []
       for site in sites:
           list1 = sites[0].xpath('//div[@class="pull-left col-md-3 merchant"]/div[@class="name"]/a/text()').extract()
           list2 = sites[0].xpath('//div[@class="pull-left col-md-3 merchant"]/div[@class="url"]/a/text()').extract()
       for index in range(len(list2)):
           td = taamaaItem()
           td['name'] = list1[index] 
           td['link'] = list2[index] 
           items.append(td)
       return items

While extracting data, it skips blank values and fetches the next link value instead, which breaks the alignment of my data.

For example, if A = a, B = , C = c, D = d, E = e

it fetches the output A = a, B = c, C = d, D = e, E = a

and I want the output to be like this:

A = a, B = , C = c, D = d, E = e

How can I achieve this?

I see two strange things:

  • you are using absolute XPath expressions in your loop
  • and you are applying them to sites[0] on every iteration of the loop

To group two lists when some text elements are empty, you can keep the same structure with a loop over sites, but extract name and link inside each iteration, so you don't need the intermediate lists:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from sample3.items import taamaaItem

class taamaaSpider(BaseSpider):
   name = "taamaa"
   allowed_domains = ["taamaa.com"]
   start_urls = [
       "http://www.taamaa.com/store-directory/"]

   def parse(self, response):
       sel = Selector(response)
       sites = sel.xpath('//div/div[@class="section clearfix col-md-12"]')
       items = []
       for site in sites:
           td = taamaaItem()           
           td['name'] = site.xpath("""
                .//div[@class="pull-left col-md-3 merchant"]
                    /div[@class="name"]/a/text()""").extract()
           td['link'] = site.xpath("""
                .//div[@class="pull-left col-md-3 merchant"]
                    /div[@class="url"]/a/text()""").extract()
           items.append(td)
       return items

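See how I use relative XPath expressions ( .//div...... ). An expression starting with . is evaluated from the current site node, whereas one starting with // searches the whole document no matter which node it is called on.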
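
The alignment problem can also be reproduced without Scrapy, which may make the cause easier to see. The sketch below is only an illustration: it uses the standard-library xml.etree.ElementTree as a stand-in for Scrapy selectors, and the markup in SAMPLE is hypothetical, modelled on the class names from the question. Extracting names and links into two flat lists drops the empty link, so the lists drift apart. Extracting both fields per merchant block, with an empty string as the fallback, keeps each pair together. In Scrapy the same idea would presumably be a loop over the merchant div selectors with relative paths, taking the first match or a default such as extract_first(default="").

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the question's class names.
# Merchant B has an empty link, which is what causes the drift.
MERCHANT = "pull-left col-md-3 merchant"
SAMPLE = f"""
<div class="section clearfix col-md-12">
  <div class="{MERCHANT}"><div class="name"><a>A</a></div><div class="url"><a>a</a></div></div>
  <div class="{MERCHANT}"><div class="name"><a>B</a></div><div class="url"><a></a></div></div>
  <div class="{MERCHANT}"><div class="name"><a>C</a></div><div class="url"><a>c</a></div></div>
</div>
"""


def flat_lists(markup):
    """Mimic the question: two independent lists, empty text nodes skipped."""
    root = ET.fromstring(markup.strip())
    base = f".//div[@class='{MERCHANT}']"
    names = [a.text for a in root.findall(base + "/div[@class='name']/a") if a.text]
    links = [a.text for a in root.findall(base + "/div[@class='url']/a") if a.text]
    return names, links


def per_merchant(markup):
    """Extract name and link from each merchant block, so pairs stay aligned."""
    root = ET.fromstring(markup.strip())
    rows = []
    for merchant in root.findall(f".//div[@class='{MERCHANT}']"):
        # Paths are relative to the current merchant node; a missing or
        # empty element yields an empty string instead of being skipped.
        name = merchant.findtext("div[@class='name']/a", default="") or ""
        link = merchant.findtext("div[@class='url']/a", default="") or ""
        rows.append({"name": name.strip(), "link": link.strip()})
    return rows


names, links = flat_lists(SAMPLE)
rows = per_merchant(SAMPLE)
print(len(names), len(links))  # the two flat lists differ in length
print(rows)
```

If each section on the real page holds several merchants, the loop in the answer above would likely need to go one level deeper in the same way, iterating over the merchant blocks rather than the sections.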

