
Why are all my Scrapy items the same?

I am new to Scrapy and I am stuck on an issue. There is a website where I want to create a unique item for each cell of a table, but every item comes out the same and I don't know why. Here is my code:

from scrapy import Spider
from scrapy.selector import Selector

from petroleo.items import PetroleoItem


class PetroleoSpider(Spider):
  name = "petroleo"
  site = "http://www.glossary.oilfield.slb.com/"
  allowed_domains = [site]
  start_urls = [site + 'en/Terms.aspx?filter=sym&LookIn=term%20name&searchtype=starts%20with',]

  def parse(self, response):

    words = Selector(response).xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td")

    for word in words:
        item = PetroleoItem()

        if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/em").extract():

            item['title'] = word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/em/text()").extract()[0]
            item['title'] += word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/sub/text()").extract()[0]


        if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i").extract():
            item['title'] = {'en': word.xpath(
                "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i/text()").extract()}
            item['title']['en'][0] += word.xpath(
                "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i/sub/text()").extract()[0]

        if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/text()").extract():
            item['title'] = {'en': word.xpath(
                "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/text()").extract()}

        yield item

Make your expressions context-specific by prepending a dot, and don't repeat the //table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td part. An XPath that starts with // is evaluated against the whole document even when it is called on word, so every iteration of your loop matched the same nodes:

words = response.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td")

for word in words:
    item = PetroleoItem()

    if word.xpath("./a/em").extract():
        item['title'] = word.xpath("./a/em/text()").extract()[0]
        item['title'] += word.xpath("./a/sub/text()").extract()[0]

    if word.xpath("./a/i").extract():
        item['title'] = {'en': word.xpath("./a/i/text()").extract()}
        item['title']['en'][0] += word.xpath("./a/i/sub/text()").extract()[0]

    if word.xpath("./a/text()").extract():
        item['title'] = {'en': word.xpath("./a/text()").extract()}

    yield item

I don't particularly like, or fully understand, what you are trying to do in the loop, but this should at least solve the problem you've described in the question.
