
Unable to yield data with Scrapy

I am trying to extract data from a site that has terrible HTML formatting: all the info I want is in the same div, separated only by line breaks. I am new to web scraping in general, so please bear with me.

https://wsldata.com/directory/record.cfm?LibID=48

To get the parts I need, I use:

details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()

which returns:

['\r\n',
 '\r\n',
 '\r\n',
 '\r\n      \r\n      ',
 '\r\n\t\t\t',
 '\r\n      ',
 '\r\n      ',
 '\r\n      ',
 '\r\n\t\t\t\r\n\t\t\t',
 '\r\n      \r\n      ',
 '\r\n\t\t\tDirector',
 '\r\n       Ext: 5442',
 '\r\n      ',
 '\r\n      ',
 '\r\n\t\t\t\r\n\t\t\t',
 '\r\n      \r\n      ',
 '\r\n\t\t\tAssistant Library Director',
 '\r\n       Ext: 5433',
 '\r\n      ',
 '\r\n      ',
 '\r\n\t\t\t\r\n\t\t\t',
 '\r\n      \r\n      ',
 '\r\n\t\t\tYouth Services Librarian',
 '\r\n      ',
 '\r\n      ',
 '\r\n      ',
 '\r\n\t\t\t\r\n\t\t\t',
 '\r\n      \r\n      ',
 '\r\n\t\t\tTechnical Services Librarian',
 '\r\n       Ext: 2558',
 '\r\n      ',
 '\r\n      ',
 '\r\n\t\t\t\r\n\t\t\t',
 '\r\n      \r\n      ',
 '\r\n\t\t\tOutreach Librarian',
 '\r\n      ',
 '\r\n      ',
 '\r\n      ',
 '\r\n\t\t\t\r\n\t\t\t',
 '\r\n      \r\n      ',
 '\r\n\t\t\tFoundation Executive Director',
 '\r\n       Ext: 5456',
 '\r\n      ',
 '\r\n      ',
 '\r\n\t\t\t\r\n\t\t\t',
 '\r\n      \r\n',
 '\r\n',
 ' \xa0|\xa0 ',
 '\r\n']

I have managed to bring that into the desired format using the following code:

import scrapy
import re

class LibspiderSpider(scrapy.Spider):
    name = 'libspider'
    allowed_domains = ['wsldata.com']
    start_urls = ['https://wsldata.com/directory/record.cfm?LibID=48']
    # Note that start_urls contains multiple links, I just simplified it here to reduce cluttering
    
    def parse(self, response):        
           
        details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
        
        details_clean = []
        titles = []
        details = []

        for detail in details_raw:
            detail = re.sub(r'\t', '', detail)
            detail = re.sub(r'\n', '', detail)
            detail = re.sub(r'\r', '', detail)
            detail = re.sub(r'  ', '', detail)
            detail = re.sub(r' \xa0|\xa0 ', '', detail)
            detail = re.sub(r'|', '', detail)
            detail = re.sub(r' E', 'E', detail)
            if detail == '':
                pass
            elif detail == '|':
                pass
            else:
                details_clean.append(detail)
                if detail[0:3] != 'Ext':
                    titles.append(detail)

        for r in range(len(details_clean)):
            if r == 0:
                details.append(details_clean[r])  
            else:
                if details_clean[r-1][0:3] != 'Ext' and details_clean[r][0:3] != 'Ext':
                    details.append('-')
                    details.append(details_clean[r])
                else:
                    details.append(details_clean[r])
                    
        output = []
        for t in range(len(details)//2):  
            info = {
                "title": details[(t*2)],
                "phone": details[(t*2+1)],
            }
            output.append(info)

The block of code after the response.xpath line cleans the raw input into a nicer output. When I test that code outside of Scrapy, using the raw list shown at the top of this post as input, I get:

[{'title': 'Director', 'phone': 'Ext: 5442'}, {'title': 'Assistant Library Director', 'phone': 'Ext: 5433'}, {'title': 'Youth Services Librarian', 'phone': '-'}, {'title': 'Technical Services Librarian', 'phone': 'Ext: 2558'}, {'title': 'Outreach Librarian', 'phone': '-'}, {'title': 'FoundationExecutive Director', 'phone': 'Ext: 5456'}]
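A side note on the cleaning step: the 'FoundationExecutive Director' entry in the output above appears to come from the `r' E'` substitution, and the `r'|'` pattern probably does not remove the pipe as intended. The snippet below is a minimal sketch of those two behaviours using only the standard `re` module; the variable names are illustrative and not part of the original spider.

```python
import re

# An unescaped '|' is an alternation between two empty patterns, so it only
# matches the empty string and the literal pipe survives.
unescaped = re.sub(r'|', '', 'a|b')

# Escaping the pipe makes it a literal character.
escaped = re.sub(r'\|', '', 'a|b')

# r' E' removes the space before every capital E, not just the one in ' Ext'.
glued = re.sub(r' E', 'E', 'Foundation Executive Director')

# str.strip() removes the surrounding whitespace without touching inner spaces.
stripped = '\r\n       Ext: 5442'.strip()

print(unescaped)  # a|b
print(escaped)    # ab
print(glued)      # FoundationExecutive Director
print(stripped)   # Ext: 5442
```

Using `strip()` on each text node would avoid both problems, since it trims only the leading and trailing whitespace.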

When I put this code into Scrapy's parse(), my log doesn't show any items scraped and I get an empty JSON file.

yield is not present in the above code, as I have tried multiple ways to implement it and none of them worked. Am I missing a connection between Scrapy's response and yield, or is what I am trying to do not possible, so that I should just extract the raw list and process it outside Scrapy, like so:

    def parse(self, response):        
           
        details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
        yield{
            'details_in' : details_raw
        }

which extracts:

[
{"details_in": ["\r\n", "\r\n", "\r\n", "\r\n      \r\n      ", "\r\n\t\t\t", "\r\n      ", "\r\n      ", "\r\n      ", "\r\n\t\t\t\r\n\t\t\t", "\r\n      \r\n      ", "\r\n\t\t\tDirector", "\r\n       Ext: 5442", "\r\n      ", "\r\n      ", "\r\n\t\t\t\r\n\t\t\t", "\r\n      \r\n      ", "\r\n\t\t\tAssistant Library Director", "\r\n       Ext: 5433", "\r\n      ", "\r\n      ", "\r\n\t\t\t\r\n\t\t\t", "\r\n      \r\n      ", "\r\n\t\t\tYouth Services Librarian", "\r\n      ", "\r\n      ", "\r\n      ", "\r\n\t\t\t\r\n\t\t\t", "\r\n      \r\n      ", "\r\n\t\t\tTechnical Services Librarian", "\r\n       Ext: 2558", "\r\n      ", "\r\n      ", "\r\n\t\t\t\r\n\t\t\t", "\r\n      \r\n      ", "\r\n\t\t\tOutreach Librarian", "\r\n      ", "\r\n      ", "\r\n      ", "\r\n\t\t\t\r\n\t\t\t", "\r\n      \r\n      ", "\r\n\t\t\tFoundation Executive Director", "\r\n       Ext: 5456", "\r\n      ", "\r\n      ", "\r\n\t\t\t\r\n\t\t\t", "\r\n      \r\n", "\r\n", " \u00a0|\u00a0 ", "\r\n"]},
{"details_in": ["\r\n", "\r\n", "\r\n", "\r\n      \r\n      ", "\r\n\t\t\tBranch Librarian", "\r\n      ", "\r\n      ", "\r\n      ", "\r\n\t\t\t\r\n\t\t\t", "\r\n      \r\n", "\r\n", " \u00a0|\u00a0 ", "\r\n"]},
...
...
]

If you want to remove those whitespace-only lines from the list, you can use this (instead of regex):

>>> lst=['\r\n',
...  '\r\n',
...  '\r\n',
...  '\r\n      \r\n      ',
...  '\r\n\t\t\t',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n\t\t\t\r\n\t\t\t',
...  '\r\n      \r\n      ',
...  '\r\n\t\t\tDirector',
...  '\r\n       Ext: 5442',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n\t\t\t\r\n\t\t\t',
...  '\r\n      \r\n      ',
...  '\r\n\t\t\tAssistant Library Director',
...  '\r\n       Ext: 5433',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n\t\t\t\r\n\t\t\t',
...  '\r\n      \r\n      ',
...  '\r\n\t\t\tYouth Services Librarian',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n\t\t\t\r\n\t\t\t',
...  '\r\n      \r\n      ',
...  '\r\n\t\t\tTechnical Services Librarian',
...  '\r\n       Ext: 2558',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n\t\t\t\r\n\t\t\t',
...  '\r\n      \r\n      ',
...  '\r\n\t\t\tOutreach Librarian',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n\t\t\t\r\n\t\t\t',
...  '\r\n      \r\n      ',
...  '\r\n\t\t\tFoundation Executive Director',
...  '\r\n       Ext: 5456',
...  '\r\n      ',
...  '\r\n      ',
...  '\r\n\t\t\t\r\n\t\t\t',
...  '\r\n      \r\n',
...  '\r\n',
...  ' \xa0|\xa0 ',
...  '\r\n']
>>> newlst = [i.strip() for i in lst if i.strip()]
>>> newlst
['Director', 'Ext: 5442', 'Assistant Library Director', 'Ext: 5433', 'Youth Services Librarian', 'Technical Services Librarian', 'Ext: 2558', 'Outreach Librarian', 'Foundation Executive Director', 'Ext: 5456', '|']
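Building on the `strip()` approach, the stripped list still has to be paired into title/phone records and handed to Scrapy with `yield`. The following is a minimal sketch under a few assumptions: `pair_titles_with_phones` is a hypothetical helper name, phone entries are assumed to always start with `Ext`, and a missing phone is filled with `'-'` as in the question. The helper is a plain generator, so it can be tested without running a spider.

```python
def pair_titles_with_phones(raw_texts):
    """Yield {'title': ..., 'phone': ...} dicts from raw text nodes.

    Hypothetical helper: assumes every phone entry starts with 'Ext' and
    that any other non-empty entry is a job title.
    """
    # Drop whitespace-only nodes and the trailing '|' separator.
    cleaned = [t.strip() for t in raw_texts if t.strip() and t.strip() != '|']

    item = None
    for text in cleaned:
        if text.startswith('Ext'):
            # A phone entry belongs to the most recent title, if there is one.
            if item is not None:
                item['phone'] = text
        else:
            # A new title starts a new record; emit the previous one first.
            if item is not None:
                yield item
            item = {'title': text, 'phone': '-'}
    if item is not None:
        yield item


# Inside the spider, parse() would then pass each dict on to Scrapy:
#     def parse(self, response):
#         raw = response.xpath(<the XPath from the question>).getall()
#         yield from pair_titles_with_phones(raw)

sample = ['\r\n', '\r\n\t\t\tDirector', '\r\n       Ext: 5442',
          '\r\n\t\t\tYouth Services Librarian', '\r\n      ', ' \xa0|\xa0 ']
items = list(pair_titles_with_phones(sample))
print(items)
```

The key point is that `parse()` itself must yield (or return an iterable of) items; building a list such as `output` and never yielding it produces no scraped items.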

You can achieve the result you want by using the correct XPath selectors:

import scrapy


class LibspiderSpider(scrapy.Spider):
    name = 'libspider'
    allowed_domains = ['wsldata.com']
    start_urls = ['https://wsldata.com/directory/record.cfm?LibID=48']

    def parse(self, response):
        details_raw = response.xpath('//div[@class="main"]//div[@style="margin:16px 8px;"]')
        if details_raw:
            details_raw = details_raw[:-1]

        for detail in details_raw:
            item = dict()
            item['title'] = detail.xpath('./following-sibling::br[1]/following::text()').get(default='').strip()
            item['phone'] = detail.xpath('./following-sibling::br[2]/following::text()').get(default='-').strip()
            yield item

The XPath selectors look like this because, as you said, it's:

a site that has terrible html formatting

I'm sure you can find other XPath selectors that fit your needs, but this one isn't terrible =).
