How can I obtain info from this website using Python Scrapy?

I've written this code, but I can't get the results I expect. This is my first attempt at this, and I don't know what I'm doing wrong. When I run it, I only get info for the teams at the top of the page, not the other ones.

import scrapy
from bs4 import BeautifulSoup
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join

class FichaClub(Item):
    nombre = Field()
    email = Field()
    zona = Field()

class SacaClubes(CrawlSpider):
    name="Spider100"
    start_urls = ["http://www.ecuafutbol.org/web/asociaciones.php"]
    allowed_domains = ['ecuafutbol.org']

    rules = (
        Rule(LinkExtractor(allow='asociacion_detalle.php*')),
        Rule(LinkExtractor(allow='club.php*'), callback= 'parse_items'),
    )

    def parse_items(self, response):
        item = scrapy.loader.ItemLoader(FichaClub(), response)
        item.add_xpath('email','//a[starts-with(@href, "mail")]/text()')
        item.add_xpath('nombre','//*[@id="gallery-post-1511"]/article/div/div/div/p/strong[1]/text()')
        yield item.load_item()

Correct me if I'm wrong, but it looks like you are trying to scrape the teams from the bottom table. To scrape this data, your parse_items has to search for <div class="table-responsive">.

You can then iterate through the matches and print the team names, or do whatever else you'd like with them. Here is an example of what I'd try:

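import re
import requests  # assumption: any HTTP client works; BeautifulSoup needs HTML text, not a list of URLs
from bs4 import BeautifulSoup

url = "http://www.ecuafutbol.org/web/asociaciones.php"
soccer = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soccer.findAll("div", class_="table-responsive")
teams = []
for line in table:
    # re.findall needs a string, so run it on the tag's text rather than on the Tag object
    team_found = re.findall(r'([A-Z]\w+-*\w*)', line.get_text())
    teams = teams + team_found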

Try this out. If it has issues, tinker with the line table = soccer.findAll("div", class_="table-responsive") and change the class name to match other elements inside that table. Make sure to use Chrome's Inspect feature to pick apart the HTML. Hope this was helpful!
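One way to do that tinkering without hitting the site on every attempt is to paste a fragment copied from the Inspect panel into a string and compare candidate selectors against it. This is only a sketch: the fragment below is invented, including the club class on the cells, and texts_for is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Invented fragment; replace it with markup copied from the browser's
# Inspect panel. The "club" class is an assumption for illustration.
SAMPLE_HTML = """
<div class="table-responsive">
  <table>
    <tr><td class="club"><a href="club.php?id=1">Club A</a></td><td>Zone 1</td></tr>
    <tr><td class="club"><a href="club.php?id=2">Club B</a></td><td>Zone 2</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def texts_for(css_selector):
    """Return the stripped text of every element matching the selector."""
    return [el.get_text(strip=True) for el in soup.select(css_selector)]


# Try a broad selector and a narrower one, then compare what each matches.
all_cells = texts_for("div.table-responsive td")
club_cells = texts_for("div.table-responsive td.club")
print(len(all_cells), all_cells)
print(len(club_cells), club_cells)
```

If the narrower selector returns only the names you want, it is the one to carry back into the scraper.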
