
Trouble running my class crawler

I am writing a class-based crawler in Python and got stuck halfway. I can't figure out how to pass the newly produced links (generated by the app_crawler class) to the App class so that I can do the rest of the work there. If anyone could point me in the right direction by showing how to get it running, I would be very grateful. Thanks in advance. By the way, the code does run, but only for a single link.

from lxml import html
import requests

class app_crawler:

    starturl = "https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8"

    def crawler(self):
        self.get_app(self.starturl)


    def get_app(self, link):
        page = requests.get(link)
        tree = html.fromstring(page.text)
        links = tree.xpath('//div[@class="lockup-info"]//*/a[@class="name"]/@href')
        for link in links:
            return link # I wish to make this link penetrate through the App class but can't get any idea


class App(app_crawler):

    def __init__(self, link):
        self.links = [link]

    def process_links(self):
        for link in self.links:
            self.get_item(link)

    def get_item(self, url):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        name = tree.xpath('//h1[@itemprop="name"]/text()')[0]
        developer = tree.xpath('//div[@class="left"]/h2/text()')[0]        
        price = tree.xpath('//div[@itemprop="price"]/text()')[0]
        print(name, developer, price)

if __name__ == '__main__':

    parse = App(app_crawler.starturl)
    parse.crawler()
    parse.process_links()

I've created another version that works fine, but I wanted to give the crawler above a different structure. Here is the link to the working one: https://www.dropbox.com/s/galjorcdynueequ/Working%20one.txt?dl=0

There are several issues with your code:

  • App inherits from app_crawler, yet you pass an app_crawler instance to App.__init__.

  • App.__init__ calls app_crawler.__init__ instead of super().__init__().

  • Not only does app_crawler.get_app not actually return anything, it also creates a brand-new App object.

As a result, your code passes an app_crawler object to requests.get instead of a URL string.

You have too much encapsulation in your code.

Consider the following code, which is shorter and cleaner than your non-working code and does not needlessly pass objects around:

from lxml import html
import requests

class App:
    def __init__(self, starturl):
        self.starturl = starturl
        self.links = []

    def get_links(self):
        page = requests.get(self.starturl)
        tree = html.fromstring(page.text)
        self.links = tree.xpath('//div[@class="lockup-info"]//*/a[@class="name"]/@href')

    def process_links(self):
        for link in self.links:
            self.get_docs(link)

    def get_docs(self, url):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        name = tree.xpath('//h1[@itemprop="name"]/text()')[0]
        developer = tree.xpath('//div[@class="left"]/h2/text()')[0]
        price = tree.xpath('//div[@itemprop="price"]/text()')[0]
        print(name, developer, price)

if __name__ == '__main__':
    parse = App("https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8")
    parse.get_links()
    parse.process_links()

Output:

Cookie Jam By Jam City, Inc. Free
Zombie Tsunami By Mobigame Free
Flow Free By Big Duck Games LLC Free
Bejeweled Blitz By PopCap Free
Juice Jam By Jam City, Inc. Free
Candy Crush Soda Saga By King Free
Bubble Witch 3 Saga By King Free
Candy Crush Jelly Saga By King Free
Farm Heroes Saga By King Free
Pet Rescue Saga By King Free
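
If you would rather keep link collection and item processing in two separate classes, as in the question, one option is to have the crawler return the whole list of links and hand that list to App, rather than relating the two through inheritance. Note that a `return` placed inside a `for` loop ends the method on the first iteration, which is consistent with the original code running for only a single link. The following is only a minimal sketch under assumptions: `LinkCrawler`, `fake_extract_links`, and `fake_get_item` are hypothetical names invented here, and the two fake functions stand in for the requests/lxml calls so the example runs without network access.

```python
class LinkCrawler:
    """Finds links; it does not know how they will be processed."""

    def __init__(self, starturl, extract_links):
        self.starturl = starturl
        # extract_links: callable taking a URL and returning an iterable of URLs
        self.extract_links = extract_links

    def get_links(self):
        # Return the whole list. A `return` inside a for loop would hand back
        # only the first link and end the method.
        return list(self.extract_links(self.starturl))


class App:
    """Processes links that were collected elsewhere."""

    def __init__(self, links):
        self.links = list(links)

    def process_links(self, get_item):
        # get_item: callable taking a URL and returning the scraped record
        return [get_item(link) for link in self.links]


def fake_extract_links(url):
    # Hypothetical stand-in for requests.get(...) plus tree.xpath(...)
    return [url + "/related-1", url + "/related-2"]


def fake_get_item(url):
    # Hypothetical stand-in for the name/developer/price lookup
    return "item from " + url


crawler = LinkCrawler("https://example.com/app", fake_extract_links)
app = App(crawler.get_links())
results = app.process_links(fake_get_item)
print(results)
```

In real use you would replace the two fake functions with ones that fetch the page and apply the XPath expressions shown above.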

This is how I expected my code to work:

from lxml import html
import requests

class app_crawler:

    starturl = "https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8"

    def __init__(self):
        self.links = [self.starturl]


    def crawler(self):
        for link in self.links:
            self.get_app(link)


    def get_app(self, link):
        page = requests.get(link)
        tree = html.fromstring(page.text)
        links = tree.xpath('//div[@class="lockup-info"]//*/a[@class="name"]/@href')
        for link in links:
            if len(self.links) < 5:
                self.links.append(link)


class App(app_crawler):

    def __init__(self):
        app_crawler.__init__(self)


    def process_links(self):
        for link in self.links:
            self.get_item(link)

    def get_item(self, url):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        name = tree.xpath('//h1[@itemprop="name"]/text()')[0]
        developer = tree.xpath('//div[@class="left"]/h2/text()')[0]        
        price = tree.xpath('//div[@itemprop="price"]/text()')[0]
        print(name, developer, price)

if __name__ == '__main__':

    scrape = App()
    scrape.crawler()
    scrape.process_links()
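
The crawler method above works because a Python `for` loop over a list also visits items appended during the loop, and the cap of five links keeps it from growing without bound. If you prefer to make that behaviour explicit, here is a rough sketch using a queue. It is an illustration under assumptions, not a drop-in replacement: `crawl`, `get_links`, and `FAKE_SITE` are hypothetical names, the fake link graph replaces the real pages, and the duplicate check via `seen` is an addition that the code above does not have.

```python
from collections import deque


def crawl(starturl, get_links, limit=5):
    """Collect at most `limit` unique links, breadth-first."""
    seen = {starturl}            # links already collected
    order = [starturl]           # links in the order they were found
    queue = deque([starturl])    # links whose pages still need visiting
    while queue and len(order) < limit:
        current = queue.popleft()
        for link in get_links(current):
            if link not in seen and len(order) < limit:
                seen.add(link)
                order.append(link)
                queue.append(link)
    return order


# Hypothetical link graph standing in for real pages
FAKE_SITE = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d", "e", "f"],
}

links = crawl("a", lambda url: FAKE_SITE.get(url, []), limit=5)
print(links)  # ['a', 'b', 'c', 'd', 'e']
```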
