
How can I make lxml save two pages into one tree so both can be read?

I'm parsing a top-100 list from a site that tracks crypto coin prices for roughly the top 1000 coins, using lxml. How can I add the second page to my tree, in case one of the top-100 coins I'm tracking falls below the top 100 and ends up on page two? Link to my code: https://github.com/cbat971/CoinScraping/blob/master/WebCrawl.py

I've tried making a "page2" variable, adding "," to the page variable, and adding a "+" to the page variable.

from lxml import html
import requests
import datetime
import time

page = requests.get('https://coinmarketcap.com/', 'https://coinmarketcap.com/2')
tree = html.fromstring(page.content)
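It may help to see why passing two URLs to one `requests.get()` call cannot work: the second positional argument of `requests.get()` is `params` (the query string), so only the first URL is ever fetched. This can be checked without any network access by preparing the request and inspecting the URL it would send:

```python
from requests import Request

# The second positional argument becomes `params`, so the two-URL
# call builds a single request whose query string is the second URL.
# Page two is never fetched.
req = Request('GET', 'https://coinmarketcap.com/',
              params='https://coinmarketcap.com/2').prepare()
print(req.url)  # one URL, with the second URL tacked on as a query string
```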

If all 100 coins on my list are on page one, there is no problem. But as soon as one gets pushed to page two, there is an error and no coins after that get processed by the for statement at the end.

You could try concatenating both HTML responses using

page1.content + page2.content

but it will not work, because lxml expects only one <html> and one <body>, so it will parse only the first page and skip the others.

Run this code and you get only one <body>:

from lxml import html
import requests

page1 = requests.get('https://coinmarketcap.com/')
page2 = requests.get('https://coinmarketcap.com/2')

tree = html.fromstring(page1.content + page2.content)

print(tree.cssselect('body'))

You have to process every page separately - read it, parse it, and extract the values from the HTML - then add the results to one list/dictionary.

This code gives two <body> elements:

from lxml import html
import requests

for url in ('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    print(tree.cssselect('body'))

EDIT:

from lxml import html
import requests

data = {
    'BTC': 'id-bitcoin',
    'TRX': 'id-tron',
    # ...
    'HC': 'id-hypercash',
    'XZC': 'id-zcoin',
}    

all_results = {}

for url in ('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'):
    page = requests.get(url)
    tree = html.fromstring(page.content)

    print(tree.cssselect('body'))

    for key, val in data.items():

        result = tree.xpath('//*[@id="' + val + '"]/td[4]/a/text()')

        print(key, result)

        if result:
            all_results[key] = result[0]

print('---')
print(all_results)            

Result:

[<Element body at 0x7f6ba576cd68>]
BTC ['$6144.33']
TRX ['$0.023593']
HC []
XZC []
[<Element body at 0x7f6ba57fb4f8>]
BTC []
TRX []
HC ['$1.05']
XZC ['$6.25']
---
{'BTC': '$6144.33', 'TRX': '$0.023593', 'HC': '$1.05', 'XZC': '$6.25'} 
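Building on the loop above, one possible refinement (a sketch of mine, not part of the original answer) is to skip coins already found and stop fetching once every tracked coin has a price. The `fetch` parameter is a hypothetical injection point so the pagination logic can be exercised with canned HTML instead of live requests:

```python
from lxml import html

def scrape_prices(urls, coin_ids, fetch):
    """Collect one price per coin across paginated pages.

    `fetch(url)` must return the page's HTML bytes; it is injected
    here so the logic can be tested without network access.
    """
    all_results = {}
    for url in urls:
        tree = html.fromstring(fetch(url))
        for key, val in coin_ids.items():
            if key in all_results:
                continue  # already found on an earlier page
            result = tree.xpath('//*[@id="' + val + '"]/td[4]/a/text()')
            if result:
                all_results[key] = result[0]
        if len(all_results) == len(coin_ids):
            break  # every tracked coin found; skip remaining pages
    return all_results
```

With live pages this would be called as `scrape_prices(('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'), data, lambda u: requests.get(u).content)`.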
