
How can I make lxml read two pages into the tree?

I'm using lxml to parse a top-100 list from a site that tracks the prices of roughly the top 1000 crypto coins. How can I add the second page to my tree, in case one of the coins I'm tracking falls out of the top 100 and ends up on page two? Link to my code: https://github.com/cbat971/CoinScraping/blob/master/WebCrawl.py

I've tried making a "page2" variable, adding a "," to the page variable, and adding a "+" to the page variable.

from lxml import html
import requests
import datetime
import time

page = requests.get('https://coinmarketcap.com/', 'https://coinmarketcap.com/2')
tree = html.fromstring(page.content)

If all 100 coins I have on the list are on page one, there is no problem. But as soon as one gets pushed to page two, there is an error and no coins after that get processed through the for statement at the end.

You could try concatenating the HTML of both pages using

page1.content + page2.content

but it will not work, because lxml expects only one <html> and one <body>: it parses the first page and skips the others.

Run this code and you get only one <body>:

from lxml import html
import requests

page1 = requests.get('https://coinmarketcap.com/')
page2 = requests.get('https://coinmarketcap.com/2')

tree = html.fromstring(page1.content + page2.content)

print(tree.cssselect('body'))

You have to process every page separately (read it, parse it, and get the values from the HTML) and add the results to one list/dictionary.

This code gives two <body> elements, one per page:

from lxml import html
import requests

for url in ('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    print(tree.cssselect('body'))
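
A single call cannot fetch both URLs, which is why the loop makes one request per page. As far as I know, requests.get is defined as get(url, params=None, **kwargs), so a second positional argument is treated as query parameters rather than as another URL. A minimal sketch that builds such a request without sending it (the `prepared` name is just for illustration):

```python
import requests

# Build the request the way requests.get(url, second_arg) would, but do not send it.
# The second positional argument of requests.get is `params`, so the second
# URL string ends up in the query string of the first URL.
prepared = requests.Request(
    'GET',
    'https://coinmarketcap.com/',
    params='https://coinmarketcap.com/2',
).prepare()

# Prints one URL: page one, with the second string attached as its query.
print(prepared.url)
```

So the original call only ever asks for page one, and page two is never downloaded.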

EDIT:

from lxml import html
import requests

data = {
    'BTC': 'id-bitcoin',
    'TRX': 'id-tron',
    # ...
    'HC': 'id-hypercash',
    'XZC': 'id-zcoin',
}    

all_results = {}

for url in ('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'):
    page = requests.get(url)
    tree = html.fromstring(page.content)

    print(tree.cssselect('body'))

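    for key, val in data.items():
        # Look up the row by its id; the price is the link text in the 4th cell.
        result = tree.xpath('//*[@id="' + val + '"]/td[4]/a/text()')
        print(key, result)

        # Keep the price only if this page contains the coin.
        if result:
            all_results[key] = result[0]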

print('---')
print(all_results)            

Result:

[<Element body at 0x7f6ba576cd68>]
BTC ['$6144.33']
TRX ['$0.023593']
HC []
XZC []
[<Element body at 0x7f6ba57fb4f8>]
BTC []
TRX []
HC ['$1.05']
XZC ['$6.25']
---
{'BTC': '$6144.33', 'TRX': '$0.023593', 'HC': '$1.05', 'XZC': '$6.25'} 
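
The same per-page approach can be tried offline. The sketch below is only an illustration: `PAGE_ONE` and `PAGE_TWO` are made-up stand-ins that are far simpler than the real site, and `extract_prices` and `ROW_IDS` are hypothetical names. It parses each page on its own, merges the results into one dictionary, and then lists the coins that were not found on any page, so a missing coin does not stop the rest from being processed.

```python
from lxml import html

# Stand-in markup for two listing pages (hypothetical, not the real site's HTML).
PAGE_ONE = """
<html><body><table>
  <tr id="id-bitcoin"><td>1</td><td>Bitcoin</td><td>-</td><td><a>$6144.33</a></td></tr>
</table></body></html>
"""
PAGE_TWO = """
<html><body><table>
  <tr id="id-hypercash"><td>101</td><td>HyperCash</td><td>-</td><td><a>$1.05</a></td></tr>
</table></body></html>
"""

# Symbol -> row id, in the same shape as the `data` dictionary above.
ROW_IDS = {'BTC': 'id-bitcoin', 'HC': 'id-hypercash', 'XZC': 'id-zcoin'}


def extract_prices(page_source, row_ids):
    """Parse one page and return {symbol: price} for the rows present on it."""
    tree = html.fromstring(page_source)
    found = {}
    for symbol, row_id in row_ids.items():
        # $rid is an XPath variable filled in from the keyword argument.
        cells = tree.xpath('//*[@id=$rid]/td[4]/a/text()', rid=row_id)
        if cells:
            found[symbol] = cells[0]
    return found


# Parse every page separately and merge the per-page results.
all_results = {}
for source in (PAGE_ONE, PAGE_TWO):
    all_results.update(extract_prices(source, ROW_IDS))

# Coins that did not appear on any of the parsed pages.
missing = sorted(set(ROW_IDS) - set(all_results))

print(all_results)
print('not found on any page:', missing)
```

With real pages, `source` would be `requests.get(url).content` for each URL, as in the loop above.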
