I'm using lxml to parse a top-100 list from a site that tracks the prices of roughly the top 1000 crypto coins. How can I add the second page to my tree, in case one of the top-100 coins I'm tracking falls below rank 100 and ends up on page two? Link to my code: https://github.com/cbat971/CoinScraping/blob/master/WebCrawl.py
I've tried making a "page2" variable, adding a "," to the page variable, and adding a "+" to the page variable.
from lxml import html
import requests
import datetime
import time
page = requests.get('https://coinmarketcap.com/', 'https://coinmarketcap.com/2')
tree = html.fromstring(page.content)
If all 100 coins on my list are on page one, there is no problem. But as soon as one gets pushed to page two, there is an error, and no coins after it get processed by the for statement at the end.
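A note on the call in the question: in requests, the second positional argument of requests.get is params, not another URL, so the second address is attached to the first request rather than fetched on its own. A minimal sketch that builds the same request without sending it (no network access needed) to show where the second URL ends up:

```python
import requests

first_url = 'https://coinmarketcap.com/'
second_url = 'https://coinmarketcap.com/2'

# Build the request the same way requests.get(first_url, second_url) would,
# but only prepare it instead of sending it.
prepared = requests.Request('GET', first_url, params=second_url).prepare()

# One request, one URL: the second address is now part of the query string.
print(prepared.url)
```

Only one request is ever prepared, so only one page can come back from that call.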
You could try concatenating the HTML of both pages using page1.content + page2.content, but it will not work: lxml expects only one <html> and one <body>, so it parses only the first page and skips the others.
Run this code and you get only one <body>:
from lxml import html
import requests
page1 = requests.get('https://coinmarketcap.com/')
page2 = requests.get('https://coinmarketcap.com/2')
tree = html.fromstring(page1.content + page2.content)
print(tree.cssselect('body'))
You have to process every page separately: read it, parse it, and get the values from its HTML. Then add the results to one list or dictionary.
This code gives two <body> elements, one for each page:
from lxml import html
import requests
for url in ('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    print(tree.cssselect('body'))
EDIT:
from lxml import html
import requests
data = {
    'BTC': 'id-bitcoin',
    'TRX': 'id-tron',
    # ...
    'HC': 'id-hypercash',
    'XZC': 'id-zcoin',
}
all_results = {}
for url in ('https://coinmarketcap.com/', 'https://coinmarketcap.com/2'):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    print(tree.cssselect('body'))
    # look up every tracked coin on the current page
    for key, val in data.items():
        result = tree.xpath('//*[@id="' + val + '"]/td[4]/a/text()')
        print(key, result)
        # an empty list means the coin is not on this page
        if result:
            all_results[key] = result[0]
print('---')
print(all_results)
Result:
[<Element body at 0x7f6ba576cd68>]
BTC ['$6144.33']
TRX ['$0.023593']
HC []
XZC []
[<Element body at 0x7f6ba57fb4f8>]
BTC []
TRX []
HC ['$1.05']
XZC ['$6.25']
---
{'BTC': '$6144.33', 'TRX': '$0.023593', 'HC': '$1.05', 'XZC': '$6.25'}
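As a possible follow-up, once all_results is filled you may want to know which tracked coins were not found on either page, and to turn the price strings into numbers. A minimal sketch, assuming prices look like '$6144.33' (optionally with thousands separators); parse_price, missing_coins, and the extra 'DOGE' entry are hypothetical names used only for illustration:

```python
from decimal import Decimal


def parse_price(text):
    """Convert a scraped price such as '$6144.33' or '$6,144.33' to Decimal."""
    return Decimal(text.strip().lstrip('$').replace(',', ''))


def missing_coins(wanted, found):
    """Return the symbols in `wanted` that are absent from `found`, sorted."""
    return sorted(set(wanted) - set(found))


# Sample values copied from the result above.
all_results = {'BTC': '$6144.33', 'TRX': '$0.023593', 'HC': '$1.05', 'XZC': '$6.25'}

# Symbols we hoped to find; 'DOGE' is a made-up entry that was never scraped.
wanted = ['BTC', 'TRX', 'HC', 'XZC', 'DOGE']

prices = {symbol: parse_price(text) for symbol, text in all_results.items()}
not_found = missing_coins(wanted, all_results)

print(prices)
print(not_found)
```

Decimal is used here instead of float to avoid rounding surprises with currency values. If not_found is non-empty after both pages, the coin has probably dropped past page two and more pages would need to be added to the loop.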