
Scraping a website using Python to search for a specific value

Language: Python
Website: https://www.curseforge.com/minecraft/mc-mods/ae2-extras/files/3120250
Goal: get the project id and store it as a variable

Snippet from website

<div class="w-full flex justify-between">
    <span>Project ID</span>
    <span>421104</span>
</div>

I want to store the project ID 421104 in a variable. I tried using lxml to get all the divs with the class attribute 'w-full flex justify-between', but the result is empty.

My code:

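from lxml import html
import requests

url = "https://www.curseforge.com/minecraft/mc-mods/ae2-extras/files/3120250"

page = requests.get(url)
doc = html.fromstring(page.content)
# Select every div whose class attribute matches exactly
divs = doc.xpath("//div[@class='w-full flex justify-between']")
print(divs)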

Output: []

What am I doing wrong? I have requests and lxml installed in my environment.
Once I get the list of divs, how can I scrape the 421104 from the first div and store it in a local variable?


EDIT 2: I've solved it. The issue was that the initial request was being blocked by Cloudflare. I posted my solution as an answer.

Solution:

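from lxml import html
import cloudscraper

url = "https://www.curseforge.com/minecraft/mc-mods/ae2-extras/files/3120250"

# cloudscraper handles the Cloudflare check that blocked the plain requests call
scraper = cloudscraper.create_scraper()
page = scraper.get(url).text

doc = html.fromstring(page)
divs = doc.xpath("//div[@class='w-full flex justify-between']")
# The first div's text is "Project ID" followed by the number; the ID is the last token
el = divs[0].text_content()
projectID = el.split()[-1]
print(projectID)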
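
A possible variation, in case other divs on the page share the same classes: look for the span labelled "Project ID" and read the span that follows it. This is only a sketch run against the snippet from the question, not the live page; the function name extract_project_id and the SNIPPET constant are made up for illustration.

```python
from lxml import html

# Markup copied from the snippet in the question; no network request is made.
SNIPPET = """
<div class="w-full flex justify-between">
    <span>Project ID</span>
    <span>421104</span>
</div>
"""

def extract_project_id(page_source):
    """Return the text of the span that follows the 'Project ID' label, or None."""
    doc = html.fromstring(page_source)
    # Find the label span, then take the text of its first following sibling span.
    values = doc.xpath(
        "//span[normalize-space()='Project ID']/following-sibling::span[1]/text()"
    )
    if not values:
        return None
    return values[0].strip()

project_id = extract_project_id(SNIPPET)
print(project_id)
```

With the real page, the string returned by scraper.get(url).text would be passed in place of SNIPPET, assuming the live markup still matches the snippet.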

Maybe you got <Response [403]> when you print(page). HTTP 403 is a status code meaning that access to the requested resource is forbidden.
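
One way to spot this early is to look at the status code before parsing anything. A rough sketch using only the standard library; describe_status and is_forbidden are hypothetical helper names, and the number passed in would come from page.status_code on the requests response.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Turn a numeric HTTP status into a short label such as '403 Forbidden'."""
    status = HTTPStatus(status_code)
    return f"{status.value} {status.phrase}"

def is_forbidden(status_code):
    """True when the server refused access (HTTP 403)."""
    return status_code == HTTPStatus.FORBIDDEN

# Example with the status from the question; in real code use page.status_code.
code = 403
if is_forbidden(code):
    print("Request was refused:", describe_status(code))
```

If the check reports 403, the xpath result will be empty simply because the returned body is not the page you expected.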


 