Web scraping in Python - extract a value from a website
I'm trying to extract two values from this website:
bizportal.co.il
One value is the dollar rate on the right, and the other, on the left, is the drop/rise in percent.
The problem is that after I get the dollar rate value, the number is rounded for some reason (you can see it in the terminal). I want to get the exact number as shown on the website.
Is there some friendly documentation for web scraping in Python?
PS: how can I get rid of the pop-up Python terminal window when running code in VS? I just want the output to appear in VS, in the interactive window.
from urllib.request import urlopen
from bs4 import BeautifulSoup
my_url = "https://www.bizportal.co.il/forex/quote/generalview/22212222"
uClient = urlopen(my_url)
page_html = uClient.read()
uClient.close()
page_soup = BeautifulSoup(page_html, "html.parser")
div_class = page_soup.findAll("div",{"class":"data-row"})
print (div_class)
#print(div_class[0].text)
#print(div_class[1].text)
The data is loaded dynamically via Ajax, but you can simulate this request with the requests module:
import json
import requests
url = 'https://www.bizportal.co.il/forex/quote/generalview/22212222'
ajax_url = "https://www.bizportal.co.il/forex/quote/AjaxRequests/DailyDeals_Ajax?paperId={paperId}&take=20&skip=0&page=1&pageSize=20"
paper_id = url.rsplit('/')[-1]
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
data = requests.get(ajax_url.format(paperId=paper_id), headers=headers).json()
# uncomment this to print all data:
#print(json.dumps(data, indent=4))
# print first one
print(data['Data'][0]['rate'], data['Data'][0]['PrecentageRateChange'])
Prints:
3.4823 -0.76%
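If the goal is to keep the number exactly as the site sends it, one option is to parse the JSON with Decimal instead of float. This is only a minimal sketch, assuming the response has the shape used above (a `Data` list whose items carry `rate` and `PrecentageRateChange`). The sample payload and the helper name `first_quote` are made up for illustration, not taken from the site.

```python
import json
from decimal import Decimal

# Hypothetical sample shaped like the response above; not real site data.
SAMPLE = '{"Data": [{"rate": 3.4823, "PrecentageRateChange": "-0.76%"}]}'

def first_quote(raw_json):
    """Return (rate, percent_change) from the first entry in Data."""
    # parse_float=Decimal keeps the digits exactly as written in the JSON text
    data = json.loads(raw_json, parse_float=Decimal)
    first = data["Data"][0]
    # str() lets this work whether the field arrives as a number or a string
    rate = Decimal(str(first["rate"]))
    # Strip the trailing % sign before converting the change to a number
    change = Decimal(str(first["PrecentageRateChange"]).rstrip("%"))
    return rate, change

rate, change = first_quote(SAMPLE)
print(rate, change)
```

With requests, the raw text of the response (its `.text` attribute) could be passed to `first_quote` instead of calling `.json()`.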
The problem is that this element is updated dynamically with JavaScript, so you will not be able to scrape the up-to-date value with urllib or requests. When the page is loaded, it has a recent value populated (likely from a database), which is then replaced with the real-time number via JavaScript.
In this case it would be better to use something like Selenium to load the webpage. This allows the JavaScript to execute on the page, after which you can scrape the numbers.
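from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
options = Options()
options.add_argument("--headless") # allows you to scrape the page without opening the browser window
# Selenium 4 API; Selenium 3 used webdriver.Chrome('./chromedriver', options=options)
driver = webdriver.Chrome(service=Service('./chromedriver'), options=options)
driver.get("https://www.bizportal.co.il/forex/quote/generalview/22212222")
time.sleep(1) # put in to allow JS time to load, sometimes works without.
# Selenium 4 replaced find_elements_by_class_name with find_elements(By.CLASS_NAME, ...)
values = driver.find_elements(By.CLASS_NAME, 'num')
price = values[0].get_attribute("innerHTML")
change = values[1].find_element(By.CSS_SELECTOR, "span").get_attribute("innerHTML")
print(price, "\n", change)
driver.quit() # close the headless browser when done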
Output:
╰─$ python selenium_scrape.py
3.483
-0.74%
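A fixed `time.sleep(1)` is a guess: too short and the value may still be stale, too long and the script waits for nothing. The usual alternative is to poll until the element is ready. Selenium ships `WebDriverWait` for this purpose. The sketch below shows the same idea in plain Python so it can be run without a browser; `wait_until` and `fake_read` are hypothetical names used only for illustration.

```python
import time

def wait_until(read_value, timeout=5.0, interval=0.1):
    """Call read_value() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not appear in time")
        time.sleep(interval)

# Stand-in for a page whose number only shows up on the third look.
attempts = {"count": 0}

def fake_read():
    attempts["count"] += 1
    return "3.483" if attempts["count"] >= 3 else ""

result = wait_until(fake_read, timeout=2.0, interval=0.01)
print(result)
```

In the Selenium script, `read_value` would be a small function that looks up the `num` elements and returns their text once they are present.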
You should familiarize yourself with Selenium and understand how to set it up and run it. This includes installing the browser (in this case I am using Chrome, but you can use others), knowing where to get the browser driver (Chromedriver in this case), and understanding how to parse the page. You can learn all about it here: https://www.selenium.dev/documentation/en/