
How to scrape a number from this webpage (in Python)

I would appreciate it if someone could show me how to extract the number "28,050".

I used to get that number with this code (Python 3):

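import requests
import bs4

# Download the home page and parse it with the lxml parser
res_bonbast = requests.get('https://bonbast.com/')
soup_bonbast = bs4.BeautifulSoup(res_bonbast.text, "lxml")
# Read the text of the element with id "usd1_top" and convert it to an integer
int(float(soup_bonbast.select('#usd1_top')[0].getText()))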

But recently they seem to have changed something.

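Your problem is that this value is not populated until after the page has loaded. As your script shows, the HTML for that element really is blank. You can confirm what happens in a browser by opening the dev tools and watching the Network tab: you first receive HTML in which this element is blank, and later a call to https://bonbast.com/json returns the values used to populate it.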

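What you need to do is make the request to bonbast.com/json yourself and extract the value you want from the JSON, rather than parsing HTML. The key you are looking for there is usd1.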
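As a minimal sketch of that idea, the snippet below decodes a JSON document and reads the `usd1` key. The function name `extract_rate` is hypothetical, and the sample string only imitates the shape of the endpoint's response; it is not live data, and fetching the real response may need extra request details.

```python
import json

def extract_rate(json_text, key="usd1"):
    """Decode a JSON document and return the value under `key` as an int."""
    rates = json.loads(json_text)
    # Most values arrive as strings (e.g. "28420"), so go through float()
    # first in case a value carries a decimal part.
    return int(float(rates[key]))

# Hand-written sample imitating the response format (an assumption).
sample = '{"usd1": "28420", "usd2": "28320", "bourse": "1904324.2"}'
print(extract_rate(sample))            # 28420
print(extract_rate(sample, "bourse"))  # 1904324
```

Going through `float()` before `int()` mirrors the conversion in the question's original code.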

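The bonbast.com/json endpoint may require additional details in the headers. I captured the curl request below by visiting bonbast.com, opening the Network tab of my dev tools (in Chrome, Ctrl+Shift+I >> Network), and finding the request to bonbast.com/json. I then right-clicked it and chose "Copy as cURL".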

curl 'https://bonbast.com/json' \
   -H 'authority: bonbast.com' \
   -H 'sec-ch-ua: "Chromium";v="95", ";Not A Brand";v="99"' \
   -H 'accept: application/json, text/javascript, */*; q=0.01' \
   -H 'content-type: application/x-www-form-urlencoded; charset=UTF-8' \
   -H 'x-requested-with: XMLHttpRequest' \
   -H 'sec-ch-ua-mobile: ?0' \
   -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36' \
   -H 'sec-ch-ua-platform: "Linux"' \
   -H 'origin: https://bonbast.com' \
   -H 'sec-fetch-site: same-origin' \
   -H 'sec-fetch-mode: cors' \
   -H 'sec-fetch-dest: empty' \
   -H 'referer: https://bonbast.com/' \
   -H 'accept-language: en-US,en;q=0.9' \
   -H 'cookie: st_bb=0; _gid=GA1.2.587414378.1636538685; __gads=ID=2f6e05bb70db575d-2208cfa441cc00d3:T=1636538685:RT=1636538685:S=ALNI_MaKL18-XZaWbbhlmh2h3RGvYmVKRw; _ga_PZF6SDPF22=GS1.1.1636562265.2.0.1636562265.0; _ga=GA1.2.633937873.1636538685; _gat_gtag_UA_35412804_1=1' \
   --data-raw 'data=0d7e26d17fde20e86b760b00127132d4%2CfTtTZ%2C2021-11-10-16-38-37&webdriver=false' \
   --compressed

The result is:

{ "try1": "2890",
  "month": 8,
  "emami1": "12450000",
  "afn2": "309",
  "afn1": "311",
  "rub2": "397",
  "azadi1_22": "6250000",
  "bhd2": "74870",
  "azn1": "16730",
  "bhd1": "75370",
  "azadi1g": "2350000",
  "bourse": "1904324.2",
  "try2": "2870",
  "cny1": "4450",
  "cny2": "4430",
  "cad1": "22860",
  "cad2": "22760",
  "jpy1": "2495",
  "thb1": "865",
  "usd1": "28420",
  "usd2": "28320",
  "thb2": "860",
  "azn2": "16630",
  "dkk1": "4400",
  "amd2": "590",
  "day": 19,
  "minute": "41",
  "amd1": "595",
  "bitcoin": "68616.85",
  "hour": "20",
  "sar2": "7545",
  "rub1": "400",
  "azadi1g2": "2250000",
  "azadi12": "12000000",
  "eur1": "32725",
  "eur2": "32575",
  "emami12": "12250000",
  "second": "45",
  "omr1": "73825",
  "year": 1400,
  "chf2": "30855",
  "chf1": "31005",
  "azadi1_42": "3700000",
  "jpy2": "2485",
  "kwd2": "93795",
  "kwd1": "94195",
  "sek1": "3280",
  "gbp2": "38090",
  "gbp1": "38290",
  "sek2": "3265",
  "myr1": "6850",
  "myr2": "6820",
  "omr2": "73525",
  "azadi1": "12350000",
  "azadi1_2": "6400000",
  "aud2": "20805",
  "azadi1_4": "3800000",
  "aud1": "20905",
  "dkk2": "4380",
  "inr2": "380",
  "inr1": "382",
  "last_modified": "November 10, 2021 16:00",
  "aed2": "7715",
  "aed1": "7735",
  "iqd2": "1935",
  "qar1": "7805",
  "qar2": "7775",
  "iqd1": "1945",
  "hkd2": "3620",
  "hkd1": "3650",
  "sar1": "7575",
  "created": "November 10, 2021 00:01",
  "sgd2": "20930",
  "sgd1": "21030",
  "ounce": "1854.31",
  "weekday": "Wednesday",
  "mithqal": "5416000",
  "gol18": "1250288",
  "nok1": "3305",
  "nok2": "3290"
}
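For readers who would rather stay in Python, here is a rough translation of the curl command into a standard-library request object. This is a sketch under assumptions: the helper name `build_json_request` is hypothetical, only a subset of the captured headers is included, the cookie header is left out, and which of these details the server actually checks is not known from the capture. The request is built but not sent.

```python
import urllib.parse
import urllib.request

JSON_URL = "https://bonbast.com/json"

def build_json_request(token, user_agent):
    """Build (but do not send) a POST request shaped like the curl command."""
    # urlencode() percent-encodes the commas in the token, matching --data-raw.
    body = urllib.parse.urlencode({"data": token, "webdriver": "false"})
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "Origin": "https://bonbast.com",
        "Referer": "https://bonbast.com/",
        "User-Agent": user_agent,
    }
    return urllib.request.Request(
        JSON_URL, data=body.encode("ascii"), headers=headers, method="POST"
    )

# The token is the decoded form of the value captured in the curl command.
request = build_json_request(
    token="0d7e26d17fde20e86b760b00127132d4,fTtTZ,2021-11-10-16-38-37",
    user_agent="Mozilla/5.0 (X11; Linux x86_64)",
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen(request, timeout=10)` and decoding the body with `json.loads`.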

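There is bad news for you, however. The parameters in the curl request appear to expire after a while. I believe what is happening is that you are given a cookie when you visit the site. That cookie is your permission to make requests to the json endpoint, but it expires after a short time.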
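If the cookie guess is right, one way to pick up a fresh cookie before calling the endpoint is to route both requests through an opener that shares a cookie jar. The sketch below uses only the standard library; the helper name `make_cookie_opener` is hypothetical and the flow in the comments has not been confirmed against the site. It also does not address the `data` value in the request body, which may need refreshing as well.

```python
import http.cookiejar
import urllib.request

def make_cookie_opener():
    """Return an opener that stores cookies it receives and sends them back."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
# Hypothetical flow (not executed here, and not confirmed against the site):
#   opener.open("https://bonbast.com/", timeout=10)   # picks up fresh cookies
#   opener.open(json_request, timeout=10)             # replays them to /json
print(len(jar))  # 0 until a response sets a cookie
```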

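Scraping this page reliably will take a bit of work, more than fits in a StackOverflow question/answer. If you would like to discuss how to get that done, feel free to email me (the address is in my profile).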
