Scraping a website using inputs with Selenium and BeautifulSoup?

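I am trying to get the current "Euro Blue" exchange rate against the Argentine peso from the Western Union send-money website. Western Union is the only company that offers the real exchange rate, the one that is also traded on the street. If you are interested in how Argentina's second currency exchange market developed, look up Dollar-Blue.

My goal is to get the current exchange rate of the euro against the Argentine peso. When you visit the website, you first have to click the "Accept" button and then enter the name of the country you want to send money to. Only after this step is the exchange rate shown.

I first tried requests, but since it does not handle JavaScript I switched to Selenium, and I am now very close.

My code looks like this: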

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

WesternUnion = 'https://www.westernunion.com/de/en/web/send-money'

# create a new Chrome session
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(WesternUnion)

# accept the fraud warning
python_button = driver.find_element(By.ID, 'button-fraud-warning-accept')
python_button.click()

time.sleep(0.25)
# focus the country input field
python_button = driver.find_element(By.ID, 'country')
python_button.click()
time.sleep(0.15)
# type the destination country
text_area = driver.find_element(By.ID, 'country')
text_area.send_keys("Argentina")

# parse the rendered page
soup = BeautifulSoup(driver.page_source, 'lxml')

div = soup.find('div', id="OptimusApp")
div2 = soup.find('div', class_="text-center")

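The problem is that when I navigate with Python (screenshot: automated navigation with Python), the exchange rate is not shown, whereas when I do exactly the same thing manually (screenshot: manual navigation), it is shown.

I am very new to scraping and Python. Does anyone have a simple solution to this problem?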

I modified your code slightly, adding a couple of optional arguments, and on execution I got the following result:

  • Code block:

     from selenium import webdriver
     from selenium.webdriver.chrome.service import Service
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC

     options = webdriver.ChromeOptions()
     options.add_argument("start-maximized")
     options.add_experimental_option("excludeSwitches", ["enable-automation"])
     options.add_experimental_option('useAutomationExtension', False)
     driver = webdriver.Chrome(service=Service(r'C:\\WebDrivers\\chromedriver.exe'), options=options)
     driver.get('https://www.westernunion.com/de/en/web/send-money')
     # wait for the fraud warning and accept it
     WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#button-fraud-warning-accept"))).click()
     # wait for the country input, focus it and type the destination
     python_button = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#country")))
     python_button.click()
     python_button.send_keys("Argentina")
     # print the text of the exchange rate element once it is visible
     print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span#smoExchangeRate"))).text)
  • Console output:

     1.00 EUR = Argentine Peso (ARS)
  • Observation: My observation was similar to yours: the exchange rate is not shown.
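
One way to detect this blocked state in code is to check whether the text read from span#smoExchangeRate contains a number after the equals sign at all. The following is only a minimal sketch: it assumes the fully rendered text has the form "1.00 EUR = <number> Argentine Peso (ARS)", and both that layout and the helper name parse_rate are assumptions for illustration, not something confirmed by the site.

```python
import re
from typing import Optional

# Assumed layout of the complete text: "1.00 EUR = <number> Argentine Peso (ARS)".
# The pattern looks for a plain decimal number directly after the equals sign.
_RATE_PATTERN = re.compile(r"=\s*([0-9]+(?:\.[0-9]+)?)\b")


def parse_rate(text: str) -> Optional[float]:
    """Return the number after '=' as a float, or None if no number is present."""
    match = _RATE_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1))
```

With the console output above, parse_rate("1.00 EUR = Argentine Peso (ARS)") returns None, which a script could treat as "rate not rendered" instead of silently printing an incomplete string.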


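Deep Dive

While inspecting the DOM tree of the web page, you will find that some of the <script> and <link> tags refer to JavaScript and CSS files with the keyword dist in their path. For example: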

  • <script src="/content/wucom/dist/2.7.1.8f57d9b1/js/smo-configs/smo-config.de.js"></script>
  • <link rel="stylesheet" type="text/css" href="/content/wucom/dist/2.7.1.8f57d9b1/css/responsive_css.min.css">
  • <link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/liveform-web-vendor-f84dfc85d6.css">
  • <link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/kampyle/liveform-web-style-a4ce961d15.css">
  • <script src="https://nebula-cdn.kampyle.com/resources/dist/assets/js/liveform-web-vendor-919a2c71c3.js"></script>
  • <script src="https://nebula-cdn.kampyle.com/resources/dist/assets/js/liveform-web-app-2c4e3adeb6.js"></script>

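This is a clear indication that the website is protected by the bot management service provider Distil Networks, and that the navigation by ChromeDriver gets detected and subsequently blocked.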
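
The tags listed above can also be collected programmatically rather than by reading the DOM by hand. This is a minimal sketch using only the standard library's html.parser; the class and function names are hypothetical, and matching on the substring "/dist/" is an assumption based on the examples above. It only lists the references and does not by itself prove which protection service is in use.

```python
from html.parser import HTMLParser


class DistReferenceFinder(HTMLParser):
    """Collect src/href URLs of <script> and <link> tags that contain '/dist/'."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Only <script> and <link> tags are of interest here.
        if tag not in ("script", "link"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value and "/dist/" in value:
                self.matches.append((tag, value))


def find_dist_references(html):
    """Return a list of (tag, url) pairs for script/link URLs containing '/dist/'."""
    finder = DistReferenceFinder()
    finder.feed(html)
    return finder.matches
```

In the Selenium session from the code block above, this could be called as find_dist_references(driver.page_source).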


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


The exchange rate comes from https://www.westernunion.com/wuconnect/prices/catalog via a POST request. For example:

  • assuming a $payload variable containing:
{
  "header_request": {
    "version": "0.5",
    "request_type": "PRICECATALOG",
    "correlation_id": "web-x",
    "transaction_id": "web-x"
  },
  "sender": {
    "client": "WUCOM",
    "channel": "WWEB",
    "cty_iso2_ext": "DE",
    "curr_iso3": "EUR",
    "funds_in": "*",
    "send_amount": 300,
    "air_requested": "Y",
    "efl_type": "STATE",
    "efl_value": "CA"
  },
  "receiver": {
    "curr_iso3": "ARS",
    "cty_iso2_ext": "AR",
    "cty_iso2": "AR"
  }
}
  • and assuming an innocent-looking user agent
  • then curl -s 'https://www.westernunion.com/wuconnect/prices/catalog' --data-raw "$payload" | jq '.services_groups[0].pay_groups[0] | .fx_rate' would get it.
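
For readers following along in Python, the same call can be sketched with the standard library only. This is a rough equivalent under stated assumptions, not a verified client: the function names are hypothetical, the Content-Type header is an assumption (the curl command does not set one explicitly), and the response shape is taken solely from the jq filter above.

```python
import json
import urllib.request

CATALOG_URL = "https://www.westernunion.com/wuconnect/prices/catalog"


def build_payload(send_amount=300):
    """Return the request body shown above as a Python dict."""
    return {
        "header_request": {
            "version": "0.5",
            "request_type": "PRICECATALOG",
            "correlation_id": "web-x",
            "transaction_id": "web-x",
        },
        "sender": {
            "client": "WUCOM",
            "channel": "WWEB",
            "cty_iso2_ext": "DE",
            "curr_iso3": "EUR",
            "funds_in": "*",
            "send_amount": send_amount,
            "air_requested": "Y",
            "efl_type": "STATE",
            "efl_value": "CA",
        },
        "receiver": {"curr_iso3": "ARS", "cty_iso2_ext": "AR", "cty_iso2": "AR"},
    }


def build_request(payload, user_agent):
    """Build the POST request; the JSON Content-Type header is an assumption."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        CATALOG_URL,
        data=body,
        headers={"Content-Type": "application/json", "User-Agent": user_agent},
        method="POST",
    )


def extract_fx_rate(response_json):
    """Mirror the jq filter: .services_groups[0].pay_groups[0] | .fx_rate"""
    return response_json["services_groups"][0]["pay_groups"][0]["fx_rate"]


def fetch_fx_rate(user_agent):
    """Send the request and return the rate (performs a real network call)."""
    request = build_request(build_payload(), user_agent)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_fx_rate(json.load(response))
```

Keeping extract_fx_rate separate from the network call makes the parsing step easy to check against a saved response.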

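It used to work (until a couple of weeks ago).

But the endpoint is now protected: it expects a set of custom cryptographic headers computed by the browser, relying heavily on obfuscated and convoluted JavaScript. This is what they look like: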

X-NYUPe9Cs-a: IExHQTfwEnWwuyWbWjmR2fyBEQW9X9nnqFqIio78zzCKFA78iBDudN=NnOpQd=725d_urqfAN2sKK7UOdTnkCpUqFvQ9TF2nK=M1jDmrMBYy-4iq5kUqSdEN1PjBjEC=Nx742P1np7qAKK8q8qWd5UQIQ8Wqnqx51np7kIavPFenB9dSvnKou0A2nfv7qE-q7k_2EdNyuKffAYxcqbnjnCYIDfe=IKCc8JdPzpDecynafP1fVKq=z2SJCKiaMXu-Dxp2z5CpfznOPcs4WFH2D4C5JTTnDDUQ7vOPFVKnKCdcamPqOnK8wOQb9FYoxWs=Pksn4vmeC5Ia9EoVReH8uj0q_PRu2q522kk-9jnRTYJIP9VWP_50hhxPMds9eX_kAC2DbBnKzy24sICkO7bkkyAT82s5YuKECP=fnzXixxC8=81WX4jqnNBJ_qxbbqV=InUWmKYWimbUaB5qwOCA2iqSXNDw25PmHq8_2XEAx7nTnjkwYS2qvNBa8sAjxxHU8ibNFr_iiZH=4JuS2Q=RJrnTDonA1vFxKe812s-CMJ8HFay0VqrC2kQZVzCV2w0bqZyEuJksehxE22W8-Smd5V5XnvENHFcn72wkeN=boc=PIbv=XYNqEknrCyEX2r8BJvYCipnKdnkohrIvPovqfJMB7emybSTy2Eeu9h9VBrqYMW2NrXb2wc1kxC5WJAFv_cXE_vqsvRqeS-wYJ9vD1Y-1Cvo8RRqkFWAXuq1CBYXndSQ_A1e0aqO7sTB=nyKFd1=rJ4=z15z-qFMEQfy_x=qedJTzvWf8SE9yMqVCYUuSrhMnpEFdeJYiEdX-KS2In0-uZ0zzrn2qn27zY-jo7qkrvrq8V8v2aACd7PFEnMbCyUUUI-MdTcD8nCDiC2yuPOpbUcwID7Y1d=2aIubdAhErSn82C9FnSm9IVj8Z_WHwBvBPCI_o=_2pdRVk0jS5qYb_OjyVrrxqXnZOp9TVnAVnWZOWn798a8qhX-hYuFjJ-z84rzQRo2M70vHAMuNSMT_8yqkrujEr7JcyU2CmY1NKpev0w8R19227=qVqdemsq00nx-UAYz0=UYA2hT2IaqoqRie7Jbzjikb2snnnQynoHUpnYxRVs9ORc7I2MVhqqCVonnVk5Pi1xns2--iqqSKH8Rhium-nRcWurBu=TFiZ-5Qq-_WDiMQ5n7BqmAZkjWZM97MNkqakw8nq9CXav2fq4OqUok997VTOFkP7DEm-W5ckkwInQNMBNqTrK25DnSHRiyP5m5zqh1RjWp48f_9QCO2HiPS9A8j58zoF_8abn0H1qUERd_Cq8-7zqOnkEeAAWCywi18wUD5qfbQd22BJDNq90sMSbNVsJy0P2CBf-hq9fjSCB=uA5y8xT2-CJunFwUCx85ujxiq-bu5BAbSpqUCAXDP8iq02ET5-xRq7CD22n=E4keqVnKpzq2=RUKWP_jDnsiKRn4xxsRM0QYnbCC=m2KjCE9BjJ1nrn8EDvUS52bmaixqosRq5SNOPEHKyrQy8nqI9E9OAMYm5=TpVNvn-oqeDF_-jkcqIdyHqn1QYxaZbn4xVFqIOzQ9eV7A9QbC5zPcPeD=qqpqqK=YxNzKwTSCOnA70SrhiB2r1VkqKuuBJQYZoIC_87Mmuo8znpQnH29fI7Oh99sKO5aoEQIMOrIDwQDZvWqwwH=ZKnnn8T=5o9MTdDkpr472DPdqOEq8Ffii0q00r8OwkZX_oXY2UEKdCaX88zZamSqaY8iZzqiIYdeMjqMFKqVAv-82PxBWQv1Kr1OibYSh0QTp14BqBhEf-WKrVECI_y7517nZa8ndFpjznkfcnY2KufY0iFwnx2zx99iuUbF84nerZH88Rxx=pKBbsjeqJZ-0xZScnrn9hReJ--oh40mcxMXn1V0PzwcMaEACo0dWouDZeZYHViqd9RQAnso2DIF-wI-Pe_q5srKK8nmCZNI2hZqwjzOM7bwF4_4-S=9BzYF
DaYw0SknMJTq9VReaM297ir-CYsdM9VN29TpDRnC=8aQ5o9yXZpEDyfqmJuwzs7N7he8FPrfIdDVK5iaW8Jm8YcHnqnno7EHSqKeTRNuzkeHqcn0u87OX=ByhQMQJ4QacaxqqFVmPqQEHSVbx1PsQDq780PWDKbvK5PBMnZksBZm0VIOHxu_q2xnfPWsixuqaIm2sXn2Jz2yByvdNeT5r2F14zEaiiEFfNqICZ_DHCXpr2K4HURNd5n_vyJTe2UVakZE_9T01W9cFUxBOur0xfN0=h4vmOoUAnwISSDxc5EmAefWviW2PvqevpnnS7YuMPMY5aHi2c2RrP=i-mfPpKzRSHpAn82sJ9izMdWcWq=qI5O_UBm==vFHrFOzHQK8AH9qcRM8=KHpwyoV-b0WzuErxZhZmMV_iKors2JCAeWn-jn-q_Mrqau1Xz88nTBQFO=vnKPfFoqY9Z81KUqyAn2N5dwbnKWHUZh4Ke4OnyOr=22=rKZneB9PmQDUDq=97vOSqqNq=bHNriSf=xT48cXy7AqWOnncwEqwbVcA25ds8O8S0WI9=ipEfIyiiJ7qSMoHY=kn7rwiE94jsVx5n7Syj=m58Fqvi=HCFI0Bwf8byFhWbeJsAK5UaDqchCY5qC9n-OUqmeJHay8OAqm-HQPnP9qBfyd08nini0FsrdvHmru4qA=sK4OKmzcY_wSj8D8D2jBQWHF2avq4UP8-D2Ysh4C_bXXhqmqK9RPyuXRoeC5Oad-FmUXy_5F_r0OKEnrAMC
X-NYUPe9Cs-f: A_v7kP18AQAAbfq9_kCtmTqfX2Eq0otHnwqUQCck5dPjX88Nxz2rTVnAnVxYAcmzs1ScuAA7wH8AADQwAAAAAA==
X-NYUPe9Cs-b: -8qa21q
X-NYUPe9Cs-c: AOBWjv18AQAAqntYtdrBc9F0C0KawiRISfcOH_ruhEoV4NNn-IemnXnq5vi1
X-NYUPe9Cs-d: AAaixIihDKqOocqASZAQjICihCKHpi15Rub4tUEPqzn1Pxi1AAd7zRXqBBDKOTmM_r5nbhq
X-NYUPe9Cs-z: q

This set of headers is only valid for a limited time (no more than 24 hours AFAICT).
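
If such headers are copied from a browser session for experiments, it helps to record when they were captured so that stale sets are not reused. The sketch below is purely illustrative: the names are hypothetical, the 24-hour bound is only the upper estimate given above, and it assumes the X-NYUPe9Cs- prefix seen in the example stays the same, which is not confirmed.

```python
import time
from dataclasses import dataclass, field

# Prefix taken from the example headers above; assumed, not guaranteed, to be stable.
HEADER_PREFIX = "X-NYUPe9Cs-"
# Upper bound from the observation above ("no more than 24 hours").
MAX_AGE_SECONDS = 24 * 60 * 60


def select_protection_headers(all_headers):
    """Keep only the headers whose name starts with the prefix (case-insensitive)."""
    prefix = HEADER_PREFIX.lower()
    return {k: v for k, v in all_headers.items() if k.lower().startswith(prefix)}


@dataclass
class CapturedHeaders:
    headers: dict
    captured_at: float = field(default_factory=time.time)

    def is_fresh(self, now=None, max_age=MAX_AGE_SECONDS):
        """Return True while the capture is younger than max_age seconds."""
        now = time.time() if now is None else now
        return (now - self.captured_at) < max_age
```

A caller would check is_fresh() before attaching the headers to a request and capture a new set once it returns False.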

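I'd be curious to see whether someone can pinpoint where the logic lives (some crypto initialization vector is probably provided by cookies delivered during the initial page load). If so, Node.js could compute this set of headers.

Just to add some information on this: I managed to get it working for a while by using the westernunion.ru endpoint, which appeared to be unprotected (I could get this information without all of these headers). Unfortunately, the westernunion.ru endpoint has been removed, or at least no longer works. So a solution might be to find an API endpoint that is not yet protected.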
