
Scraping data from a website after filling in a form with Python

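I am trying to scrape data from http://www.educationboardresults.gov.bd/ using Python and BeautifulSoup.

First, the website requires a form to be filled in. After the form is submitted, the website shows the result. I have attached two screenshots here.

Before submitting the form: https://prnt.sc/w4lo7i

After submitting: https://prnt.sc/w4lqd0

I have tried the following code: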

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': 2012,
'board': 'chittagong',
'roll': 102275,
'reg': 626948,
'button2': 'Submit',
 }
headers ={
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
    
    
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'html5lib')
#Scraping  and by passing Captcha

alltable =soup.findAll('td')
captcha = alltable[56].text.split('+')
for digit in captcha:
   value_one, value_two = int(captcha[0]), int(captcha[1])

resultdata['value_s'] = value_one+value_two
r=s.post(url, data=resultdata, headers= headers)

When I print r.content, it shows the HTML of the first page. I want to scrape the second page. Thanks in advance.

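You are sending the POST request to the wrong URL: the form has to be posted to result.php, not to the home page. Also, you should add the two captcha numbers together and send the sum as value_s. The selector below uses a CSS pseudo-class (:has), so it will work for you if you are using bs4 version 4.7.0 or later. Try the following: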

import requests
from bs4 import BeautifulSoup

link = 'http://www.educationboardresults.gov.bd/'
result_url = 'http://www.educationboardresults.gov.bd/result.php'

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
 }

def get_number(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"html5lib")
    num = 0
    captcha_numbers = soup.select_one("tr:has(> td > #value_s) > td + td").text.split("+")
    for i in captcha_numbers:
        num+=int(i)
    return num

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    resultdata['value_s'] = get_number(s,link)
    r = s.post(result_url, data=resultdata)
    print(r.text)
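
The captcha arithmetic inside get_number can also be isolated so it is easy to check without hitting the site. The following is only a sketch: solve_captcha is a hypothetical helper name, and it assumes the captcha cell always contains plain non-negative integers joined by "+", as in the code above.

```python
import re

def solve_captcha(text):
    """Sum the integers found in a captcha string such as '7 + 4'.

    Hypothetical helper: assumes the captcha is a simple addition of
    non-negative integers, which is what the page showed in this thread.
    """
    numbers = re.findall(r"\d+", text)
    if len(numbers) < 2:
        # Fail loudly if the page layout changed and we grabbed the wrong cell.
        raise ValueError(f"unexpected captcha text: {text!r}")
    return sum(int(n) for n in numbers)
```

With this in place, get_number could return solve_captcha(...) on the text of the selected cell instead of looping over split("+"). Using a regular expression means stray whitespace or line breaks around the numbers do not matter.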

I was working on this too.

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': "2012",
'board': 'chittagong',
'roll': "102275",
'reg': "626948",

 }
headers ={
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'


}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd/index.php'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'lxml')
    # print(soup.prettify())
#Scraping  and by passing Captcha

    alltable =soup.findAll('td')
    captcha = alltable[56].text.split('+')
    print(captcha)
    value_one, value_two = int(captcha[0]), int(captcha[1])
    print(value_one, value_two)

    resultdata['value_s'] = value_one+value_two

    resultdata['button2'] = 'Submit'
    print(resultdata)
    r=s.post("http://www.educationboardresults.gov.bd/result.php", data=resultdata, headers= headers)
    soup = bs(r.content, 'lxml')
    print(soup.prettify())

