Scraping data from a website after filling in a form in Python

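I am trying to scrape data from http://www.educationboardresults.gov.bd/ using Python and BeautifulSoup.

The website first requires a form to be filled in. Once the form is submitted, the website returns the result. I have attached two screenshots.

Before submitting the form: https://prnt.sc/w4lo7i

After submitting: https://prnt.sc/w4lqd0

I have tried the following code: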

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': 2012,
'board': 'chittagong',
'roll': 102275,
'reg': 626948,
'button2': 'Submit',
 }
headers ={
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
    
    
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'html5lib')
#Scraping  and by passing Captcha

alltable =soup.findAll('td')
captcha = alltable[56].text.split('+')
for digit in captcha:
   value_one, value_two = int(captcha[0]), int(captcha[1])

resultdata['value_s'] = value_one+value_two
r=s.post(url, data=resultdata, headers= headers)

When I print r.content, it shows the HTML of the first page (the form). I want to scrape the second page (the result). Thanks in advance.

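You are sending the POST request to the wrong URL: the form submits to result.php, not to the home page. In addition, you need to add the two captcha numbers together and send the sum as the value of value_s. The selector below uses the :has() pseudo-class, so it will work if you are using bs4 version 4.7.0 or later. Try the following: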

import requests
from bs4 import BeautifulSoup

link = 'http://www.educationboardresults.gov.bd/'
result_url = 'http://www.educationboardresults.gov.bd/result.php'

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
 }

def get_number(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"html5lib")
    num = 0
    captcha_numbers = soup.select_one("tr:has(> td > #value_s) > td + td").text.split("+")
    for i in captcha_numbers:
        num+=int(i)
    return num

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    resultdata['value_s'] = get_number(s,link)
    r = s.post(result_url, data=resultdata)
    print(r.text)
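
The captcha arithmetic in get_number can also be pulled out into a small helper that fails loudly if the page layout changes and the selected text is no longer an addition. This is only a sketch: solve_captcha is a hypothetical name, and it assumes the captcha text has the form "a + b" as in the code above.

```python
import re

def solve_captcha(text):
    """Return the sum of the integers in a captcha string such as '7 + 3'.

    Hypothetical helper; assumes the captcha is a plain addition.
    """
    numbers = re.findall(r"\d+", text)
    if len(numbers) < 2:
        # Most likely the selector matched the wrong element.
        raise ValueError(f"no addition found in {text!r}")
    return sum(int(n) for n in numbers)

print(solve_captcha(" 7 + 3 "))  # 10
```

Inside get_number, the loop could then be replaced by a call to solve_captcha on the text of the selected cell.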

I have been working on this as well.

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': "2012",
'board': 'chittagong',
'roll': "102275",
'reg': "626948",

 }
headers ={
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'


}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd/index.php'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'lxml')
    # print(soup.prettify())
#Scraping  and by passing Captcha

    alltable =soup.findAll('td')
    captcha = alltable[56].text.split('+')
    print(captcha)
    value_one, value_two = int(captcha[0]), int(captcha[1])
    print(value_one, value_two)

    resultdata['value_s'] = value_one+value_two

    resultdata['button2'] = 'Submit'
    print(resultdata)
    r=s.post("http://www.educationboardresults.gov.bd/result.php", data=resultdata, headers= headers)
    soup = bs(r.content, 'lxml')
    print(soup.prettify())
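
Indexing alltable[56] breaks as soon as the site adds or removes a table cell. A possible alternative, sketched here under the assumption that the captcha cell contains nothing but an expression of the form "a + b", is to search the cell texts for that pattern. The name find_captcha_sum and the sample cell list are illustrative only.

```python
import re

# Matches a cell whose entire text is "<int> + <int>", ignoring whitespace.
CAPTCHA_PATTERN = re.compile(r"^\s*(\d+)\s*\+\s*(\d+)\s*$")

def find_captcha_sum(cell_texts):
    """Return a + b for the first cell that looks like 'a + b', else None."""
    for text in cell_texts:
        match = CAPTCHA_PATTERN.match(text)
        if match:
            return int(match.group(1)) + int(match.group(2))
    return None

# Made-up cell texts standing in for the real page.
cells = ["Roll", "Registration", "4 + 9", "Submit"]
print(find_captcha_sum(cells))  # 13
```

With the code above it could be called as find_captcha_sum(td.text for td in alltable), and a None result would signal that the form page no longer looks as expected.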
