Scraping data from a website after filling in a form with Python

Question

I have tried to scrape data from http://www.educationboardresults.gov.bd/ with Python and BeautifulSoup.

First, the website requires a form to be filled in. After the form is submitted, the website shows the results. I have attached two images here.

Before submitting the form: https://prnt.sc/w4lo7i

After submission: https://prnt.sc/w4lqd0

I have tried the following code:

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': 2012,
'board': 'chittagong',
'roll': 102275,
'reg': 626948,
'button2': 'Submit',
 }
headers ={
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
    
    
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'html5lib')
#Scraping  and by passing Captcha

alltable =soup.findAll('td')
captcha = alltable[56].text.split('+')
for digit in captcha:
   value_one, value_two = int(captcha[0]), int(captcha[1])

resultdata['value_s'] = value_one+value_two
r=s.post(url, data=resultdata, headers= headers)

When I print r.content, it shows the HTML of the first page. I want to scrape the second page. Thanks in advance.

Answer 1

You are sending the POST request to the wrong URL: it should go to result.php, not to the home page. You also need to add the two numbers shown in the captcha and send the sum as the value_s field. If you are using bs4 version 4.7.0 or later, the following selector will work for you, since it relies on the :has() pseudo-class CSS selector. Try the following:

import requests
from bs4 import BeautifulSoup

link = 'http://www.educationboardresults.gov.bd/'
result_url = 'http://www.educationboardresults.gov.bd/result.php'

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
 }

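def get_number(s, link):
    # Fetch the form page through the same session that will submit the form
    r = s.get(link)
    soup = BeautifulSoup(r.text, "html5lib")
    # In the row that contains the #value_s input, the second cell holds the captcha text
    captcha_numbers = soup.select_one("tr:has(> td > #value_s) > td + td").text.split("+")
    # Add up the numbers on either side of the "+"
    return sum(int(i) for i in captcha_numbers)

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    # The sum is sent as the value_s form field
    resultdata['value_s'] = get_number(s, link)
    # Post to result.php, not to the home page
    r = s.post(result_url, data=resultdata)
    print(r.text)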
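
The captcha arithmetic can also be checked on its own, without touching the site. This is a minimal sketch, assuming the captcha cell contains text such as "5 + 3"; the helper name solve_captcha is hypothetical and is not part of the site or of any library.

```python
def solve_captcha(cell_text):
    """Return the sum of the '+'-separated integers in the captcha text."""
    parts = cell_text.split("+")
    # int() ignores surrounding whitespace, so "5 " and " 3" both parse
    return sum(int(part) for part in parts)


print(solve_captcha("5 + 3"))  # prints 8
```

If the site ever changes the captcha to another operator, int() will raise a ValueError here rather than silently submitting a wrong value.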

Answer 2

I tried this as well.

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': "2012",
'board': 'chittagong',
'roll': "102275",
'reg': "626948",

 }
headers ={
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'


}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd/index.php'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'lxml')
    # print(soup.prettify())
    # Scrape the captcha numbers and compute the answer

    alltable = soup.findAll('td')
    captcha = alltable[56].text.split('+')
    print(captcha)
    value_one, value_two = int(captcha[0]), int(captcha[1])
    print(value_one, value_two)

    resultdata['value_s'] = value_one+value_two

    resultdata['button2'] = 'Submit'
    print(resultdata)
    r=s.post("http://www.educationboardresults.gov.bd/result.php", data=resultdata, headers= headers)
    soup = bs(r.content, 'lxml')
    print(soup.prettify())
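
This version passes year, roll and reg as strings, while Answer 1 passes them as integers. For a form-encoded body that difference should not matter. The sketch below uses the standard library's urllib.parse.urlencode as a stand-in for the encoding that requests applies to data=...; the captcha value 8 is a made-up placeholder.

```python
from urllib.parse import urlencode

# Same fields, once as integers and once as strings (value_s is a placeholder)
as_ints = {"year": 2012, "roll": 102275, "reg": 626948, "value_s": 8}
as_strs = {"year": "2012", "roll": "102275", "reg": "626948", "value_s": "8"}

body_from_ints = urlencode(as_ints)
body_from_strs = urlencode(as_strs)

print(body_from_ints)                    # year=2012&roll=102275&reg=626948&value_s=8
print(body_from_ints == body_from_strs)  # True
```

So the fix that matters in both answers is the POST target (result.php), not the quoting of the numeric fields.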
