
Scraping data from a website after filling in a form with Python

I have tried to scrape data from http://www.educationboardresults.gov.bd/ with Python and BeautifulSoup.

First, the website requires a form to be filled in. After the form is submitted, the website shows the results. I have attached two screenshots.

Before submitting the form: https://prnt.sc/w4lo7i

After submission: https://prnt.sc/w4lqd0

I have tried the following code:

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
}
headers = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'html5lib')
#Scraping  and by passing Captcha

alltable =soup.findAll('td')
captcha = alltable[56].text.split('+')
for digit in captcha:
   value_one, value_two = int(captcha[0]), int(captcha[1])

resultdata['value_s'] = value_one+value_two
r=s.post(url, data=resultdata, headers= headers)

When I print r.content, it shows the HTML of the first page. I want to scrape the second page. Thanks in advance.

You are sending the POST request to the wrong URL: the form submits to result.php, not to the site root. You also need to add the two numbers shown next to the value_s field and send their sum as value_s. The selector below uses the :has() pseudo-class, so it requires bs4 version 4.7.0 or later. Try the following:

import requests
from bs4 import BeautifulSoup

link = 'http://www.educationboardresults.gov.bd/'
result_url = 'http://www.educationboardresults.gov.bd/result.php'

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
 }

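def get_number(s, link):
    # Fetch the form page with the same session that will later submit the form.
    r = s.get(link)
    soup = BeautifulSoup(r.text, "html5lib")
    # In the row containing the #value_s input, the second cell holds the
    # two captcha numbers joined by "+".
    captcha_numbers = soup.select_one("tr:has(> td > #value_s) > td + td").text.split("+")
    return sum(int(i) for i in captcha_numbers)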

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    resultdata['value_s'] = get_number(s,link)
    r = s.post(result_url, data=resultdata)
    print(r.text)
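
The "add the two numbers" step can also be separated from the network code so it is easy to check on its own. The following is only a sketch: solve_captcha is a hypothetical helper name, and it assumes the captcha cell text always contains exactly two non-negative integers (for example "6 + 3"), which is what the page showed at the time of the question.

```python
import re


def solve_captcha(text):
    """Return the sum of the two integers found in a captcha string like '6 + 3'.

    Hypothetical helper: assumes exactly two non-negative integers are present.
    """
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) != 2:
        raise ValueError(f"expected two numbers in captcha text, got {text!r}")
    return sum(numbers)


if __name__ == "__main__":
    # Surrounding whitespace and missing spaces around '+' are tolerated.
    print(solve_captcha(" 6 + 3 "))
    print(solve_captcha("12+7"))
```

In the code above, the text passed in would be the .text of the element returned by select_one. Raising an error on unexpected input makes it obvious when the page layout changes, instead of silently posting a wrong value_s.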

I tried this as well:

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': "2012",
    'board': 'chittagong',
    'roll': "102275",
    'reg': "626948",
}
headers = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd/index.php'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'lxml')
    # print(soup.prettify())
    # Scrape the captcha numbers and compute the answer

    alltable =soup.findAll('td')
    captcha = alltable[56].text.split('+')
    print(captcha)
    value_one, value_two = int(captcha[0]), int(captcha[1])
    print(value_one, value_two)

    resultdata['value_s'] = value_one+value_two

    resultdata['button2'] = 'Submit'
    print(resultdata)
    r=s.post("http://www.educationboardresults.gov.bd/result.php", data=resultdata, headers= headers)
    soup = bs(r.content, 'lxml')
    print(soup.prettify())
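
Once the result page comes back, the remaining step is to pull the values out of its HTML instead of printing everything. Below is a rough sketch of that step, not something verified against the real result page: the sample markup, the class name TableRows and the function extract_rows are all made up for illustration. It uses the standard-library html.parser rather than BeautifulSoup so it runs without extra installs, and it handles nested tables only in a simple way (the innermost rows are kept).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return all table rows found in html as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    # Hypothetical markup; the real result page may be structured differently.
    sample = """
    <table>
      <tr><td>Roll No</td><td>102275</td></tr>
      <tr><td>Board</td><td> CHITTAGONG </td></tr>
    </table>
    """
    for row in extract_rows(sample):
        print(row)
```

With the code above, extract_rows(r.text) could be called on the response of the POST to result.php. If you already have a soup object, looping over soup.find_all('tr') and reading each td gives the same kind of result.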

Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site or the original URL.

 