How to web scrape data from an HTML table using Python and store it in a CSV file? I am able to extract some parts but not the others
I am a beginner in web scraping and have become very interested in the process. I set myself a project that would keep me motivated until I completed it.
My Project
My aim is to write a Python program that goes to my university's results page, scrapes the results for a range of students, and stores each student's marks in each subject in a .csv file (or another comma-delimited text file). I have got the code working to submit the POST request to the .asp page. I would appreciate it if you could guide me on how to store the subject-wise details in separate columns, like:
Desired Output:
Sl.no,Name,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,..etc
1,Jason,8,9,8,8,8,9..etc
2,Peter,6,8,9,8,7,7..etc
.
.
.
for a series of exam numbers.
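A minimal sketch of how rows in this layout could be written with the standard `csv` module, assuming Python 3. The student records and subject names are made up for illustration, and `write_results` is a hypothetical helper, not part of any library. Students may not all have the same subjects, so the header is built from every subject seen and missing marks are left blank.

```python
import csv
import io

# Hypothetical scraped data: names, subjects and marks are made up.
students = [
    {"Name": "Jason", "marks": {"Subject1": "8", "Subject2": "9"}},
    {"Name": "Peter", "marks": {"Subject1": "6", "Subject3": "7"}},
]


def write_results(students, out):
    """Write one CSV row per student, one column per subject."""
    # Collect every subject in first-seen order to build the header.
    subjects = []
    for student in students:
        for subject in student["marks"]:
            if subject not in subjects:
                subjects.append(subject)

    # restval fills in subjects a student did not take.
    writer = csv.DictWriter(
        out, fieldnames=["Sl.no", "Name"] + subjects, restval=""
    )
    writer.writeheader()
    for serial, student in enumerate(students, start=1):
        row = {"Sl.no": serial, "Name": student["Name"]}
        row.update(student["marks"])
        writer.writerow(row)


# An in-memory buffer stands in for a real file here.
buf = io.StringIO()
write_results(students, buf)
csv_text = buf.getvalue()
print(csv_text)
```

To write to disk instead, the buffer could be replaced with `open("results.csv", "w", newline="")`; the file name is only an example.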
Some sample data to try it out:
Results website: http://result.pondiuni.edu.in/candidate.asp
Register Number: 15te1218
Degree: BTHEE
Exam: Second
Could anyone give me directions on how to accomplish this task? Please correct me where I am wrong; it would be great if you could guide me to a solution.
Can this be done in a much simpler way?
In the code below you can see that I have tried to print out the name of the student, but it returns an empty list (it doesn't work). I also don't want the data returned as a list, because there is only one occurrence of that detail.
I do not know how to extract the subject names and the corresponding marks of that student from the HTML table in the results page. I would appreciate some help with this.
Code:
import requests
from bs4 import BeautifulSoup
import re
import csv

for x in xrange(44,47):
    EXAMNO ='15te12'+str(x)
    print EXAMNO
    data = {"txtregno": EXAMNO,
            "cmbdegree": r"BTHEE~\BTHEE\result.mdb", # use raw strings
            "cmbexamno": "B",
            "dpath": r"\BTHEE\result.mdb",
            "dname": "BTHEE",
            "txtexamno": "B"}
    results_page = requests.post("http://result.pondiuni.edu.in/ResultDisp.asp", data=data).content
    soup = BeautifulSoup(results_page, 'html.parser').prettify()
    regpa= "<!--Percentage / S.G.P.A : <b>(.+?) </b> -->"
    patterngpa =re.compile(regpa)
    gpa=re.findall(patterngpa,soup)
    print gpa
    rename="<font size=3 color=black>(.+?)</font>"
    patternname=re.compile(rename)
    name=re.findall(patternname,soup)
    print (name)
OUTPUT:
15te1244
[u'8.67']
15te1245
[u'8.8']
[]
15te1246
[u'7.8']
[]
It would be helpful if you could show me how to print it in the desired output format.
Thanks.
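A likely reason the name pattern finds nothing: the regex is written against unquoted attributes (`size=3 color=black`), but it is run on the output of `prettify()`, which re-serializes the markup with quoted attributes and puts the text on its own line. The sketch below reproduces this on a made-up fragment, assuming Python 3 and that the raw page contains a tag like the one the pattern expects. It also shows that asking the parser for the tag returns a single value instead of a list.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment in the style the question's regex expects.
raw = "<b>Name of the student : <b><font size=3 color=black>JASON</font></b></b>"

pattern = re.compile(r"<font size=3 color=black>(.+?)</font>")

# prettify() rewrites the markup: attributes get quotes, text gets its own line.
pretty = BeautifulSoup(raw, "html.parser").prettify()

raw_matches = pattern.findall(raw)        # the pattern fits the raw source
pretty_matches = pattern.findall(pretty)  # but not the re-serialized markup

# Querying the parsed tree avoids depending on how the markup is serialized,
# and yields one string rather than a list of matches.
font = BeautifulSoup(raw, "html.parser").find("font", attrs={"size": "3"})
name = font.get_text(strip=True) if font else None

print(raw_matches, pretty_matches, name)
```

If a regex is still preferred, `re.search(...)` followed by `.group(1)` returns a single string instead of a list, provided the result is checked for `None` first.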
It took a lot of time, but I found a brute-force solution.
# Python 2 code (uses xrange and the print statement)
import requests
from bs4 import BeautifulSoup
import re
import csv

for x in xrange(44,47):
    EXAMNO ='15te12'+str(x)
    data = {"txtregno": EXAMNO,
            "cmbdegree": r"BTHEE~\BTHEE\result.mdb", # use raw strings
            "cmbexamno": "B",
            "dpath": r"\BTHEE\result.mdb",
            "dname": "BTHEE",
            "txtexamno": "B"}
    results_page = requests.post("http://result.pondiuni.edu.in/ResultDisp.asp", data=data).content
    soup = BeautifulSoup(results_page, 'html.parser').prettify()
    # Match against the markup as re-serialized by BeautifulSoup (quoted attributes)
    string=str(BeautifulSoup(results_page, 'html.parser'))
    regpa= "<!--Percentage / S.G.P.A : <b>(.+?) </b> -->"
    print (re.search(regpa,string,re.M|re.I )).group(1)
    regname="<b>Name of the student : <b><font color=\"black\" size=\"3\">(.*)</font></b></b>"
    print (re.search(regname,string,re.M|re.I )).group(1)
    # Subject names sit in the cells with width 66%
    regsub="66%\"><font color=\"black\" face=\"arial\" size=\"2\">(.*)</font></td>"
    matches=(re.findall(regsub,string,re.M|re.I ))
    for i in xrange(len(matches)):
        # For each subject, capture the fifth cell after the subject name;
        # re.escape keeps characters in the subject name from being read as regex syntax
        regsubm=">"+re.escape(matches[i])+"</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"2%\"><font color=\"black\" face=\"arial\" size=\"2\">..</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"7%\"><font color=\"black\" face=\"arial\" size=\"2\">[\xc2]?[\xa0]?[\xc2]?[\xa0]?-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"1%\"><font color=\"black\" face=\"arial\" size=\"2\">-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"5%\"><font color=\"black\" face=\"arial\" size=\"2\">-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"5%\"><font color=\"black\" face=\"arial\" size=\"2\">(.*)</font>"
        matchesm=re.findall(regsubm,string,re.M)
        print matches[i],'--->',matchesm[0]
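The same extraction can probably be done without the long regex by walking the parsed table. The sketch below assumes Python 3 and a page laid out the way the patterns above imply: the student name in a `<font size="3">` tag, each subject in a `<td width="66%">` cell, and the grade in the fifth cell after it. The sample HTML, subject names and grades are made up, and `parse_result` is a hypothetical helper; the real page may need different selectors.

```python
from bs4 import BeautifulSoup

# Made-up sample shaped like the markup the regexes above match.
SAMPLE = """
<b>Name of the student : <b><font color="black" size="3">JASON</font></b></b>
<table>
<tr>
<td width="66%"><font size="2">ENGINEERING MATHEMATICS</font></td>
<td width="2%"><font size="2">04</font></td>
<td width="7%"><font size="2">-</font></td>
<td width="1%"><font size="2">-</font></td>
<td width="5%"><font size="2">-</font></td>
<td width="5%"><font size="2">A</font></td>
</tr>
<tr>
<td width="66%"><font size="2">PHYSICS</font></td>
<td width="2%"><font size="2">03</font></td>
<td width="7%"><font size="2">-</font></td>
<td width="1%"><font size="2">-</font></td>
<td width="5%"><font size="2">-</font></td>
<td width="5%"><font size="2">B</font></td>
</tr>
</table>
"""


def parse_result(html):
    """Return (student name, {subject: grade}) from one results page."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the name is the only <font size="3"> on the page.
    font = soup.find("font", attrs={"size": "3"})
    name = font.get_text(strip=True) if font else None

    marks = {}
    for cell in soup.find_all("td", attrs={"width": "66%"}):
        following = cell.find_next_siblings("td")
        # Assumption: the grade is the fifth cell after the subject cell.
        if len(following) >= 5:
            marks[cell.get_text(strip=True)] = following[4].get_text(strip=True)
    return name, marks


name, marks = parse_result(SAMPLE)
print(name, marks)
```

Calling `parse_result` on each `results_page` in the loop and collecting the returned pairs would give one record per register number, ready to be written out as CSV rows.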