
How to web scrape data using Python from an HTML table and store it in a CSV file. I am able to extract some parts but not the others

I am a beginner in web scraping and have become very interested in the process. I have set myself a project to keep me motivated until I complete it.

My Project

My aim is to write a Python program that goes to my university results page, scrapes the results for a range of students, and stores each student's mark in each subject in a .csv file or a comma-delimited text file. I already have the code working that submits the POST request to the .asp page. I would appreciate guidance on how to store the subject-wise details in separate columns, like this:

Desired Output:

Sl.no,Name,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,..etc
1,Jason,8,9,8,8,8,9..etc
2,Peter,6,8,9,8,7,7..etc
.
.
.

for a series of exam numbers.

Some sample data to try it out

The Results Website: http://result.pondiuni.edu.in/candidate.asp

Register Number: 15te1218

Degree: BTHEE

Exam: Second

Could anyone give me directions on how to accomplish this task? Please correct me where I am wrong; it would be great if you could guide me to a solution.

Can this be done in a much simpler way?

In the code below I try to print the name of the student, but it returns an empty list (it doesn't work). I also don't want the data returned as a list, because that detail occurs only once.

I do not know how to extract the subject names and the student's corresponding marks from the HTML table in the results page. Some help with this is needed.

Code:

import requests
from bs4 import BeautifulSoup 
import re
import csv

for x in xrange(44,47):

    EXAMNO ='15te12'+str(x)
    print EXAMNO

    data = {"txtregno": EXAMNO,
        "cmbdegree": r"BTHEE~\BTHEE\result.mdb", # use raw strings
        "cmbexamno": "B",
        "dpath": r"\BTHEE\result.mdb",
        "dname": "BTHEE",
        "txtexamno": "B"}

    results_page = requests.post("http://result.pondiuni.edu.in/ResultDisp.asp", data=data).content
    soup = BeautifulSoup(results_page, 'html.parser').prettify()
    regpa= "<!--Percentage / S.G.P.A : <b>(.+?) </b>&nbsp;&nbsp;&nbsp; -->"
    patterngpa =re.compile(regpa)
    gpa=re.findall(patterngpa,soup)
    print gpa
    rename="<font size=3 color=black>(.+?)</font>"
    patternname=re.compile(rename)
    name=re.findall(patternname,soup)
    print (name)

OUTPUT:

15te1244
[u'8.67']
15te1245
[u'8.8']
[]
15te1246
[u'7.8']
[]

It would be helpful if you could show me how to print it in the desired output format.

Thanks.
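
A likely reason the name pattern in the code above returns an empty list: BeautifulSoup re-serialises the page with quoted attribute values, and prettify() adds newlines and indentation, so a pattern written against the raw markup (<font size=3 color=black>) no longer matches. Asking the parsed tree directly avoids this, and find() returns a single tag rather than a list. The sketch below is written for Python 3 and runs on a made-up HTML fragment; the fragment, the parse_result helper, and the assumption that the first cell of a row is the subject and the last cell is the grade are all illustrative guesses, not the verified layout of the results page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the results page; the real markup may differ.
SAMPLE_HTML = """
<b>Name of the student : <b><font size=3 color=black>JASON</font></b></b>
<table>
  <tr><td width="66%"><font size=2>MATHEMATICS II</font></td><td><font size=2>A</font></td></tr>
  <tr><td width="66%"><font size=2>PHYSICS</font></td><td><font size=2>B</font></td></tr>
</table>
"""


def parse_result(html):
    """Return (student_name, {subject: grade}) for one results page."""
    soup = BeautifulSoup(html, "html.parser")

    # find() returns one tag (or None), so there is no list to unwrap.
    name_tag = soup.find("font", attrs={"size": "3"})
    name = name_tag.get_text(strip=True) if name_tag else None

    grades = {}
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:
            # Assumption: first cell is the subject, last cell is the grade.
            grades[cells[0]] = cells[-1]
    return name, grades


name, grades = parse_result(SAMPLE_HTML)
print(name, grades)
```

On the real page the selectors would need adjusting after inspecting the HTML, for example to skip header rows or to pick the right table when the page contains several.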

It took a lot of time, but I found a brute-force solution.

import requests
from bs4 import BeautifulSoup
import re

for x in xrange(44,47):
    EXAMNO ='15te12'+str(x)
    # Form fields posted to ResultDisp.asp for one register number
    data = {"txtregno": EXAMNO,
    "cmbdegree": r"BTHEE~\BTHEE\result.mdb", # use raw strings
    "cmbexamno": "B",
    "dpath": r"\BTHEE\result.mdb",
    "dname": "BTHEE",
    "txtexamno": "B"}
    results_page = requests.post("http://result.pondiuni.edu.in/ResultDisp.asp", data=data).content
    # Match against BeautifulSoup's re-serialised HTML, not the raw response
    string=str(BeautifulSoup(results_page, 'html.parser'))
    # S.G.P.A, which sits inside an HTML comment
    regpa= "<!--Percentage / S.G.P.A : <b>(.+?) </b>&nbsp;&nbsp;&nbsp; -->"
    print (re.search(regpa,string,re.M|re.I )).group(1) 
    # Student name
    regname="<b>Name of the student : <b><font color=\"black\" size=\"3\">(.*)</font></b></b>"
    print (re.search(regname,string,re.M|re.I )).group(1)
    # Subject names (the cells matched by the 66% pattern)
    regsub="66%\"><font color=\"black\" face=\"arial\" size=\"2\">(.*)</font></td>"
    matches=(re.findall(regsub,string,re.M|re.I ))

    for i in xrange(len(matches)):
        regsubm=">"+matches[i]+"</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"2%\"><font color=\"black\" face=\"arial\" size=\"2\">..</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"7%\"><font color=\"black\" face=\"arial\" size=\"2\">[\xc2]?[\xa0]?[\xc2]?[\xa0]?-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"1%\"><font color=\"black\" face=\"arial\" size=\"2\">-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"5%\"><font color=\"black\" face=\"arial\" size=\"2\">-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"5%\"><font color=\"black\" face=\"arial\" size=\"2\">(.*)</font>"
        matchesm=re.findall(regsubm,string,re.M)
        print matches[i],'--->',matchesm[0]
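
To get from the printed subject ---> grade pairs to the one-row-per-student layout the question asks for, the pairs could be collected into a dictionary per student and written with csv.DictWriter. This is a minimal Python 3 sketch, assuming the results have already been gathered as (name, {subject: grade}) pairs; write_results and the sample data are hypothetical and only mirror the desired output shown in the question.

```python
import csv
import io


def write_results(results, out):
    """Write one row per student with one column per subject.

    `results` is a list of (name, {subject: grade}) pairs. That shape is an
    assumption made for this sketch, not something the results page dictates.
    """
    # Collect every subject seen, in first-seen order, to build the header.
    subjects = []
    for _, grades in results:
        for subject in grades:
            if subject not in subjects:
                subjects.append(subject)

    # restval fills the cell when a student has no grade for a subject.
    writer = csv.DictWriter(out, fieldnames=["Sl.no", "Name"] + subjects, restval="")
    writer.writeheader()
    for serial, (name, grades) in enumerate(results, start=1):
        row = {"Sl.no": serial, "Name": name}
        row.update(grades)
        writer.writerow(row)


# Made-up data in the shape of the question's desired output.
sample = [
    ("Jason", {"Subject1": 8, "Subject2": 9, "Subject3": 8}),
    ("Peter", {"Subject1": 6, "Subject2": 8, "Subject3": 9}),
]
buffer = io.StringIO()
write_results(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with open("results.csv", "w", newline="") so the csv module controls the line endings.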
