Retrieving essential data from a webpage using Python
Below is part of a webpage that I downloaded using urlretrieve (urllib). I only want to write the data from this page into another text file, like this:
ENGINEERING MATHEMATICS-IV, 4 ,36 ,40 , F
ENVIRONMENTAL STUDIES, 47, 36, 83 , p
...
..
.
Which module should I use, and which commands?
Thanks in advance.. :)
<td>ENGINEERING MATHEMATICS-IV</td>
<td align=center>4</td>
<td align=center>36</td>
<td align=center>40</td>
<td align=center>F</td>
</tr>
<tr align=left bgcolor='#FFFFFF'> <td>EIT402 </td>
<td>ENVIRONMENTAL STUDIES</td>
<td align=center>47</td>
<td align=center>36</td>
<td align=center>83</td>
<td align=center>P</td>
</tr>
<tr align=left bgcolor='#DA9292'> <td>EIT403 </td>
<td>SYSTEM PROGRAMMING</td>
<td align=center>40</td>
<td align=center>36</td>
<td align=center>76</td>
<td align=center>P</td>
</tr>
<tr align=left bgcolor='#FFFFFF'> <td>EIT404 </td>
<td>MICROPROCESSOR BASED DESIGN</td>
<td align=center>3</td>
<td align=center>35</td>
<td align=center>38</td>
<td align=center>F</td>
</tr>
<tr align=left bgcolor='#DA9292'> <td>EIT405 </td>
<td>PROGRAMMING PARADIGMS</td>
<td align=center>42</td>
<td align=center>36</td>
<td align=center>78</td>
<td align=center>P</td>
</tr>
<tr align=left bgcolor='#FFFFFF'> <td>EIT406 </td>
<td>COMMUNICATION SYSTEMS</td>
<td align=center>9</td>
<td align=center>35</td>
<td align=center>44</td>
<td align=center>F</td>
</tr>
<tr align=left bgcolor='#DA9292'> <td>EIT407 </td>
<td>DATA STRUCTURE LAB</td>
<td align=center>10</td>
<td align=center>35</td>
<td align=center>45</td>
<td align=center>F</td>
</tr>
<tr align=left bgcolor='#FFFFFF'> <td>EIT408 </td>
<td>PROGRAMMING ENVIRONMENTS LAB</td>
<td align=center>20</td>
<td align=center>25</td>
<td align=center>45</td>
<td align=center>F</td>
</tr>
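# BeautifulSoup 4 (package "bs4"); the original answer used the Python 2-only BeautifulSoup 3
from bs4 import BeautifulSoup

def main():
    infname = 'htmltable.html'
    outfname = 'courses.txt'

    # Read the saved page from disk
    with open(infname) as inf:
        html = inf.read()

    doc = BeautifulSoup(html, 'html.parser')
    # Assumes the data sits in a table with id="content"
    table = doc.find('table', {'id': 'content'})

    with open(outfname, 'w') as outf:
        for row in table.find_all('tr'):
            # Each row is expected to hold six cells: course code, course name, three numbers and a P/F flag
            code, name, a, b, c, d = [cell.get_text().strip() for cell in row.find_all('td')]
            # The course code is dropped from the output
            outf.write("{name}, {a}, {b}, {c}, {d}\n".format(name=name, a=a, b=b, c=c, d=d))

if __name__ == "__main__":
    main()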
If you assume the saved page begins like this:
<html><head><title>Data Table</title></head><body>
<table id='content'>
<tr align=left bgcolor='#FFFFFF'> <td>EIT402 </td>
<td>ENGINEERING MATHEMATICS-IV</td>
<td align=center>4</td>
<td align=center>36</td>
<td align=center>40</td>
<td align=center>F</td>
</tr>
the result is:
ENGINEERING MATHEMATICS-IV, 4, 36, 40, F
ENVIRONMENTAL STUDIES, 47, 36, 83, P
SYSTEM PROGRAMMING, 40, 36, 76, P
MICROPROCESSOR BASED DESIGN, 3, 35, 38, F
PROGRAMMING PARADIGMS, 42, 36, 78, P
COMMUNICATION SYSTEMS, 9, 35, 44, F
DATA STRUCTURE LAB, 10, 35, 45, F
PROGRAMMING ENVIRONMENTS LAB, 20, 25, 45, F
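If installing a third-party parser is not an option, roughly the same extraction can be sketched with Python 3's standard-library html.parser module. This is a swapped-in technique, not the method used in the answer. The names RowCollector and extract_courses are hypothetical, and the sketch assumes that every row of interest has exactly six cells (code, name, three numbers, flag); rows with any other cell count are skipped rather than raising an error.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_courses(html):
    """Return one 'name, a, b, c, d' line per six-cell row, dropping the code column."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [", ".join(row[1:]) for row in parser.rows if len(row) == 6]


sample = """
<table id='content'>
<tr align=left bgcolor='#FFFFFF'> <td>EIT402 </td>
<td>ENVIRONMENTAL STUDIES</td>
<td align=center>47</td>
<td align=center>36</td>
<td align=center>83</td>
<td align=center>P</td>
</tr>
</table>
"""

for line in extract_courses(sample):
    print(line)  # prints: ENVIRONMENTAL STUDIES, 47, 36, 83, P
```

Unlike the BeautifulSoup version, this sketch does not look for a table with a particular id; it collects rows from anywhere in the page, so the six-cell filter is what keeps unrelated rows out.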