Parsing specific data using Beautiful Soup
I have a webpage that contains tabular data. The following is the HTML code for the table:
<table class="confluenceTable">
<tbody>
<tr>
<th class="confluenceTh">
<p>Prefix</p>
</th>
<th class="confluenceTh">
<p>Group</p>
</th>
<th class="confluenceTh">
<p>Contact</p>
</th>
<th class="confluenceTh">
<p>Dev/Test Lab</p>
</th>
<th class="confluenceTh">
<p>Performance</p>
</th>
</tr>
<tr>
<td class="confluenceTd">
<p> </p>
</td>
<td class="confluenceTd">
<p> </p>
</td>
<td class="confluenceTd">
<p> </p>
</td>
</tr>
<tr>
<th class="confluenceTh">
<p> </p>
</th>
<th class="confluenceTh">
<p> </p>
</th>
<th class="confluenceTh">
<p> </p>
</th>
</tr>
<tr>
<td class="confluenceTd">
<p>SEF00</p>
</td>
<td class="confluenceTd">
<p>APTRA Vision</p>
</td>
<td class="confluenceTd">
<p> </p>
</td>
<td class="confluenceTd">
<p><a href="/somepage">VCD Lab</a> , <a href="/somepage">Test Lab</a></p>
</td>
<td class="confluenceTd">
<p><a href="/display">Perf Lab</a></p>
</td>
</tr>
<tr>
<td class="confluenceTd">
<p>SEF01</p>
</td>
<td class="confluenceTd">
<p>In-Person Bill Payment</p>
</td>
<td class="confluenceTd">
<p>Swamy PKV</p>
</td>
How can I write my Python code so that I get only the data underneath the Prefix and Group columns? So far I have tried this:
import requests
from bs4 import BeautifulSoup

ii = 1
datatocheck = []
data = requests.get(url, auth=(username, password))
sample = data.content
soup = BeautifulSoup(sample, 'html.parser')
for row in soup.find_all('tr')[1:154]:
    datatocheck.append(row.get_text(separator='\t'))
while ii <= 152:
    print(datatocheck[ii][0:30])
    ii += 1
This gives me the following output:
SEF00 APTRA Vision VCD Lab
SEF01 In-Person Bill Payment S
But I just want SEF00 (prefix) and APTRA Vision (group), and SEF01 and In-Person Bill Payment, not the other columns.
Also, I can't change the HTML code.
How about if you do: if SEF00 in ii:
It may print just SEF00.
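from bs4 import BeautifulSoup

# html holds the page source, e.g. data.content from the question
soup = BeautifulSoup(html, 'lxml')
for row in soup.find_all('tr')[3:]:  # skip the header row and the two empty rows
    tds = [i.get_text(strip=True) for i in row.find_all('td')]
    print(tds[0], tds[1])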
out:
SEF00 APTRA Vision
SEF01 In-Person Bill Payment
Just get all the td elements in each row, put them in a list, then slice it.
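If the number of empty rows or the column order on the real page might differ, a variation on the same idea is to look the columns up by header text instead of hard-coding positions. The following is only a sketch: the function name extract_columns is hypothetical, the html string is an abbreviated copy of the table from the question, and it assumes the header cells sit in the first row. It uses the built-in 'html.parser' so lxml is not required.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the table from the question (assumption: the real page
# has the same layout, with the header cells in the first row).
html = """
<table class="confluenceTable"><tbody>
<tr><th class="confluenceTh"><p>Prefix</p></th><th class="confluenceTh"><p>Group</p></th><th class="confluenceTh"><p>Contact</p></th></tr>
<tr><td class="confluenceTd"><p> </p></td><td class="confluenceTd"><p> </p></td><td class="confluenceTd"><p> </p></td></tr>
<tr><th class="confluenceTh"><p> </p></th><th class="confluenceTh"><p> </p></th><th class="confluenceTh"><p> </p></th></tr>
<tr><td class="confluenceTd"><p>SEF00</p></td><td class="confluenceTd"><p>APTRA Vision</p></td><td class="confluenceTd"><p> </p></td></tr>
<tr><td class="confluenceTd"><p>SEF01</p></td><td class="confluenceTd"><p>In-Person Bill Payment</p></td><td class="confluenceTd"><p>Swamy PKV</p></td></tr>
</tbody></table>
"""


def extract_columns(html, wanted=("Prefix", "Group")):
    """Hypothetical helper: return one tuple per data row with only the wanted columns."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        return []
    # Map header text to column position using the first row.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    indexes = [headers.index(name) for name in wanted]
    result = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Skip th-only rows and rows too short to hold every wanted column.
        if len(cells) <= max(indexes):
            continue
        values = tuple(cells[i] for i in indexes)
        # Skip spacer rows whose wanted cells are all empty.
        if any(values):
            result.append(values)
    return result


for prefix, group in extract_columns(html):
    print(prefix, group)
```

On the sample above this prints the same two lines as the slice-based version. The difference is that empty rows are skipped wherever they appear, and a missing header name raises a ValueError instead of silently printing the wrong column.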