Python Web Scraping HTML Table using Beautiful Soup
This is my HTML table.
<table class="table_c" id="myd">
<tbody>
<tr class="grp">
<th class="col>MyGrp1</th>
</tr>
<tr class="item">
<th class="col label" scope="row">Item0.1 Header</th>
<td class="col data" data-th="MyGrp1">Item0.1 Value</td>
</tr>
<tr class="grp">
<th class="col label" colspan="2" scope="row">MyGrp</th>
</tr>
<tr class="item">
<th class="col label" scope="row">Item1.1 Header</th>
<td class="col data" >Item1.1 Value</td>
</tr>
<tr class="item">
<th class="col label" scope="row">Item1.2 Header</th>
<td class="col data">Item1.2 Value</td>
</tr>
<tr class="item">
<th class="col label" scope="row">Item1.3 Header</th>
<td class="col data"">Item1.2 Value</td>
</tr>
</tbody>
</table>
I want the table to be parsed as below:
MyGrp1<new line>
<tab char>Item0.1 Header<tab char>Item0.1 Value<new line>
MyGrp2<new line>
<tab char>Item1.1 Header<tab char>Item1.1 Value<new line>
<tab char>Item1.2 Header<tab char>Item1.2 Value<new line>
<tab char>Item1.3 Header<tab char>Item1.3 Value<new line>
I can get all the 'tr' or 'th' nodes, but I don't know how to iterate over the table node by node. How can I scrape the HTML table and get the result above?
I used pandas for this:
import pandas as pd
import html5lib
string="""<table class="table_c" id="myd">
<tbody>
<tr class="grp">
<th class="col">MyGrp1</th>
</tr>
<tr class="item">
<th class="col label" scope="row">Item0.1 Header</th>
<td class="col data" data-th="MyGrp1">Item0.1 Value</td>
</tr>
<tr class="grp">
<th class="col label" colspan="2" scope="row">MyGrp</th>
</tr>
<tr class="item">
<th class="col label" scope="row">Item1.1 Header</th>
<td class="col data" >Item1.1 Value</td>
</tr>
<tr class="item">
<th class="col label" scope="row">Item1.2 Header</th>
<td class="col data">Item1.2 Value</td>
</tr>
<tr class="item">
<th class="col label" scope="row">Item1.3 Header</th>
<td class="col data"">Item1.2 Value</td>
</tr>
</tbody>
</table>"""
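# Recent pandas versions expect a file-like object rather than a raw HTML string.
from io import StringIO
# read_html returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(string))
print(df)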
Output:
[                0              1
0          MyGrp1            NaN
1  Item0.1 Header  Item0.1 Value
2           MyGrp            NaN
3  Item1.1 Header  Item1.1 Value
4  Item1.2 Header  Item1.2 Value
5  Item1.3 Header  Item1.2 Value]
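The frame above still mixes group rows and item rows in one column. Below is a minimal sketch of one way to turn it into the requested layout, assuming that a row whose second column is NaN is a group header (an assumption read off the output shown, not something pandas guarantees). The frame is built by hand from the printed values so the snippet runs without an HTML parser, and `to_grouped_text` is a hypothetical helper name.

```python
import pandas as pd

# Hand-built copy of the frame printed above (values taken from that output).
df = pd.DataFrame(
    [
        ["MyGrp1", None],
        ["Item0.1 Header", "Item0.1 Value"],
        ["MyGrp", None],
        ["Item1.1 Header", "Item1.1 Value"],
        ["Item1.2 Header", "Item1.2 Value"],
        ["Item1.3 Header", "Item1.2 Value"],
    ]
)


def to_grouped_text(frame):
    """Render group rows flush left and item rows as <tab>label<tab>value."""
    lines = []
    for label, value in frame.itertuples(index=False, name=None):
        if pd.isna(value):  # assumption: no value cell means a group header
            lines.append(str(label))
        else:
            lines.append("\t{}\t{}".format(label, value))
    return "\n".join(lines) + "\n"


grouped_text = to_grouped_text(df)
print(grouped_text)
```

The heuristic would misfire on an item row with an empty value cell, so checking the row's class in the HTML is the more reliable route when the markup is available.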
But I don't know how to iterate the table node by node.
BeautifulSoup's find_all provides you with a sequence of tag objects that you can loop through.
Also, please note that your HTML table has syntax problems:
<th class="col>MyGrp1</th> - missing closing quote
<td class="col data"">Item1.2 Value</td> - doubled quote
So, provided that sample is your HTML table as a string and that it contains valid HTML, here's a sample of what you could do:
from bs4 import BeautifulSoup as bs

soup = bs(sample, 'lxml-html')  # needs lxml installed; the built-in 'html.parser' also works
trs = soup.find_all('tr')
for tr in trs:
    classes = tr.get('class') or []  # rows without a class attribute give None
    if 'grp' in classes:
        # group row: print the group name on its own line
        print(tr.th.text)
    elif 'item' in classes:
        # item row: tab-indented label and value
        label = tr.th.text
        value = tr.td.text
        print('\t{}\t{}'.format(label, value))
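If the grouped data is needed for more than printing, the same loop can collect it into a structure first. This is a minimal sketch rather than a definitive implementation. It assumes bs4 is installed and uses the built-in 'html.parser' to avoid the lxml dependency. The `sample` string here is a shortened copy of the question's table with the quoting errors fixed and the second group named MyGrp2 as in the desired output, and `parse_groups` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened, corrected copy of the question's table (illustrative only).
sample = """
<table class="table_c" id="myd">
  <tbody>
    <tr class="grp"><th class="col">MyGrp1</th></tr>
    <tr class="item">
      <th class="col label" scope="row">Item0.1 Header</th>
      <td class="col data">Item0.1 Value</td>
    </tr>
    <tr class="grp"><th class="col label" colspan="2">MyGrp2</th></tr>
    <tr class="item">
      <th class="col label" scope="row">Item1.1 Header</th>
      <td class="col data">Item1.1 Value</td>
    </tr>
  </tbody>
</table>
"""


def parse_groups(html):
    """Return a list of (group_name, [(label, value), ...]) in document order."""
    soup = BeautifulSoup(html, "html.parser")
    groups = []
    for tr in soup.find_all("tr"):
        classes = tr.get("class") or []  # rows without a class are skipped
        if "grp" in classes:
            groups.append((tr.th.get_text(strip=True), []))
        elif "item" in classes:
            if not groups:  # item row seen before any group header
                groups.append((None, []))
            label = tr.th.get_text(strip=True)
            value = tr.td.get_text(strip=True)
            groups[-1][1].append((label, value))
    return groups


groups = parse_groups(sample)
for name, items in groups:
    if name is not None:
        print(name)
    for label, value in items:
        print("\t{}\t{}".format(label, value))
```

Keeping parsing and printing separate means the same `groups` list could also be written to a file or converted to a dict without re-reading the HTML.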
I did the following to get the answer, and I am posting my solution here. Please correct me if I am wrong.
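# table_t is assumed to be the parsed <table> tag, e.g. soup.find('table', id='myd')
result = ""
for tr in table_t.find_all('tr'):
    classes = tr.get("class") or []  # rows without a class attribute give None
    if 'grp' in classes:
        # group row: each header cell starts a new line
        for th in tr.find_all('th'):
            result += "\n" + th.text.strip()
    elif 'item' in classes:
        # item row: tab-indented header and value
        children_th = tr.find("th")
        children_td = tr.find("td")
        result += "\n\t" + children_th.text.strip() + "\t" + children_td.text.strip()
print(result)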
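Where BeautifulSoup is not available, roughly the same walk can be done with the standard library's html.parser.HTMLParser. This is a swapped-in technique, not what the answers above use, and only a sketch: it assumes every row is marked with class "grp" or "item" as in the question, that cells are not nested, and that the markup is valid. `GroupedTableParser` and `html_text` are hypothetical names.

```python
from html.parser import HTMLParser


class GroupedTableParser(HTMLParser):
    """Collect <tr class="grp"> / <tr class="item"> rows as output lines."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished output lines
        self.row_class = None  # "grp", "item", or None for the current row
        self.cells = []        # cell texts of the current row
        self.in_cell = False   # True while inside <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            if "grp" in classes:
                self.row_class = "grp"
            elif "item" in classes:
                self.row_class = "item"
            else:
                self.row_class = None
            self.cells = []
        elif tag in ("th", "td"):
            self.in_cell = True
            self.cells.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False
        elif tag == "tr":
            cells = [c.strip() for c in self.cells]
            if self.row_class == "grp" and cells:
                self.lines.append(cells[0])
            elif self.row_class == "item" and len(cells) >= 2:
                self.lines.append("\t{}\t{}".format(cells[0], cells[1]))
            self.row_class = None


# Shortened, corrected copy of the question's table (illustrative only).
html_text = """
<table class="table_c" id="myd"><tbody>
<tr class="grp"><th class="col">MyGrp1</th></tr>
<tr class="item"><th class="col label">Item0.1 Header</th><td class="col data">Item0.1 Value</td></tr>
<tr class="grp"><th class="col label" colspan="2">MyGrp2</th></tr>
<tr class="item"><th class="col label">Item1.1 Header</th><td class="col data">Item1.1 Value</td></tr>
</tbody></table>
"""

parser = GroupedTableParser()
parser.feed(html_text)
parser.close()
stdlib_result = "\n".join(parser.lines) + "\n"
print(stdlib_result)
```

Unlike the string concatenation above, this version joins the lines at the end, so the output has no leading blank line.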