Pandas 和 bs4 html 抓取

Question

我正在从 html 文件中提取数据，它是表格格式，所以我编写了这行代码将所有表格转换为 pandas 的数据框。

dfs = pd.read_html("synced_contacts.html")

现在，打印数据框的第二行表格

dfs[1]

output 如下：

我该怎么做才能使信息不重复出现在两列中，如图所示，并将“First Name”中的“First NameDaniela”作为第一列，将“Daniela”作为值

预计 Output：

表HTML结构：

 <title>Synced contacts</title></head><body class="_5vb_ _2yq _a7o5"><div class="clearfix _ikh"><div class="_4bl9"><div class="_li"><div><table style="width:100%;background:white;position:fixed;z-index:99;"><tr style=""><td height="8" style="line-height:8px;">&nbsp;</td></tr><tr style="background:white"><td style="text-align:left;height:28px;width:35px;"></td><td style="text-align:left;height:28px;"><img src="files/Instagram-Logo.png" height="28" alt="Instagram" /></td></tr><tr style=""><td height="5" style="line-height:5px;">&nbsp;</td></tr></table><div style="width:100%;height:44px;"></div></div><div class="_a705"><div class="_3-8y _3-95 _a70a"><div class="_a70d"><div class="_a70e">Synced contacts</div><div class="_a70f">Contacts you&#039;ve synced</div></div></div><div class="_a706" role="main"><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><table style="table-layout: fixed;"><tr><td colspan="2" class="_2pin _a6_q">First Name<div><div>Daniela</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Last Name<div><div>Guevara</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Contact Information<div><div>3017004914</div></div></td></tr></table></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><table style="table-layout: fixed;"><tr><td colspan="2" class="_2pin _a6_q">First Name<div><div>Marianna</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Contact Information<div><div>3125761972</div></div></td></tr></table></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><table style="table-layout: fixed;"><tr><td colspan="2" class="_2pin _a6_q">First Name<div><div>Ana Maria</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Last Name<div><div>Garzon</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Contact Information<div><div>3214948507</div></div></td></tr></table></div>

Answer 1

它是由结构引起的，所有内容都放在一个<td>中并将被连接起来， colspan正在创建第二列。

pd.read_html()适合第一次也是最简单的通过，但不一定能处理现实生活中的每一张凌乱的表格。

因此，您可以直接使用BeautifulSoup而不是使用pd.read_html()来适应如何抓取您的需求并从结果中创建dataframe的行为。 .stripped_strings在这里用于将<tr>中每个元素的文本拆分为一个list 。

pd.DataFrame(
    [
        dict([list(row.stripped_strings)for row in t.select('tr')]) 
        for t in soup.select('table:has(._2pin)')
    ]
)

例子

from bs4 import BeautifulSoup
html='''
<title>Synced contacts</title></head><body class="_5vb_ _2yq _a7o5"><div class="clearfix _ikh"><div class="_4bl9"><div class="_li"><div><table style="width:100%;background:white;position:fixed;z-index:99;"><tr style=""><td height="8" style="line-height:8px;">&nbsp;</td></tr><tr style="background:white"><td style="text-align:left;height:28px;width:35px;"></td><td style="text-align:left;height:28px;"><img src="files/Instagram-Logo.png" height="28" alt="Instagram" /></td></tr><tr style=""><td height="5" style="line-height:5px;">&nbsp;</td></tr></table><div style="width:100%;height:44px;"></div></div><div class="_a705"><div class="_3-8y _3-95 _a70a"><div class="_a70d"><div class="_a70e">Synced contacts</div><div class="_a70f">Contacts you&#039;ve synced</div></div></div><div class="_a706" role="main"><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><table style="table-layout: fixed;"><tr><td colspan="2" class="_2pin _a6_q">First Name<div><div>Daniela</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Last Name<div><div>Guevara</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Contact Information<div><div>3017004914</div></div></td></tr></table></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><table style="table-layout: fixed;"><tr><td colspan="2" class="_2pin _a6_q">First Name<div><div>Marianna</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Contact Information<div><div>3125761972</div></div></td></tr></table></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><table style="table-layout: fixed;"><tr><td colspan="2" class="_2pin _a6_q">First Name<div><div>Ana Maria</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Last Name<div><div>Garzon</div></div></td></tr><tr><td colspan="2" class="_2pin _a6_q">Contact Information<div><div>3214948507</div></div></td></tr></table></div>
'''

soup = BeautifulSoup(html)

pd.DataFrame(
    [
        dict([list(row.stripped_strings)for row in t.select('tr')]) 
        for t in soup.select('table:has(._2pin)')
    ]
)

Output

	名	姓	联系信息
0	丹妮拉	格瓦拉	3017004914
1个	玛丽安娜	楠	3125761972
2个	安娜玛丽亚	加尔松	3214948507

Pandas 和 bs4 html 抓取

问题描述

1 个解决方案

解决方案1
0 已采纳 2022-11-26 08:48:19

例子

Output

Pandas 和 bs4 html 抓取

问题描述

1 个解决方案

解决方案1 0 已采纳 2022-11-26 08:48:19

例子

Output

解决方案1
0 已采纳 2022-11-26 08:48:19