How to remove parent tag with BeautifulSoup

I am trying to remove the header cells from an HTML table using BeautifulSoup. I have something like this:

<tr> <th> head1 </th> <th> head2 </th> </tr>

I am using the following code to remove all the header cells:

soup = BeautifulSoup(url)
for headless in soup.find_all('th'):
    headless.decompose()

This works great, except that I am left with an empty row, which messes things up later:

<tr> </tr>

I tried the following code, but I get AttributeError: 'NoneType' object has no attribute 'decompose':

for headless in soup.find_all('th'):
    headless.parent.decompose()

How can I either get rid of the row containing the header cells or remove the blank row later? Thanks.

That's because the first iteration (when headless is <th>head1</th>) removes the outer <tr>, which destroys both header cells along with it. When the loop then reaches <th>head2</th>, its parent is None.
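The failure can be reproduced in isolation. The following is a minimal sketch, assuming the bs4 package with Python's built-in html.parser; the sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mirroring the header row from the question.
html = (
    "<table>"
    "<tr><th>head1</th><th>head2</th></tr>"
    "<tr><td>a</td><td>b</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

# find_all() builds the full list of <th> cells before anything is removed.
header_cells = soup.find_all("th")

# Decomposing the parent of the first cell destroys the whole <tr>,
# including the second <th> that is still referenced by the list.
header_cells[0].parent.decompose()

# The second cell no longer has a parent, so calling
# header_cells[1].parent.decompose() would raise the AttributeError.
second_parent = header_cells[1].parent
print(second_parent)
```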

Instead, you could iterate over the <tr> elements that contain a <th>, like so:

for headless in (tr for tr in soup.find_all('tr') if tr.find('th')):
    headless.decompose()
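For reference, here is the same idea as a self-contained sketch that can be run as is. It assumes the bs4 package with the built-in html.parser; the helper name remove_header_rows and the sample table are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table: one header row and one data row.
html = """
<table>
  <tr> <th> head1 </th> <th> head2 </th> </tr>
  <tr> <td> a </td> <td> b </td> </tr>
</table>
"""


def remove_header_rows(soup):
    """Remove every <tr> that contains a <th>; return the number of rows removed."""
    removed = 0
    for row in soup.find_all("tr"):
        # Check each row when it is visited, then drop the row itself
        # rather than the parent of each individual cell.
        if row.find("th") is not None:
            row.decompose()
            removed += 1
    return removed


soup = BeautifulSoup(html, "html.parser")
removed = remove_header_rows(soup)
print(removed, "header row(s) removed")
print(soup.prettify())
```

Because each row is decomposed at most once, no stale <th> reference is ever asked for its parent, and no empty <tr> is left behind.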
