Capture information in a list using bs4 when a table is embedded in the list
For this HTML code:
<ul><li>Include these codes as defined in http://unitsofmeasure.org
<table><tr><td><b>Code</b>
</td><td><b>Display</b></td></tr>
<tr><td>min</td><td>Minute</td><td></td></tr>
<tr><td>h</td><td>Hour</td><td></td></tr><tr>
<td>d</td><td>Day</td><td></td></tr>
</table></li></ul>
I just want the information in the <li> section itself, i.e. "Include these codes as defined in http://unitsofmeasure.org". But because </li> closes after the table, BS4 also captures the information in the table. This is my code:
definition = [li.get_text() for li in ul.findAll("li")]
And this is the output:
[u'Include these codes as defined in http://unitsofmeasure.orgCodeDisplayminMinutehHourdDaywkWeekmoMonthaYear']
How can I edit the code so that it does not capture the information in the table?
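You can use extract() to remove the table. Note that extract() returns the tag it removed, so calling get_text() on its result would give the table's text; remove the tables first, then read the text of each <li>.
for table in ul.findAll("table"): table.extract()
definition = [li.get_text() for li in ul.findAll("li")]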
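The extract() approach can be sketched end to end as follows. This is a minimal sketch, not a definitive implementation: the variable name html, the use of the built-in "html.parser", and passing strip=True to get_text() are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: the markup from the question, held in a string named `html`.
html = """<ul><li>Include these codes as defined in http://unitsofmeasure.org
<table><tr><td><b>Code</b>
</td><td><b>Display</b></td></tr>
<tr><td>min</td><td>Minute</td><td></td></tr>
<tr><td>h</td><td>Hour</td><td></td></tr><tr>
<td>d</td><td>Day</td><td></td></tr>
</table></li></ul>"""

soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

# Detach every table from the tree first. extract() returns the removed
# tag, so its return value is deliberately ignored here.
for table in ul.find_all("table"):
    table.extract()

# With the tables gone, get_text() only sees the text left in each <li>.
definition = [li.get_text(strip=True) for li in ul.find_all("li")]
print(definition)
```

Keep in mind that extract() modifies the parsed tree, so the tables are no longer available from soup afterwards. If they are still needed, parse the document a second time or use an approach that does not remove anything.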
Try to move up from the table tag using previousSibling (spelled previous_sibling in bs4; the old name is a legacy alias). More info about available methods at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
t = soup.find('table')
print(t.previous_sibling)  # the text node just before the table
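The sibling-based approach can be sketched as a self-contained script. This sketch assumes that the wanted text sits directly before the table inside each <li>; the shortened markup in html and the fallback for list items without a table are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened version of the question's markup.
html = (
    "<ul><li>Include these codes as defined in http://unitsofmeasure.org\n"
    "<table><tr><td>min</td><td>Minute</td></tr></table></li></ul>"
)
soup = BeautifulSoup(html, "html.parser")

definition = []
for li in soup.find_all("li"):
    table = li.find("table")
    if table is None:
        # No embedded table: the whole item text is what we want.
        definition.append(li.get_text(strip=True))
    else:
        # The node immediately before the table is expected to be the
        # text we want; the tree itself is left unchanged.
        sibling = table.previous_sibling
        definition.append(str(sibling).strip() if sibling is not None else "")

print(definition)
```

Unlike extract(), this leaves the tables in place, at the cost of relying on the text being the node immediately before the table.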