Capture information in a list using bs4 when a table is embedded in the list

For this HTML code:

<ul><li>Include these codes as defined in http://unitsofmeasure.org
    <table><tr><td><b>Code</b>
    </td><td><b>Display</b></td></tr>
    <tr><td>min</td><td>Minute</td><td></td></tr>
    <tr><td>h</td><td>Hour</td><td></td></tr><tr>
    <td>d</td><td>Day</td><td></td></tr>
    </table></li></ul>

I only want the text that belongs to the <li> itself, that is, "Include these codes as defined in http://unitsofmeasure.org". But because the closing </li> comes after the table, BS4 also captures the text inside the table. This is my code:

definition = [li.get_text() for li in ul.findAll("li")]

And this is the output:

[u'Include these codes as defined in http://unitsofmeasure.orgCodeDisplayminMinutehHourdDaywkWeekmoMonthaYear']

How can I edit the code so that it does not capture the information in the table?

You can use extract() to remove the table before reading the text.

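for table in ul.find_all("table"):
    table.extract()  # detach the table; extract() returns the removed tag, not the <li>
definition = [li.get_text(strip=True) for li in ul.find_all("li")]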
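
A self-contained sketch of the extract() approach may make the order of operations clearer. It assumes the markup from the question is held in a string and parsed with the built-in html.parser; the names html and li_text_without_tables are made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (only the rows shown there).
html = """<ul><li>Include these codes as defined in http://unitsofmeasure.org
    <table><tr><td><b>Code</b>
    </td><td><b>Display</b></td></tr>
    <tr><td>min</td><td>Minute</td><td></td></tr>
    <tr><td>h</td><td>Hour</td><td></td></tr><tr>
    <td>d</td><td>Day</td><td></td></tr>
    </table></li></ul>"""


def li_text_without_tables(markup):
    """Return the text of every <li>, ignoring any nested <table>."""
    soup = BeautifulSoup(markup, "html.parser")
    # extract() detaches a tag from the tree and returns the detached tag,
    # so it is called here only for its side effect.
    for table in soup.find_all("table"):
        table.extract()
    # With the tables gone, get_text() sees only the remaining <li> text.
    return [li.get_text(strip=True) for li in soup.find_all("li")]


definition = li_text_without_tables(html)
print(definition)
```

Note that this modifies the parsed tree: the tables are no longer part of soup afterwards, so parse the markup again if the table data is needed later.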

Try stepping back from the table tag to the node just before it with previousSibling (spelled previous_sibling in bs4). More information about the available methods is at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names

t = soup.find('table')
print(t.previous_sibling)  # the text node that comes right before <table>
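
As a runnable sketch of the sibling approach, the snippet below uses a shortened version of the question's markup (the html string and the variable names are assumptions for illustration). It also shows a variant that keeps only the text nodes that are direct children of each <li>, which may be more robust when the text is not immediately before the table.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened sample markup, assumed for this example.
html = (
    "<ul><li>Include these codes as defined in http://unitsofmeasure.org\n"
    "    <table><tr><td>min</td><td>Minute</td></tr></table></li></ul>"
)

soup = BeautifulSoup(html, "html.parser")

# previous_sibling is the node directly before <table> inside the same <li>;
# in this markup it is the text node that holds the sentence.
table = soup.find("table")
before_table = table.previous_sibling.strip()

# Variant: join only the strings that are direct children of each <li>,
# which skips everything nested inside the table without modifying the tree.
direct_text = [
    "".join(
        child for child in li.children if isinstance(child, NavigableString)
    ).strip()
    for li in soup.find_all("li")
]

print(before_table)
print(direct_text)
```

Both results contain only the sentence from the <li>; unlike extract(), neither variant changes the parsed document.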
