Parsing an HTML file with BeautifulSoup

Question

I have this HTML file:

<html>
    <head></head>
    <body>
        Text1  
        Text2
        <a href="XYCL7Q.html">
            Text3 
        </a>
    </body>
</html>

I want to collect Text1, Text2 and Text3 separately. Text3 is no problem, but I can't capture Text1 and Text2. When I do this:

 from urllib import urlopen  # Python 2; on Python 3: from urllib.request import urlopen
 from bs4 import BeautifulSoup

 url = 'myUrl'
 html = urlopen(url).read()
 soup = BeautifulSoup(html, 'html.parser')
 soup.body.get_text()

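I get all of the text in one piece. The first problem is that Text3 is included again. The second is that the text is not split cleanly, since Text1 and Text2 may themselves contain spaces. For example, if Text1 is "hello world" and Text2 is "foo bar", I want to end up with a list of 2 strings: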

 results = ['hello world', 'foo bar']

How can I do this? Thanks for your answers.

Answer 1

The text you want is the first child node of "body". You can pull it out and strip the surrounding whitespace:

>>> from bs4 import BeautifulSoup as bs
>>> soup=bs("""<html>
...     <head></head>
...     <body>
...         Text1  
...         Text2
...         <a href="XYCL7Q.html">
...             Text3 
...         </a>
...     </body>
... </html>""", 'html.parser')
>>> body=soup.find('body')
>>> type(next(body.children))
<class 'bs4.element.NavigableString'>
>>> next(body.children)
u'\n        Text1  \n        Text2\n        '
>>> [stripped for stripped in (item.strip() for item in next(body.children).split('\n')) if stripped]
[u'Text1', u'Text2']
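
If the bare text is not guaranteed to be the very first child of "body", a slightly more general variant is to walk over all direct string children and split each one into lines. The following is only a sketch under a few assumptions: it is written for Python 3 with a bs4 version that accepts the string= argument to find_all, it uses the built-in 'html.parser', and direct_text_lines is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

HTML = """<html>
    <head></head>
    <body>
        Text1  
        Text2
        <a href="XYCL7Q.html">
            Text3 
        </a>
    </body>
</html>"""


def direct_text_lines(tag):
    """Return stripped, non-empty lines of text that are direct children of tag.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    lines = []
    # recursive=False restricts the search to direct children, so text that
    # sits inside nested tags (Text3 inside <a>) is not collected.
    for node in tag.find_all(string=True, recursive=False):
        for line in node.splitlines():
            line = line.strip()
            if line:  # drop lines that were whitespace only
                lines.append(line)
    return lines


soup = BeautifulSoup(HTML, "html.parser")
results = direct_text_lines(soup.body)
print(results)
```

Because each line is only stripped at its ends, inner spaces survive, so a line such as "hello world" stays one string. Text3 can still be read separately, for example with soup.body.a.get_text().strip().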

Solution 1
0 2014-12-05 18:10:20

使用BeautifulSoup解析html文件

问题描述

1 个解决方案

解决方案1 0 2014-12-05 18:10:20

解决方案1
0 2014-12-05 18:10:20