How to extract text from html conditionally in beautifulsoup

I am trying to extract specific text from a website with the following HTML:

              ...
               <tr>
                <td>
                 <strong>
                  Location:
                 </strong>
                </td>
                <td colspan="3">
                 90 km S. of Prince Rupert
                </td>
               </tr>
              ...

I want to extract the text that comes after "Location:" (i.e. "90 km S. of Prince Rupert"). There are a whole load of similar websites that I want to loop through, grabbing the text that follows "Location:" on each.

I am quite new to Python and haven't been able to find a way to extract text based on a condition like this.

My understanding is that BS does not handle malformed HTML as well as lxml does. I could be wrong, but I have generally used lxml for these types of problems. Here is some code you can play with to better understand how to work with the elements. There are lots of approaches.

In my opinion, the best place to get lxml is here

from lxml import html

ms = '''<tr>
            <td>
             <strong>
              Location:
             </strong>
            </td>
            <td colspan="3">
             90 km S. of Prince Rupert
            </td>
            <mytag>
            Hello World
            </mytag>
           </tr>'''

mytree = html.fromstring(ms)  # this creates a 'tree' in memory
for e in mytree.iter():       # iterate through the elements
    if e.tag == 'td':         # focus on the elements that are td elements
        if 'location' in e.text_content().lower():  # if location is in the text of a td
            for sib in e.itersiblings():  # find all the following siblings of the td
                print(repr(sib.text_content()))  # print the text, whitespace included

'\n 90 km S. of Prince Rupert\n '
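The value above still carries the newlines and indentation from the markup. A minimal sketch of tidying it with plain string methods; the variable names and the exact amount of whitespace in `raw` are illustrative assumptions, not part of the original answer:

```python
# Illustrative raw value, shaped like what text_content() returns for the cell.
raw = '\n             90 km S. of Prince Rupert\n            '

# strip() removes leading and trailing whitespace only.
stripped = raw.strip()

# split() + join() also collapses any runs of whitespace inside the text.
collapsed = " ".join(raw.split())

print(stripped)
print(collapsed)
```

For this cell both forms give the same result; the second is the safer choice when a cell's text wraps over several lines.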

There is a lot to learn here, but lxml is pretty introspective:

>>> help(e.itersiblings)
Help on built-in function itersiblings:

itersiblings(...)
    itersiblings(self, tag=None, preceding=False)

    Iterate over the following or preceding siblings of this element.

    The direction is determined by the 'preceding' keyword which
    defaults to False, i.e. forward iteration over the following
    siblings.  When True, the iterator yields the preceding
    siblings in reverse document order, i.e. starting right before
    the current element and going left.  The generated elements
    can be restricted to a specific tag name with the 'tag'
    keyword.

Note: I changed the string a little and added mytag, so see the new code below, which is based on the help for itersiblings:

for e in mytree.iter():
    if e.tag == 'td':
        if 'location' in e.text_content().lower():
            for sib in e.itersiblings(tag='mytag'):  # only siblings named mytag
                print(repr(sib.text_content()))


'\n Hello World\n '
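Since the question asked about BeautifulSoup specifically, here is a rough sketch of the same idea with bs4, assuming the label always sits in a `<strong>` inside a `<td>` and the value is in the next `<td>`. The function name `extract_location` and the `sample` string are hypothetical, made up for illustration, and the built-in `html.parser` is used so no extra parser is required:

```python
from bs4 import BeautifulSoup


def extract_location(page_html):
    """Return the text of the cell after the 'Location:' label, or None."""
    soup = BeautifulSoup(page_html, "html.parser")

    # Find the <strong> element whose text contains the label.
    label = soup.find("strong", string=lambda s: s and "Location:" in s)
    if label is None:
        return None

    # Step up to the enclosing <td>, then across to the next <td>.
    label_cell = label.find_parent("td")
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling("td")
    if value_cell is None:
        return None

    # strip=True drops the surrounding newlines and indentation.
    return value_cell.get_text(strip=True)


# Hypothetical sample page built from the snippet in the question.
sample = """
<table>
 <tr>
  <td>
   <strong>
    Location:
   </strong>
  </td>
  <td colspan="3">
   90 km S. of Prince Rupert
  </td>
 </tr>
</table>
"""

print(extract_location(sample))
```

To cover many pages, the same function could be called once per downloaded page; it returns None when a page has no "Location:" row, so those pages can be skipped.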
