Beautiful Soup Object Omitting Information

Problem: The BeautifulSoup object seems to drop valuable information from the HTML. Why is it doing this, and how can I extract this field?

Example: The raw HTML I'm interested in looks like this:

<div id="KittyChow">
    <h4 class="noteText">foodAmount</h4>
    <span>< 1 tsp</span>
</div>

When I create my soup object, however, the corresponding lines of HTML become:

<div id="KittyChow"><h4 class="noteText">foodAmount</h4><span></span></div>

My problem and question: Why has it deleted the text between <span> and </span>? Is it because the less-than sign (<) looked like the start of some HTML, so it was stripped? I want to know WHY this happens. I couldn't find an explanation in the documentation. Is there ANY WAY to parse this in BeautifulSoup?

Second: How do I extract this < 1 tsp value? I've tried building a regex with a left and a right endpoint, and that ALMOST works. I know how to use a regex to return text if I specify a "left substring match" and a "right substring match." For instance, the code below will return "cat."

import re

text = "The cat is obese."
left = "The "
right = " is obese."

# re.escape keeps characters such as "." in the endpoints literal.
pattern = re.compile(re.escape(left) + "(.*?)" + re.escape(right))
answer = pattern.findall(text)[0]

print(answer)  # cat

The issue is that when I replace the left and right match strings with HTML, I get an "index is out of bounds" error, because of the whitespace and indentation that come with converting the HTML into a string.

So as you can tell, I've done a fair bit of research, and I'm still stuck on extracting < and > signs from the fields/attributes of HTML tags using both BeautifulSoup and Python's re module. Please help me? :)

Do you have control over your HTML? It is malformed. Instead of

<div id="KittyChow">
    <h4 class="noteText">foodAmount</h4>
    <span>< 1 tsp</span>
</div>

it should look like

<div id="KittyChow">
    <h4 class="noteText">foodAmount</h4>
    <span>&lt; 1 tsp</span>
</div>

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

If you are generating the HTML on the server side, it should be easy to encode your entities in any language, for example PHP, Python, or Ruby.
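In Python, for instance, the standard library's html.escape covers this. A minimal sketch, assuming the value arrives as a plain string; the food_amount variable and the surrounding <span> template are made up for illustration:

```python
import html

# Hypothetical value that would be interpolated into the page.
food_amount = "< 1 tsp"

# html.escape turns &, < and > (and, by default, quotes) into entities.
escaped = html.escape(food_amount)
snippet = "<span>{}</span>".format(escaped)

print(snippet)  # <span>&lt; 1 tsp</span>
```

html.unescape reverses the encoding if you need the original text back.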

Edit: According to this other answer, https://stackoverflow.com/a/14171433/1253312, you can do this:

BeautifulSoup("<div> < 20 </div>", "html5lib")

This tells BeautifulSoup to use a different parser, html5lib, which can handle the stray < character.

The HTML is broken. You can't have an unescaped < character in HTML; the parser will get mightily confused. As a workaround, in this particular example you could replace < followed by a space with &lt; followed by a space:

raw_html = raw_html.replace("< ", "&lt; ")

This is not a general solution, though.
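A somewhat broader sketch of the same idea, assuming the only damage is a stray < that does not open a tag: escape every < that is not followed by a letter, /, ! or ?. The pattern and the helper name escape_stray_lt are my own illustration, not part of BeautifulSoup, and it will still miss text such as "a <b", where the stray < happens to be followed by a letter.

```python
import re

# "<" not followed by a letter, "/", "!" or "?" cannot open a tag,
# a closing tag, a comment/doctype or a processing instruction.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")

def escape_stray_lt(raw_html):
    """Replace stray "<" characters with the &lt; entity (hypothetical helper)."""
    return STRAY_LT.sub("&lt;", raw_html)

raw_html = '<div id="KittyChow"><h4 class="noteText">foodAmount</h4><span>< 1 tsp</span></div>'
fixed_html = escape_stray_lt(raw_html)
print(fixed_html)
```

The cleaned string can then be handed to BeautifulSoup as usual.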
