Beautiful Soup object omits information

Question

Problem: The Beautiful Soup object seems to delete valuable information from the HTML. Why is it doing this, and how can I extract this field?

Example: The raw HTML I'm interested in looks like this:

<div id="KittyChow">
            <h4 class="noteText">foodAmount</h4>
            <span>< 1 tsp</span>
        </div>

When I create my soup object, however, the corresponding lines of HTML become:

<div id="KittyChow"><h4 class="noteText">foodAmount</h4><span></span></div>

My problem and question: Why has it deleted the information between <span> and </span>? Is it because the less-than sign (<) looked like the start of some HTML, so it was stripped? I want to know WHY this happens. I couldn't find an explanation in the documentation. Is there ANY WAY to parse this in BeautifulSoup?

Second: How do I extract this "< 1 tsp" value? I've tried creating a regex with a left and right endpoint, and that ALMOST works. I know how to use regex to return text if I specify a "left substring match" and a "right substring match." For instance, the code below will return "cat":

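import re

string = "The cat is obese."
left = "The"
right = "is obese."

# Escape the delimiters so characters such as "." are matched literally.
pattern = re.compile(re.escape(left) + "(.*?)" + re.escape(right))
answer = pattern.findall(string)[0].strip()  # drop the spaces around "cat"

print(answer)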

The issue is, when I replace the left and right match strings with HTML, I get an "index out of bounds" error, because of the whitespace and indentation that come with treating the HTML as a string.
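A likely cause, offered as a sketch rather than a diagnosis of the original script: by default `.` in Python's `re` module does not match newlines, so a pattern that has to span several lines of HTML finds nothing, and `findall(...)[0]` then raises `IndexError`. The example below assumes the raw HTML is held in a variable named `raw_html` (a hypothetical name). It uses `re.DOTALL` so `.` crosses line breaks, and `search` so a failed match yields `None` instead of an exception.

```python
import re

# The markup from the question, kept with its original indentation.
raw_html = """<div id="KittyChow">
            <h4 class="noteText">foodAmount</h4>
            <span>< 1 tsp</span>
        </div>"""

# re.DOTALL lets "." match newlines, so the lazy ".*?" can skip over
# the line breaks and indentation between the div and the span.
pattern = re.compile(r'<div id="KittyChow">.*?<span>(.*?)</span>', re.DOTALL)

match = pattern.search(raw_html)
food_amount = match.group(1).strip() if match else None

print(food_amount)
```

This works on the raw string before it reaches any parser, so it does not depend on how the parser treats the stray "<".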

So as you can tell, I've done a fair bit of research, and I'm still stuck on extracting < and > signs within the text of HTML tags using both BeautifulSoup and Python's regex module. Please help me? :)

Answer 1

Do you have control over your html? It is malformed. Instead of

<div id="KittyChow">
    <h4 class="noteText">foodAmount</h4>
    <span>< 1 tsp</span>
</div>

It should look like

<div id="KittyChow">
    <h4 class="noteText">foodAmount</h4>
    <span>&lt; 1 tsp</span>
</div>

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

If you are generating the html on the server-side, it should be easy in any language to encode your entities (for example in PHP, Python, or Ruby).
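In Python, for instance, the standard library's `html.escape` performs this encoding. A minimal sketch, assuming the value arrives as a plain string before it is placed into the markup (the variable names are illustrative, not from the original post):

```python
import html

food_amount = "< 1 tsp"  # illustrative value taken from the question

# html.escape replaces "&", "<" and ">" (and quotes by default)
# with their character entity references.
escaped = html.escape(food_amount)
span = "<span>{}</span>".format(escaped)

print(span)
```

`html.unescape(escaped)` reverses the encoding if the original text is needed again.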

edit: According to this other answer: https://stackoverflow.com/a/14171433/1253312 you can do this:

BeautifulSoup("<div> < 20 </div>", "html5lib")

This tells BS to use a different parser (html5lib, which has to be installed separately), which can handle the < character.

Answer 2

The HTML is broken. You can't have an unescaped < character in HTML; the parser will get mightily confused. As a workaround, in this particular example you could replace < followed by a space with &lt; followed by a space:

raw_html = raw_html.replace("< ", "&lt; ")

This is not a general solution, though.
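A slightly broader variant of the same workaround, still a heuristic and not a general solution: escape every "<" that cannot begin a tag, closing tag, comment, or declaration. The pattern and the names `stray_lt` and `fixed_html` are assumptions made for this sketch.

```python
import re

raw_html = '<div id="KittyChow"><h4 class="noteText">foodAmount</h4><span>< 1 tsp</span></div>'

# Match a "<" that is NOT followed by a letter, "/", "!" or "?".
# Such a "<" cannot start a tag, so it is treated as literal text.
stray_lt = re.compile(r"<(?![A-Za-z/!?])")
fixed_html = stray_lt.sub("&lt;", raw_html)

print(fixed_html)
```

Text such as `a<b` would still be misread as the start of a tag, so the repaired string should be checked before it is handed to the parser.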


Answer 1: score 1, accepted, 2013-06-14 19:09:40

Answer 2: score 0, 2013-06-14 19:10:25
