Beautiful Soup parsing error

Question

I am trying to use BeautifulSoup to first remove the <a> tags in an HTML string but keep their content. After that I would like to remove all remaining tags and replace them with newlines.

The strip_tags function is from this post.

Here is an example of what I am trying to do:

text = "<p>This is a <a>test</a></p>"
soup = strip_tags(text, ["a"])
plain_text = soup.get_text("\n")
print(plain_text)

For some reason the output is u'This is a \ntest'. If the <a> tag has already been stripped out, why does it think it is still there?

The expected output is This is a test.

A more complex example: First<a>Link</a>Second

How can I separate the text between tags and still be able to strip the <a> tag out?

Indeed, if you print soup.encode_contents(), no <a> is there.

Answer 1

The strip_tags function is from this post.

That function recursively replaces tags with the text they contain.

Thus, your '<a>test</a>' is replaced with 'test'. There are no '<a>' tags left.
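To see the same effect without the linked helper, here is a minimal sketch. It does not reproduce strip_tags from the linked post; it assumes BeautifulSoup's built-in unwrap() is a fair stand-in, since it also replaces a tag with whatever the tag contains.

```python
from bs4 import BeautifulSoup

text = "<p>This is a <a>test</a></p>"
soup = BeautifulSoup(text, "html.parser")

# Replace every <a> tag with its own children (text and nested tags).
# unwrap() is used here as a stand-in for the strip_tags helper.
for a_tag in soup.find_all("a"):
    a_tag.unwrap()

markup = str(soup)
print(markup)  # <p>This is a test</p>
```

The serialized markup no longer contains any <a> tag, which matches what soup.encode_contents() shows in the question.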

Answer 2

It behaves that way because the strip_tags function is manipulating NavigableStrings (which is why you see all the unicode casts in strip_tags).

When you run soup.get_text("\n"), it sees the text as separate NavigableString pieces and adds the "\n" at each split, even though there is no <a> tag present.

Why not just use get_text() to get the text with the tags removed?
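The split can be made visible with a short sketch. It again assumes unwrap() as a stand-in for strip_tags, and the re-parse step at the end is one possible workaround, not something the linked helper does.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>This is a <a>test</a></p>", "html.parser")
soup.a.unwrap()  # stand-in for strip_tags: the <a> tag is gone, its text stays

# The <p> now holds two adjacent strings instead of one merged string.
pieces = list(soup.strings)
print(pieces)  # ['This is a ', 'test']

# get_text() puts the separator between every string, hence the stray newline.
split_text = soup.get_text("\n")
print(repr(split_text))  # 'This is a \ntest'

# Workaround: serialize and parse again so adjacent strings are merged.
merged = BeautifulSoup(str(soup), "html.parser")
merged_text = merged.get_text("\n")
print(repr(merged_text))  # 'This is a test'
```

After re-parsing, get_text("\n") only inserts newlines where real tag boundaries remain, which is the separation the question asks for. Recent Beautiful Soup releases (4.8.0 and later) also provide soup.smooth(), which merges adjacent strings in place.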

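from bs4 import BeautifulSoup

text = "<p>This is a <a>test</a> man</p> <p> more stinking <a>p</a> tags </p>"
soup = BeautifulSoup(text, 'html.parser')
# get_text() with no separator keeps the <a> text inline within each paragraph.
ptags = soup.find_all('p')
mytext = ""
for tag in ptags:
    # One line of output per <p> tag.
    mytext = mytext + tag.get_text() + "\n"
print(mytext)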
