Beautiful Soup Parsing Error

I am trying to use BeautifulSoup to first remove the <a> tags in an HTML string while keeping their content, and then remove all remaining tags and replace them with new lines.

The strip_tags function is from this post.

Here is an example of what I am trying to do:

text = "<p>This is a <a>test</a></p>"
soup = strip_tags(text, ["a"])      # strip_tags is the helper from the linked post
plain_text = soup.get_text("\n")
print(plain_text)

For some reason the output is u'This is a \ntest'. If the <a> tag has already been stripped out, why does it behave as if the tag were still there?

The expected output is This is a test.

A more complex example: <p>First</p><a>Link</a><p>Second</p>

How can I keep a new line between the <p> tags while still stripping the <a> tags out?

Indeed, if you print soup.encode_contents(), no <a> tag is there.

The strip_tags function from the post you linked replaces tags with the text they contain, recursively.

Thus, your '<a>test</a>' is replaced with 'test'. There are no '<a>' tags left.

It behaves that way because the strip_tags function manipulates NavigableStrings (which is why you see all the unicode casts in strip_tags).

When you run soup.get_text("\n"), it collects every NavigableString in the tree and inserts "\n" between them. Stripping the <a> tag left "This is a " and "test" as two separate NavigableStrings, so the separator still lands between them even though the tag itself is gone.
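
Here is a minimal sketch of that behaviour. It uses unwrap() as a stand-in for the linked strip_tags helper, since both replace a tag with its children; the markup is just the example from the question:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>This is a <a>test</a></p>", "html.parser")
soup.a.unwrap()                     # remove the <a> tag but keep its text

print(soup.p.contents)              # ['This is a ', 'test'] -- still two NavigableStrings
print(repr(soup.get_text("\n")))    # 'This is a \ntest' -- the separator lands between them
print(repr(soup.get_text()))        # 'This is a test' -- no separator, no stray newline

If your Beautiful Soup version has it (4.8+), calling soup.smooth() after stripping merges those adjacent strings, after which get_text("\n") no longer splits the sentence.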

Why not just use get_text() to get the text with the tags removed?

from bs4 import BeautifulSoup

text = "<p>This is a <a>test</a> man</p> <p> more stinking <a>p</a> tags </p>"
soup = BeautifulSoup(text, 'html.parser')
ptags = soup.find_all('p')                     # one entry per <p> block
mytext = ""
for tag in ptags:
    mytext = mytext + tag.get_text() + "\n"    # get_text() drops the inner <a> tags
print(mytext)
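
For the more complex example in the question (<p>First</p><a>Link</a><p>Second</p>), a similar sketch works, again using unwrap() in place of strip_tags and assuming the link text should simply stay on its own line:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>First</p><a>Link</a><p>Second</p>", "html.parser")

for a in soup.find_all("a"):
    a.unwrap()                      # drop the <a> tags, keep their text

print(soup.get_text("\n"))          # First, Link and Second, each on its own line

Because the former link text now sits between two <p> blocks rather than inside one, the "\n" separator puts it on its own line instead of splitting a sentence.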
