Remove tags (\r, \n, <, >) from a string in a JSON file

Question

I know similar questions have been asked before, but so far I haven't been able to solve my problem, so apologies in advance.

I have a JSON file ('test.json') with text in it. The text looks like this:

"... >>\r\n>> This is a test.>\r\n> \r\n-- \r\nMit freundlichen Gr&uuml;ssen\r\n\r\nMike Klence ..."

The overall output should be the plain text:

"... This is a test. Mit freundlichen Grüssen Mike Klence ..."

With BeautifulSoup I managed to remove the HTML tags, but the >, \r, \n and -- still remain in the text. So I tried the following code:

import codecs
from bs4 import BeautifulSoup

with codecs.open('test.json', encoding = 'utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
    invalid_tags = ['\r', '\n', '<', '>']
    for tag in invalid_tags: 
        for match in soup.find_all(tag):
            match.replace_with()

print(soup.get_text())

But it doesn't do anything with the text in the file. I tried different variations, but nothing seems to change at all.

How can I get my code to work properly? Or, if there is another, easier or faster way, I would be thankful to read about those approaches as well.

By the way, I am using Python 3.6 on Anaconda.

Thank you very much in advance for your help.

Answer 1

You could do this using Python's built-in string method replace().

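with open('test.json', 'r', encoding='utf-8') as f:
    content = f.read()
    # The file stores the escapes as literal text (a backslash followed by
    # r or n), hence the doubled backslashes.
    invalid_tags = ['\\r', '\\n', '<', '>', '-', ';']
    for invalid_tag in invalid_tags:
        content = content.replace(invalid_tag, '')
    # Crude fix-up for the &uuml; entity; the "uml" part is left behind.
    content = content.replace('&u', 'ü')

print(content)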

Output:

...  This is a test.  Mit freundlichen GrüumlssenMike Klence ...
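
Note that the output above still carries the leftover "uml" from the entity, and that removing every '-' and ';' would also hit ordinary punctuation. A minimal sketch of one possible alternative follows. It assumes the text is already available as a Python string with real CR/LF characters (for example after parsing the file with the json module), and the helper name clean_text is hypothetical, not part of the original answer:

```python
import html
import re

def clean_text(raw):
    """Hypothetical helper: strip quote markers, signature dashes and line breaks."""
    # Decode HTML entities such as &uuml; into the characters they stand for.
    text = html.unescape(raw)
    # Replace quote markers and the "--" signature separator with a space.
    text = re.sub(r'[<>]|--', ' ', text)
    # Collapse every run of whitespace (including \r and \n) into one space.
    return ' '.join(text.split())

sample = "... >>\r\n>> This is a test.>\r\n> \r\n-- \r\nMit freundlichen Gr&uuml;ssen\r\n\r\nMike Klence ..."
print(clean_text(sample))
```

Replacing with a space and then collapsing whitespace keeps words from running together, which is why "Grüssen" and "Mike" stay separated here.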

Answer 2

You could also try this one-liner using a regex.

import re

string = "... >>\r\n>> This is a test.>\r\n> \r\n-- \r\nMit freundlichen Gr&uuml;ssen\r\n\r\nMike Klence ..."
updatedString = ''.join(re.split(r'[\r\n\<\>]+',string))

print(updatedString)
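
The regex above works on a Python string that already contains real CR/LF characters. In a JSON file the same characters appear as the escape sequences \r and \n, so it may help to parse the file first. The following is only a sketch: the real structure of test.json is not shown in the question, so the "body" key and the sample document are assumptions made for illustration.

```python
import html
import json
import re

# Hypothetical JSON document; the real layout of test.json is not shown.
raw_json = '{"body": ">> This is a test.>\\r\\n> \\r\\nGr&uuml;ssen"}'

# Parsing turns the JSON escapes \r and \n into real CR/LF characters.
data = json.loads(raw_json)
# Decode HTML entities such as &uuml; before stripping characters.
text = html.unescape(data["body"])
# Same character class as in the answer, applied with re.sub.
cleaned = re.sub(r'[\r\n<>]+', '', text)
print(cleaned)
```

For a file on disk, json.load(f) on the opened file would take the place of json.loads(raw_json).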

Answer 1: score 1, accepted, 2018-11-30 14:58:29

Answer 2: score 0, 2022-12-29 07:18:29
