
Is there a way to remove only BAD characters from a string in Python/pandas?

I am trying to read a PDF using the Camelot library and store it in a dataframe. The resulting dataframe has garbled/bad characters in its string fields.

E.g.: 123Rise – Tower & Troe's Mech –

I want to remove ONLY the garbled characters and keep everything else, including symbols.

I tried regexes such as [^\w.,&,'-\s] to keep only the desirable characters, but then I have to add every special character that should not be removed to the pattern. I also cannot drop the Camelot library.
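One detail about the pattern above: inside a character class, a hyphen placed between ' and \s is read as a range, and as far as I can tell Python's re module rejects a range whose endpoint is a class escape such as \s. Below is a minimal sketch of the same allow-list idea with the hyphen moved to the end of the class, where it is a plain literal. The sample string (with \u2013 standing in for the unwanted dash) and the name NOT_ALLOWED are assumptions for illustration.

```python
import re

# Matches every character that is NOT on the allow-list:
# word characters, whitespace, and a few punctuation marks.
# The hyphen sits last in the class so it is a literal, not a range operator.
NOT_ALLOWED = re.compile(r"[^\w\s.,&'-]")

sample = "123Rise \u2013 Tower & Troe's Mech \u2013"
cleaned = NOT_ALLOWED.sub("", sample)
print(repr(cleaned))
```

Note that \w is Unicode-aware in Python 3, so accented letters would survive this pattern; passing re.ASCII to re.compile restricts \w and \s to ASCII.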

Is there a way to solve this?

You could try using the unicodedata module to normalize your data, for example:

import unicodedata

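def formatString(value, allow_unicode=False):
    value = str(value)
    if allow_unicode:
        # NFKC: fold compatibility characters but keep non-ASCII text.
        value = unicodedata.normalize('NFKC', value)
    else:
        # NFKD: decompose, then drop everything that is not ASCII.
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    return value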

print(formatString("123Rise – Tower & Troe's Mech–"))

Result:

123Rise a Tower & Troe's Mecha
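A stray "a" in output like this typically comes from an accented letter such as â: NFKD splits it into a base letter plus a combining mark, and the ASCII encode step then drops only the mark. The sketch below shows that behavior on a few small inputs. The helper name to_ascii and the sample strings are assumptions for illustration, not part of the answer above.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops what is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

accented = to_ascii("Mech\u00e2")     # a-circumflex keeps its base letter
dashed = to_ascii("a \u2013 b")       # en dash has no decomposition, so it is dropped
ligature = unicodedata.normalize("NFKC", "\ufb01x")  # fi ligature expands to "fi"

print(repr(accented), repr(dashed), repr(ligature))
```

If the leftover base letters are unwanted, the plain non-ASCII filters shown in the other answers avoid them because they skip the decomposition step.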

One way to achieve that is to remove non-ASCII characters.

my_text = "123Rise – Tower & Troe's Mech–"
my_text = ''.join([char if ord(char) < 128 else '' for char in my_text])
print(my_text)

Result:

123Rise  Tower & Troe's Mech

You can also use this website as a reference for standard and extended ASCII characters.

Another approach I commonly use to filter out non-ASCII garbage, which may or may not be relevant here, is:

# Your "messy" data in question.
string = "123Rise – Tower & Troe's Mech–"

# Iterate over each character, and filter by only ord(c) < 128.
clean = "".join([c for c in string if ord(c) < 128])

What is ord? ord returns the Unicode code point of a character as an integer; for ASCII characters this is the same as the ASCII value. You can use this to your advantage by keeping only characters whose code point is below 128 (as above), which limits your text to basic ASCII and drops everything else, without having to work with messy encodings.
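To make the threshold concrete, here is a small sketch that prints the code points of a few characters. The sample characters are assumptions chosen for illustration.

```python
# Map a few characters to their Unicode code points.
samples = ["A", "&", "'", "\u00e2", "\u2013"]
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    # Characters with a code point below 128 are plain ASCII and are kept.
    print(repr(ch), cp, "keep" if cp < 128 else "drop")

# str.isascii() (Python 3.7+) checks a whole string at once.
all_ascii = "Troe's Mech".isascii()
```

Letters, digits, and common symbols such as & and ' fall below 128, so they pass the filter without being listed individually.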

Hope that helps!

Removing non-ASCII characters using regex will be fast:

import re
text = "123Rise – Tower & Troe's Mech–"
re.sub(r'[^\x00-\x7F]+','', text)

The output will be:

"123Rise  Tower & Troe's Mech"

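Since the question is about dataframe fields, here is a minimal sketch of applying the same pattern to a column's values while leaving non-string cells alone. It uses a plain list in place of a real column; the names NON_ASCII, strip_non_ascii, and cells are assumptions for illustration.

```python
import re

# Compile once so the pattern is reused for every cell.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(value):
    # Leave non-string cells (numbers, None) untouched.
    if not isinstance(value, str):
        return value
    return NON_ASCII.sub("", value)

cells = ["123Rise \u2013 Tower & Troe's Mech\u2013", 42, None]
cleaned_cells = [strip_non_ascii(v) for v in cells]
print(cleaned_cells)
```

With pandas, the same function could be passed to Series.map for a given column, for example df["name"].map(strip_non_ascii), where df and "name" are hypothetical.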