
Set of hidden unicode characters in a string

A set of hidden Unicode characters appears in a string and needs to be removed.

I have a very large text that was extracted from a PDF file using the PyPDF2 package. The extracted text has a lot of issues (for example, text from structured tables in the PDF appears in random order after extraction), and lots of special characters get embedded in it (like ~~~~~~~, }}}}}}}} etc.), although they are not present when the file is viewed as a PDF. I tried removing those characters using the solutions described in this, this and this link, but the problem still appears:

myText = "There is a set of hidden character here => <= but it will get printed in console"

print(myText)

Now I would like to have clean text without those hidden characters.
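A quick way to find out what the hidden characters actually are is to look at the repr of the string and at the Unicode category of each non-printable character. The snippet below is only a sketch: the sample string and the names my_text, hidden and report are made up for illustration, with the hidden characters written as explicit escapes so they are visible.

```python
import unicodedata

# Hypothetical sample: the hidden characters are written as \x7f escapes
# so that they can be seen in the source.
my_text = "hidden here =>\x7f\x7f<= end"

# repr() shows non-printable characters as escape sequences
# instead of sending them to the console as-is.
print(repr(my_text))

# Collect each distinct non-printable character with its code point
# and Unicode general category ("Cc" means control character).
hidden = sorted({ch for ch in my_text if not ch.isprintable()})
report = [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in hidden]
print(report)
```

Once the code points are known, it is much easier to decide which filter to apply.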

The character \x7f is the ASCII character DEL, which explains why your attempts did not work. See here for the bytes.decode documentation. To remove all "special" ASCII characters, use this code:

import string
a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
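# Show the raw bytes, including the \x7f (DEL) escapes
print(repr(a))
# Decode as ASCII (ignoring undecodable bytes) and keep only characters in string.printable
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if i in string.printable)))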

Or use this if you don't want to import string:

a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
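# Show the raw bytes, including the \x7f (DEL) escapes
print(repr(a))
# Keep printable ASCII (code points 32 to 126) plus carriage return and newline
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if 31 < ord(i) < 127 or i in '\r\n')))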
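Both snippets above start from bytes and drop everything outside ASCII. Text extracted from a PDF is usually already a str and may contain legitimate non-ASCII characters, so a variant that only removes control characters might be preferable. The following is a minimal sketch under that assumption. The function name strip_control_chars and its keep parameter are hypothetical, not part of any library.

```python
import unicodedata

def strip_control_chars(text, keep="\r\n\t"):
    """Drop control characters (Unicode category Cc) except those in keep."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )

# Hypothetical sample containing an accented letter and three DEL characters.
sample = "caf\u00e9 =>\x7f\x7f\x7f<= ok\n"
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```

This removes \x7f and the other control characters while leaving accented letters and the listed whitespace untouched. Other invisible characters, such as those in category Cf, would need an extra check.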

