
Python regex for removing strange characters

Hello,

I have a list of strings with some strange characters. For instance:

'Replay fortement conseillé �\x9f\x98\x82�\x9f\x98\x82'

Or:

'Le papa du mois �\x9f\x91\x8a'

I want to remove \x9f\x91\x8a and \x9f\x98\x82 \x9f\x98\x82 from these strings.

I tried this regex: ((.?)\\x[0-9]([az]|[0-9])(.?)+)+ but it doesn't work. I'm new to regex, so I'm asking for help.

Thank you

It's probably better to handle those characters than to remove them, but if you do want to remove them in Python, you can do it without regular expressions.

text.decode("ascii", "ignore")

This line decodes a byte string in Python and keeps only the ASCII characters.
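In Python 3 a `str` has no `decode` method, so the same idea needs an encode step first. Here is a minimal sketch, assuming the data is already a `str`; the helper name `strip_non_ascii` and the sample input are made up for illustration. Note that this also drops accented letters such as é.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards characters outside ASCII
    return text.encode("ascii", "ignore").decode("ascii")


# Assumed input: ordinary text followed by an emoji
sample = "Le papa du mois \U0001F44A"
cleaned = strip_non_ascii(sample)
```

Because everything outside ASCII is discarded, this is only suitable when losing accents is acceptable.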

To keep specific characters in the string, like the é in conseillé:

You should find the substring that you want to delete. To do this, you need to find the beginning and the end of the substring.

This is better done with string methods.

For example:

If, in every string, the part to delete starts at the character �

and runs to the end of the string:

import re
re.sub(r'�.*', '', 'Replay fortement conseillé �\x9f\x98\x82�\x9f\x98\x82')
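The same cut can be made with plain string methods instead of `re.sub`. This is a rough sketch, assuming the unwanted run always begins with one known marker character (here U+FFFD, the replacement character that appears in the sample strings) and extends to the end of the string. The name `cut_from_marker` is illustrative only.

```python
def cut_from_marker(text: str, marker: str) -> str:
    """Return the text before the first marker, with trailing spaces removed."""
    # partition() returns the whole string as `head` when the marker is absent
    head, _sep, _tail = text.partition(marker)
    return head.rstrip()


sample = "Replay fortement conseillé \ufffd\x9f\x98\x82\ufffd\x9f\x98\x82"
result = cut_from_marker(sample, "\ufffd")
```

Strings that do not contain the marker come back unchanged apart from trailing whitespace, and the é is preserved.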

I hope this helps.

In my experience it's a little safer to create a list of 'safe' characters to keep. What you want to do today is 'fix' that sentence and get rid of the yucky stuff. But what if more goofball stuff shows up? I have a requirement, set by a business owner, that the data I process keep only 'standard ASCII', so I use this regex:

text = re.sub("[^\x20-\x7E]", "", text)

That way I remove anything that isn't in that character class, which is pretty much anything not on a standard keyboard. You may have better luck going this route. It's hard to predict what trash characters will come down the road, and you end up editing your regex again and again to strip out more stuff. Make a list of stuff to keep :)
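If accented letters such as é need to survive, the keep-list can be widened. The sketch below is one possible variant, not the exact regex from the answer: it assumes that printable ASCII plus the Latin-1 range \xC0-\xFF (À through ÿ) is the right set for your data. The names `NOT_ALLOWED` and `keep_safe` are hypothetical.

```python
import re

# Matches anything that is NOT printable ASCII or a Latin-1 letter (À..ÿ).
# The \xC0-\xFF range is an assumption; adjust it to the characters you need.
NOT_ALLOWED = re.compile(r"[^\x20-\x7E\xC0-\xFF]")


def keep_safe(text: str) -> str:
    """Remove every character outside the allowed set."""
    return NOT_ALLOWED.sub("", text)


sample = "Replay fortement conseillé \ufffd\x9f\x98\x82\ufffd\x9f\x98\x82"
cleaned = keep_safe(sample)
```

With this set, the control characters \x9f, \x98 and \x82 and the replacement character are stripped while é is kept. Be aware that the range also admits × and ÷, and that any stray character that itself falls in À..ÿ would be left in place.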
