
How do I remove hex values in a python string with regular expressions?

I have a cell array in MATLAB:

columns = {'MagX', 'MagY', 'MagZ', ...
           'AccelerationX',  'AccelerationY',  'AccelerationZ', ...
           'AngularRateX', 'AngularRateY', 'AngularRateZ', ...
           'Temperature'}

I use these scripts, which make use of MATLAB's hdf5write function, to save the array in the HDF5 format.

I then read the HDF5 file into Python using PyTables. The cell array comes in as a NumPy array of strings. I convert it to a list, and this is the output:

>>>columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
 'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
 'MagZ\x00\x00\x00\x00\x001',
 'AccelerationX',
 'AccelerationY',
 'AccelerationZ',
 'AngularRateX',
 'AngularRateY',
 'AngularRateZ',
 'Temperature']

These hex values pop into the strings from somewhere and I'd like to remove them. They don't always appear in the first three items of the list, and I need a nice way to deal with them or to find out why they are there in the first place.

>>>print columns[0]
Mag8�
>>>columns[0]
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>>print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

I've tried using a regular expression to remove the hex values but have had little luck.

>>>re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>>re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'

Any suggestions on how to deal with this?

You can remove all non-word characters in the following way:

>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

The regex [^\w] will match any character that is not a letter, digit, or underscore. By passing that regex to re.sub with an empty string as the replacement, you delete all other characters from the string.
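The session above is Python 2, where \w on a plain str only covers ASCII word characters. As a hedged aside, assuming the same value arrives as a Python 3 str instead: \w is Unicode-aware there, so \xe6 (æ) counts as a letter and survives unless re.ASCII is passed. A minimal sketch of the difference:

```python
import re

# The sample value from the question, written as a Python 3 str
# (an assumption; the original session is Python 2).
raw = 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

# Default Python 3 behaviour: \w is Unicode-aware, so '\xe6' (æ) is kept.
unicode_result = re.sub(r'[^\w]', '', raw)

# With re.ASCII, \w means [a-zA-Z0-9_] only, matching the Python 2 result.
ascii_result = re.sub(r'[^\w]', '', raw, flags=re.ASCII)

print(repr(unicode_result))  # 'MagX8æ'
print(repr(ascii_result))    # 'MagX8'
```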

Since there may be other characters you want to keep, a better solution might be to specify a larger range of characters to keep, one that still excludes control characters. For example:

>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems clearer to you.

To also drop every character after the first control character, just add a .* like this:

>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'
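Applied to the whole list, the truncating pattern might look like the sketch below. The helper name clean_names is hypothetical, the code is written for Python 3 str values (an assumption), and re.DOTALL is added so that a newline in the trailing junk is discarded too.

```python
import re

# Matches everything from the first non-printable-ASCII character onward.
# re.DOTALL lets '.' also consume newlines in the discarded tail.
_TRAILING_JUNK = re.compile(r'[^ -~].*', re.DOTALL)

def clean_names(names):
    """Return the names truncated at their first non-printable character."""
    return [_TRAILING_JUNK.sub('', name) for name in names]

# A shortened copy of the list from the question.
columns = [
    'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
    'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
    'MagZ\x00\x00\x00\x00\x001',
    'AccelerationX',
    'Temperature',
]

cleaned = clean_names(columns)
print(cleaned)  # ['MagX', 'MagY', 'MagZ', 'AccelerationX', 'Temperature']
```

Names that contain no control characters pass through unchanged, so the function can be run over the list regardless of which items are affected.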

They're not actually in the string: you have unescaped control characters, which Python displays using hexadecimal notation. That's why you see an unusual symbol when you print the value.
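A small sketch of that point, using a made-up sample string and Python 3 syntax (both assumptions): the \x00 shown by repr is a single NUL character, so a pattern that searches for a literal backslash followed by x has nothing to match.

```python
import re

s = 'foo\x00bar'

# '\x00' in source code is one NUL character, not the four characters \, x, 0, 0.
print(len(s))     # 7
print(ord(s[3]))  # 0

# A pattern looking for a literal backslash followed by 'x' finds nothing...
literal_backslash = re.search(r'\\x', s)
print(literal_backslash)  # None

# ...whereas the regex escape \x00 (interpreted by re itself) matches the NUL.
nul_match = re.search(r'\x00', s)
print(nul_match.start())  # 3
```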

You should be able to simply remove the extra levels of quoting in your regular expression. You might also rely on something like the re module's generic whitespace class \s, which matches whitespace characters beyond just tabs and spaces. Note, though, that NUL (\x00) is not whitespace, so \s leaves it in place:

>>> import re
>>> re.sub(r'\s', '?', "foo\x00bar")
'foo\x00bar'
>>> print re.sub(r'\s', '?', "foo\x00bar")
foobar

I use this one a bit to replace all input whitespace runs, including non-breaking space characters, with a single space:

>>> re.sub(r'[\xa0\s]+', ' ', input_str)
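Since input_str is not defined above, here is a self-contained sketch with a made-up sample value; the function name collapse_whitespace is hypothetical. It is written for Python 3, where \s already covers the non-breaking space for str patterns, so the explicit \xa0 is harmless but redundant there.

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace (including U+00A0) with one space."""
    return re.sub(r'[\xa0\s]+', ' ', text)

# Hypothetical input mixing spaces, a tab, a newline and non-breaking spaces.
input_str = 'alpha \t\n\xa0 beta\xa0\xa0gamma'
collapsed = collapse_whitespace(input_str)
print(collapsed)  # alpha beta gamma
```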

You can also do this without importing re. E.g. if you're content to keep only ASCII characters:

good_string = ''.join(c if ord(c) < 128 else '?' for c in bad_string)
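The one-liner above still keeps ASCII control characters such as \x00 and \x08. Two regex-free variants that drop them are sketched below in Python 3 syntax (an assumption). The second one treats the first NUL as a C-style string terminator, which is a guess about how the names were stored rather than something confirmed here.

```python
# The sample value from the question.
bad_string = 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

# Keep only printable ASCII (space through tilde), dropping everything else.
printable_only = ''.join(c for c in bad_string if ' ' <= c <= '~')

# Or cut at the first NUL, treating it as a string terminator (assumption).
before_nul = bad_string.partition('\x00')[0]

print(repr(printable_only))  # 'MagX8'
print(repr(before_nul))      # 'MagX'
```

str.partition returns the whole string as its first element when no NUL is present, so clean names are left unchanged.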
