How do I remove hex values from a Python string using a regular expression?

Question

I have a cell array in MATLAB:

columns = {'MagX', 'MagY', 'MagZ', ...
           'AccelerationX',  'AccelerationY',  'AccelerationZ', ...
           'AngularRateX', 'AngularRateY', 'AngularRateZ', ...
           'Temperature'}

I use these scripts, which rely on MATLAB's hdf5write function, to save the array in HDF5 format.

I then read the HDF5 file into Python using PyTables. The cell array comes in as a NumPy array of strings. I convert it to a list, and this is the output:

>>>columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
 'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
 'MagZ\x00\x00\x00\x00\x001',
 'AccelerationX',
 'AccelerationY',
 'AccelerationZ',
 'AngularRateX',
 'AngularRateY',
 'AngularRateZ',
 'Temperature']

These hex values pop into the strings from somewhere, and I'd like to remove them. They don't always appear in the first three items of the list, and I need a clean way to deal with them or to find out why they are there in the first place.

>>>print columns[0]
Mag8�
>>>columns[0]
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>>print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

I've tried using a regular expression to remove the hex values, but have had little luck.

>>>re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>>re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'

Any suggestions on how to deal with this?

Answer 1

You can remove all non-word characters in the following way:

>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

The regex [^\w] matches any character that is not a letter, digit, or underscore. Passing that regex to re.sub with an empty string as the replacement deletes every such character from the string.
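One caveat if you run this on Python 3 (an assumption here, since the session above is Python 2): for str patterns, \w is Unicode-aware, so a character such as \xe6 counts as a letter and survives the substitution. A minimal sketch that passes re.ASCII to get back the behavior shown above:

```python
import re

raw = 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

# Python 3 default: \w is Unicode-aware, so '\xe6' (a Latin-1 letter) is kept.
unicode_result = re.sub(r'[^\w]', '', raw)

# With re.ASCII, \w means only [a-zA-Z0-9_], matching the Python 2 session.
ascii_result = re.sub(r'[^\w]', '', raw, flags=re.ASCII)

print(repr(unicode_result))
print(repr(ascii_result))
```

Only the second call reproduces the 'MagX8' result from the answer.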

Since there may be other characters you want to keep, a better solution might be to specify a larger range of characters to keep, one that still excludes control characters. For example:

>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems clearer to you.

To also remove everything after the first control character, just append .* like this:

>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'
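Applied to the whole list from the question, this could look roughly like the following Python 3 sketch. The helper name clean_label is hypothetical, and re.DOTALL is an addition not present in the answer: without it, '.' stops at a newline, so trailing garbage that happened to contain '\n' would be only partly removed.

```python
import re

# Matches the first character outside printable ASCII (space through tilde)
# and everything after it. re.DOTALL lets '.' consume newlines as well.
_JUNK = re.compile(r'[^ -~].*', re.DOTALL)

def clean_label(label):
    """Hypothetical helper: truncate a label at its first non-printable character."""
    return _JUNK.sub('', label)

columns = ['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
           'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
           'MagZ\x00\x00\x00\x00\x001',
           'AccelerationX', 'AccelerationY', 'AccelerationZ',
           'AngularRateX', 'AngularRateY', 'AngularRateZ',
           'Temperature']

cleaned = [clean_label(c) for c in columns]
print(cleaned)
```

Labels that contain no control characters pass through unchanged, so it is safe to run the helper over every item rather than only the first three.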

Answer 2

They're not actually in the string: you have unescaped control characters, which Python displays using hexadecimal escape notation. That's why you see an unusual symbol when you print the value.
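A quick way to convince yourself of this is to check that the string contains no backslash at all, which is also why the patterns in the question that look for a literal backslash followed by x never match. A small Python 3 sketch using the first item from the question:

```python
import re

s = 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

has_backslash = '\\' in s              # is there a literal backslash character?
literal_match = re.search(r'\\x', s)   # pattern: a backslash followed by 'x'
nul_match = re.search(r'\x00', s)      # pattern: an actual NUL character

# Each \xNN escape in the repr stands for exactly one character.
print(len(s), has_backslash, literal_match, nul_match.start())
```

The string is 13 characters long, the backslash search finds nothing, and the first NUL sits at index 4, right after 'MagX'.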

You should be able to simply remove the extra levels of escaping in your regular expression. You could also rely on the re module's generic whitespace class \s, which matches whitespace characters beyond tabs and spaces. Note, though, that \s does not match NUL (\x00), as the unchanged result below shows:

>>> import re
>>> re.sub(r'\s', '?', "foo\x00bar")
'foo\x00bar'
>>> print re.sub(r'\s', '?', "foo\x00bar")
foobar

I use this one fairly often to replace every run of input whitespace, including non-breaking space characters, with a single space:

>>> re.sub(r'[\xa0\s]+', ' ', input_str)
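As a self-contained Python 3 sketch of that substitution (the helper name squeeze_whitespace and the sample inputs are made up for illustration): on Python 3 str patterns, \s already matches the non-breaking space, so the explicit \xa0 is redundant there but harmless.

```python
import re

def squeeze_whitespace(text):
    """Hypothetical helper: collapse each run of whitespace into one space."""
    return re.sub(r'[\xa0\s]+', ' ', text)

# A non-breaking space, newline, tab and space in a row become one space.
print(repr(squeeze_whitespace('a\xa0\n\t b')))

# NUL is not whitespace, so this string comes back unchanged.
print(repr(squeeze_whitespace('foo\x00bar')))
```

This normalizes whitespace only; it leaves control characters like the ones in the question untouched.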

Answer 3

You can also do this without importing re. For example, if you're content to keep only ASCII characters:

good_string = ''.join(c if ord(c) < 128 else '?' for c in bad_string)
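Note that the one-liner above replaces non-ASCII characters but leaves control characters such as \x00 in place, because their code points are below 128. Here is a sketch of two other re-free variants in Python 3. The first assumes that everything after the first NUL is padding or garbage, which fits the sample data but is not confirmed by the question.

```python
bad_string = 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

# Variant 1: keep everything before the first NUL (C-style string truncation).
truncated = bad_string.partition('\x00')[0]

# Variant 2: keep only printable ASCII, i.e. code points 32 (space) to 126 (~).
printable = ''.join(c for c in bad_string if 32 <= ord(c) <= 126)

print(repr(truncated))
print(repr(printable))
```

Variant 1 yields 'MagX' and variant 2 yields 'MagX8', mirroring the two regex results in Answer 1.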

3 answers

Answer 1: 7, accepted, 2011-03-04 19:08:21

Answer 2: 1, 2011-03-04 19:10:42

Answer 3: 0, 2016-04-07 06:15:57
