How to extract words from bytes using regular expressions in Python?

Question

I have some bytes:

b'\n\x1b\t\xff\xff\xff\x7f@^\x8a?\x11\x00\x00\x00@\xe8HL\xbf\x19\x00\x00\x00\x00\x95\xb0\xd9?\x127\r\xc9\xd5"=\x15\xc9\xd5"=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07Bollard0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?'
b'\n\x1b\t\x01\x00\x00\x00\xa4\x9b\xb0\xbf\x11\x01\x00\x00\xc0/\xe3\x90?\x19\x01\x00\x00\xa0U\xc4\xef?\x127\r|\x934=\x15|\x934=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07TV Series0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?'

Using regex, I want to extract the words (in this case "Movies", "Movies" and "TV Series").

What I tried:

Extract word from string Using python regex

Extracting words from a string, removing punctuation and returning a list with separated words

Python regex for finding all words in a string

Answer 1

Usually you would convert bytes into a string using the .decode() method. However, your bytes contain values that are not valid ASCII or UTF-8.
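To see why .decode() does not work here, the following minimal sketch tries to decode a short prefix of the bytes from the question. The names sample, decoded_ok and lossy are made up for illustration.

```python
# Hypothetical sample: the first few bytes of the data in the question.
sample = b'\n\x1b\t\xff\xff\xff\x7f@^\x8a?'

try:
    sample.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    # 0xff can never appear in valid UTF-8, so decoding raises here.
    decoded_ok = False
    print("decode failed:", exc)

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
# but the original byte values are lost.
lossy = sample.decode("utf-8", errors="replace")
print(repr(lossy))
```

The errors="replace" variant avoids the exception, but it is lossy, which is one reason to look at the bytes individually instead.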

My suggestion is to go through each byte and interpret it as a character (for values below 128, that is its ASCII character):

# Two adjacent bytes literals are concatenated into a single bytes object.
raw = b'\n\x1b\t\xff\xff\xff\x7f@^\x8a?\x11\x00\x00\x00@\xe8HL\xbf\x19\x00\x00\x00\x00\x95\xb0\xd9?\x127\r\xc9\xd5"=\x15\xc9\xd5"=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07Bollard0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?' b'\n\x1b\t\x01\x00\x00\x00\xa4\x9b\xb0\xbf\x11\x01\x00\x00\xc0/\xe3\x90?\x19\x01\x00\x00\xa0U\xc4\xef?\x127\r|\x934=\x15|\x934=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07TV Series0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?'
# Iterating over bytes yields ints (0-255); chr() maps each to a character.
string = "".join(chr(b) for b in raw)
print(string)
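As a side note, the byte-by-byte chr() conversion should give the same result as decoding with the "latin-1" codec, since that codec maps each byte value 0-255 to the code point with the same number. A small sketch on a shortened, hypothetical slice of the data:

```python
# Shortened slice of the question's data, for illustration only.
raw = b'\x0fMovies"\x07TV Series0\x01\xf0?'

# Byte-by-byte conversion, as in the answer.
loop_result = ""
for b in raw:
    loop_result += chr(b)

# One-step alternative: latin-1 never fails, every byte is a valid character.
decoded = raw.decode("latin-1")

print(decoded == loop_result)
```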

After that, you can use a regex to find words. It's usually a good idea to define a minimum length for a word.

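import re
# Raw string so that \W is passed to the regex engine unchanged.
for word in re.split(r'\W', string):
    # Keep only candidates longer than three characters.
    if len(word) > 3:
        print(word)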

That will give you:

Movies
Bollard0
Movies
Series0

You have not mentioned "Bollard0", but I assume that was a mistake.

If you want spaces to be part of your words, you'll need to adapt the regex. \W matches any non-word character, and a space is one of them, so "TV Series0" is split into "TV" and "Series0" (and "TV" is then dropped by the length check).
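One possible adaptation, sketched below, is to skip the string conversion and run a bytes pattern directly on the data. It assumes that a "word" is a run of at least four ASCII letters and spaces starting with a letter; that definition, the shortened sample data and the names pattern and words are assumptions for illustration, not something stated in the question.

```python
import re

# Shortened slices of the two records from the question, for illustration.
raw = (b'\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07Bollard0\x01\x11\x00'
       b'\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07TV Series0\x01\x11\x00')

# Bytes pattern (rb'...'): one ASCII letter followed by at least three
# more letters or spaces, so a space no longer ends the match.
pattern = re.compile(rb'[A-Za-z][A-Za-z ]{3,}')

# findall returns bytes objects; every match is pure ASCII by construction.
words = [m.decode("ascii") for m in pattern.findall(raw)]
print(words)
```

Because digits are excluded from the character class, the trailing "0" after "Bollard" and "TV Series" is left out. If digits should count as part of a word, add 0-9 to both classes.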

Answer 1: 0, accepted, 2020-07-01 09:25:40
