Python IO: splitting words from a text file into a Python array, avoiding escape chars, newlines, hex values

Question

I'm having a hard time importing and properly splitting words from a simple txt file into a Python array.

Txtfile:

test1 file test1 test1 
test2 test2 
test 3 test3, test3.

test 4, test 4.


test 5


^Ltest 6.

When doing a simple for lines in file: array.append(lines), this is the final array I receive:

['test1 file test1 test1\xc2\xa0\n', 'test2 test2\xc2\xa0\n', 'test 3 test3, test3.\n', '\n', 'test 4, test 4.\n', '\n', '\n', 'test 5\n', '\n', '\n', '^Ltest 6.\xc2\xa0\n']

I'd like it to be something like this, where I have one item per actual English word or escape character, and without the \x__ hex substrings:

['test1', 'file', 'test1', 'test1', '\n', 'test2', 'test2', '\n', 'test', '3', 'test3', 'test3', '.', '\n', '\n', 'test', '4', 'test', '4', '.', '\n', '\n', '\n', 'test', '5', '\n', '\n', '\n', 'test 6', '.', '\n']

Help would be really appreciated, thanks in advance.

Answer 1

Single line solution:

[w for w in file.read().split() if re.match(r'[\w\n\.]+$',w)]

For parsing a large file, it's better to pre-compile the regex.

import re
word_ptn = re.compile(r'[\w\n\.]+$')
[w for w in file.read().split() if word_ptn.match(w)]

The above regex will filter out strings like 'test1\xc2\xa0\n'. If you want to keep the word part of such strings, extract the matched text from the regex results:

word_ptn = re.compile(r'[\w\n\.]+')
Lp = (word_ptn.match(w) for w in file.read().split())
[ w.group(0) for w in Lp if w ]
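
Note that file.read().split() discards the newlines, so none of the snippets above return the '\n' items the question asks for. As a rough sketch only, and assuming Python 3 with a UTF-8 encoded file, something like the following may get closer to the desired list. The names TOKEN_PTN, tokenize and tokenize_file are made up for illustration. The pattern keeps words, newlines and periods, and drops commas and non-breaking spaces, which is my reading of the expected output in the question.

```python
import re

# Hypothetical pattern: a run of word characters, a single newline,
# or a single period. Commas, spaces and non-breaking spaces match
# none of the alternatives, so findall() skips over them.
TOKEN_PTN = re.compile(r"\w+|\n|\.")


def tokenize(text):
    """Return words, periods and newlines from text as separate items."""
    return TOKEN_PTN.findall(text)


def tokenize_file(path):
    """Read a file as UTF-8 text (an assumption) and tokenize it."""
    # Decoding as UTF-8 turns the bytes \xc2\xa0 into one
    # non-breaking space character, which the pattern then ignores.
    with open(path, encoding="utf-8") as handle:
        return tokenize(handle.read())


if __name__ == "__main__":
    sample = "test1 file test1 test1\xa0\ntest 3 test3, test3.\n\n"
    print(tokenize(sample))
```

Using findall() on the whole text instead of split() is what keeps the newlines. If you also want commas as items, adding "|," to the pattern should be enough.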
