python IO: Splitting words from a text file into a Python array, avoiding escape characters, newlines, hex values
I'm having a hard time importing and splitting words properly from a simple txt file into a Python array.
Text file:
test1 file test1 test1
test2 test2
test 3 test3, test3.
test 4, test 4.
test 5
^Ltest 6.
When doing a simple for lines in file: array.append(lines), this is the final array I receive:
['test1 file test1 test1\xc2\xa0\n', 'test2 test2\xc2\xa0\n', 'test 3 test3, test3.\n', '\n', 'test 4, test 4.\n', '\n', '\n', 'test 5\n', '\n', '\n', '^Ltest 6.\xc2\xa0\n']
I'd like it to be something like this, where I have one item per actual English word or escape character, and without the \x__ hex substrings:
['test1', 'file', 'test1', 'test1', '\n', 'test2', 'test2', '\n', 'test', '3', 'test3', 'test3', '.', '\n', '\n', 'test', '4', 'test', '4', '.', '\n', '\n', '\n', 'test', '5', '\n', '\n', '\n', 'test 6', '.', '\n']
Help would be really appreciated; thanks in advance.
Single-line solution:
[w for w in file.read().split() if re.match(r'[\w\n\.]+$',w)]
For parsing a large file, it's better to pre-compile the regex.
import re
word_ptn = re.compile(r'[\w\n\.]+$')
[w for w in file.read().split() if word_ptn.match(w)]
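To see which tokens this filter keeps, here is a minimal sketch run on an in-memory sample instead of a real file. It assumes Python 3 and models the raw file content as bytes, so that '\xc2\xa0' (the UTF-8 encoding of a non-breaking space) shows up the same way as in the question. The names raw and kept are hypothetical and not part of the original answer.

```python
import re

# Hypothetical sample standing in for file.read(); b'\xc2\xa0' is the UTF-8
# byte pair for a non-breaking space, as seen in the question's output.
raw = b"test1 file test1 test1\xc2\xa0\ntest2 test2\xc2\xa0\ntest 3 test3, test3.\n"

# Same pattern as above, written as a bytes pattern because the input is bytes.
word_ptn = re.compile(rb'[\w\n\.]+$')

# split() with no arguments splits on ASCII whitespace and discards it, so
# newlines never reach the regex. Tokens that end in a comma or in the
# non-breaking-space bytes fail the match and are dropped entirely.
kept = [w for w in raw.split() if word_ptn.match(w)]
print(kept)
```

Note that a token such as 'test3,' is discarded as a whole, not trimmed to 'test3'.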
The above regex will filter out strings like 'test1\xc2\xa0\n'. If you want to keep them, extract the matched string from the regex results:
word_ptn = re.compile(r'[\w\n\.]+')
Lp = (word_ptn.match(w) for w in file.read().split())
[m.group(0) for m in Lp if m]
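Because split() throws away the newlines, neither variant above produces the '\n' items the question asks for. One possible alternative, sketched here under the assumptions that the code runs on Python 3 and that the file is UTF-8 encoded, is to tokenize with re.findall so that words, newlines and punctuation marks each become their own item. The names TOKEN_PTN, tokenize and sample are hypothetical, and this is not the method of the original answer.

```python
import re

# One token per word, per newline, or per single punctuation character.
# Other whitespace (spaces, tabs, form feeds) matches no alternative and is skipped.
TOKEN_PTN = re.compile(r"\w+|\n|[^\w\s]")

def tokenize(text):
    """Split decoded text into words, newlines and punctuation marks."""
    # Replace the non-breaking space (U+00A0) explicitly so it can never
    # end up inside a token.
    text = text.replace("\u00a0", " ")
    return TOKEN_PTN.findall(text)

# Hypothetical sample mirroring the first lines of the question's file.
sample = ("test1 file test1 test1\u00a0\n"
          "test2 test2\u00a0\n"
          "test 3 test3, test3.\n"
          "\n"
          "test 4, test 4.\n")
tokens = tokenize(sample)
print(tokens)
```

For a real file, the text could be read with open(path, encoding="utf-8"), where path is a placeholder, and passed to tokenize. Unlike the list in the question, this sketch keeps commas as separate items; narrowing the last alternative to a literal period would drop them.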