
python IO: Splitting words from textfile into python array, avoiding escaping chars, newlines, hexvalues

I'm having a hard time importing and properly splitting words from a simple txt file into a Python array.

Txtfile:

test1 file test1 test1 
test2 test2 
test 3 test3, test3.

test 4, test 4.


test 5


^Ltest 6.

When doing a simple for lines in file: array.append(lines), this is the final array I receive:

['test1 file test1 test1\xc2\xa0\n', 'test2 test2\xc2\xa0\n', 'test 3 test3, test3.\n', '\n', 'test 4, test 4.\n', '\n', '\n', 'test 5\n', '\n', '\n', '^Ltest 6.\xc2\xa0\n']

I'd like it to be something like this, where I have one item per actual English word or escape character, and without the \x__ hex substrings:

['test1', 'file', 'test1', 'test1', '\n', 'test2', 'test2', '\n', 'test', '3', 'test3', 'test3', '.', '\n', '\n', 'test', '4', 'test', '4', '.', '\n', '\n', '\n', 'test', '5', '\n', '\n', '\n', 'test 6', '.', '\n']

Help would be really appreciated. Thanks in advance.

Single line solution:

import re  # required for re.match
[w for w in file.read().split() if re.match(r'[\w\n\.]+$', w)]

When parsing a large file, it's better to pre-compile the regex:

import re
word_ptn = re.compile(r'[\w\n\.]+$')  # compile once, reuse for every token
[w for w in file.read().split() if word_ptn.match(w)]
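To see which tokens this filter keeps and which it drops, here is a minimal sketch. It assumes Python 3 and uses a byte string to mimic the raw `\xc2\xa0` bytes shown in the question; the sample `data` and the names `kept` and `dropped` are made up for illustration.

```python
import re

# Hypothetical sample: the first lines of the text file as raw bytes,
# including the UTF-8 no-break space (\xc2\xa0) seen in the question.
data = b'test1 file test1 test1\xc2\xa0\ntest2 test2\xc2\xa0\ntest 3 test3, test3.\n'

# In a bytes pattern, \w covers only ASCII letters, digits and underscore.
word_ptn = re.compile(rb'[\w\n\.]+$')

tokens = data.split()  # bytes.split() splits on ASCII whitespace only
kept = [w for w in tokens if word_ptn.match(w)]
dropped = [w for w in tokens if not word_ptn.match(w)]

print(kept)     # tokens made up entirely of word characters and dots
print(dropped)  # tokens containing \xc2\xa0 or a comma
```

Note that a token such as `test3,` is dropped as a whole because of the trailing comma, and `test3.` is kept as one item rather than being split into the word and the full stop.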

The above regex will filter out strings like 'test1\xc2\xa0'. If you want to keep the word part of such strings, drop the trailing $ and extract the matched string from the regex results:

word_ptn = re.compile(r'[\w\n\.]+')  # no trailing $: match only the leading word part
matches = (word_ptn.match(w) for w in file.read().split())
[m.group(0) for m in matches if m]  # skip tokens that did not match at all
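Neither variant above yields separate '\n' and '.' items, because split() discards the newlines and the dot stays attached to its word. A possible alternative, not part of the original answer, is to decode the file as UTF-8 and tokenize with re.findall. This is only a sketch assuming Python 3; the sample `text` and the name `token_ptn` are hypothetical.

```python
import re

# Hypothetical sample text, as it would look after decoding the file as UTF-8
# (for example via open(path, encoding='utf-8').read()).
# \xa0 is the no-break space that showed up as \xc2\xa0 in the raw bytes.
text = 'test1 file test1 test1\xa0\ntest2 test2\xa0\ntest 3 test3, test3.\n\ntest 4, test 4.\n'

# One token per run of word characters, per full stop, or per newline.
# Everything else (spaces, no-break spaces, commas) is skipped.
token_ptn = re.compile(r'\w+|\.|\n')
tokens = token_ptn.findall(text)

print(tokens)
```

With this pattern the no-break space never reaches the result, since it is neither a word character, a dot, nor a newline.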
