I was trying to remove duplicate lines from my file when I noticed that the following two lines were not merged:
word1 word2
word1 word2
I did not understand why these lines were not combined, so I opened the file in vim and used :set list
to see whether it contained any special characters. I found this:
word1 <feff>word2
word1 word2
I am not sure how to clean this up in Python. Any suggestions on what this character might be and how it can be removed?
U+FEFF is the Byte Order Mark character, which should only occur at the start of a document. When it appears anywhere else in a document, it should be treated as a ZERO WIDTH NO-BREAK SPACE. If this causes issues, you can remove it like any other character:
>>> s = u'word1 \ufeffword2'
>>> s = s.replace(u'\ufeff', '')
>>> s
u'word1 word2'
(In Python 3, drop the u in front of the strings; the prefix is optional from 3.3 onward and a syntax error in 3.0 to 3.2.)
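Applied to the original task, the same replacement can be done while reading the lines, before duplicates are dropped. This is only a minimal sketch, assuming Python 3 and text that has already been decoded; the helper name `unique_clean_lines` and the sample data are made up for illustration:

```python
import io

BOM = "\ufeff"  # U+FEFF: byte order mark / zero width no-break space


def unique_clean_lines(lines):
    """Yield each line once, in first-seen order, with U+FEFF removed."""
    seen = set()
    for line in lines:
        cleaned = line.replace(BOM, "").rstrip("\n")
        if cleaned not in seen:
            seen.add(cleaned)
            yield cleaned


# In-memory stand-in for a real file, mirroring the lines in the question.
sample = io.StringIO("word1 \ufeffword2\nword1 word2\n")
result = list(unique_clean_lines(sample))
print(result)
```

For a real UTF-8 file, opening it with `open(path, encoding="utf-8-sig")` drops a BOM at the very start of the file, but a U+FEFF in the middle of a line, as in the question, still needs the explicit `replace`.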
Have you tried mytext.split(string.whitespace)?
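As a quick check of `mytext.split(string.whitespace)`, the sketch below (Python 3, with illustrative variable names) compares it with a plain `split()`. Python does not classify U+FEFF as whitespace, so splitting alone appears to leave the character attached to `word2`:

```python
import string

mytext = "word1 \ufeffword2"

# Passing string.whitespace uses the whole six-character string as one
# separator, which does not occur in mytext, so nothing is split.
by_constant = mytext.split(string.whitespace)

# split() with no argument splits on runs of whitespace, but U+FEFF is a
# format character (category Cf), not whitespace, so it stays in place.
by_default = mytext.split()

# Removing U+FEFF first gives the expected two words.
cleaned = mytext.replace("\ufeff", "").split()

print(by_constant, by_default, cleaned)
```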