I'm having a hard time trying to encode a Python list. I already did it with a text file in order to count specific words inside it, using the re module.
This is the code:
import codecs
import re
from collections import Counter

# Compile the pattern once: words of 4 to 20 characters
unicode_pattern = re.compile(r'\b\w{4,20}\b', re.UNICODE)
word_counts = Counter()  # Dictionary-like mapping of word -> count

# Decode the text file as UTF-8 while reading it
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
    for line in f:
        # Using re module to extract specific words
        word_counts.update(unicode_pattern.findall(line))

Allwords = []
for clave in word_counts:
    if word_counts[clave] >= 10:  # We look for the most repeated words
        word = clave
        Allwords.append(word)
print Allwords
Part of the output looks like this:
[...u'recursos', u'Partidos', u'Constituci\xf3n', u'veh\xedculos', u'investigaci\xf3n', u'Pol\xedticos']
If I print the variable word, the output looks as it should. However, when I use append, all the words break again, as in the example above.
I tried this:
[x.encode("utf-8") for x in Allwords]
The output looks exactly the same as before.
I also tried this:
Allwords.append(str(word.encode("utf-8")))
The output changes, but the words still don't look as they should:
[...'recursos', 'Partidos', 'Constituci\xc3\xb3n', 'veh\xc3\xadculos', 'investigaci\xc3\xb3n', 'Pol\xc3\xadticos']
Some of the answers have given this example:
print('[' + ', '.join(Allwords) + ']')
The output looks like this:
[...recursos, Partidos, Constitución, vehÃculos, investigación, PolÃticos]
To be honest I do not want to print the list, just encode it, so that all items (words) are recognized.
I'm looking for something like this:
[...'recursos', 'Partidos', 'Constitución', 'vehículos', 'investigación', 'Políticos']
Any suggestions to solve the problem are appreciated.
Thanks.
You might want to try:
print('[' + ', '.join(Allwords) + ']')
Your Unicode string list is correct. When you print a list, the items in it are displayed using their repr(). When you print the items themselves, they are displayed using their str(). It is only a display option, similar to printing integers as decimal or hexadecimal.
So print the individual words if you want to see them correctly; for comparisons, the content is already correct.
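As a rough illustration of that point, here is a minimal sketch. The names all_words, has_word, display and encoded are made up for this example, and the three words are copied from the question's output. It is written with escapes only, so it should behave the same on Python 2.7 and Python 3.

```python
# Hypothetical sample: three of the words from the question's output.
all_words = [u'recursos', u'Constituci\xf3n', u'veh\xedculos']

# \xf3 is a single character (o with an acute accent), not four,
# so the word has the same length as its unaccented spelling.
same_length = len(all_words[1]) == len(u'Constitucion')

# Comparisons and membership tests see the real characters.
has_word = u'Constituci\u00f3n' in all_words

# Joining the items builds one text string that displays the
# characters themselves instead of each item's repr().
display = u'[' + u', '.join(all_words) + u']'

# Encoding to bytes is only needed at a boundary (file, socket).
encoded = [w.encode('utf-8') for w in all_words]
```

So the list already holds the right text. The escapes only appear when the list as a whole is shown, and encoding each item to UTF-8 turns the words into bytes rather than making them "more correct".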
It's worth noting that Python 3 changes the behavior of repr(): it now displays non-ASCII characters without escape codes if the terminal supports them directly, and the ascii() function reproduces the Python 2 repr() behavior.
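A small Python 3 sketch of that difference, assuming two of the words from the question (the variable names are only for illustration):

```python
# Python 3 only: str is already Unicode, so no u'' prefix is needed.
words = ['Constituci\xf3n', 'veh\xedculos']

# repr() of the list keeps printable non-ASCII characters as they are.
py3_repr = repr(words)

# ascii() escapes every non-ASCII character, like repr() did in Python 2
# (minus the u'' prefix).
py2_style = ascii(words)
```

Here py3_repr contains the accented letters themselves, while py2_style contains the backslash escapes \xf3 and \xed.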