How can I preserve the French special characters when transforming a text into a wordlist?
I am using a snippet of code for a Markov chain sentence generator. It works fine in English, but in French it doesn't print the special characters (é, è, etc.).
This is the part that reads a file and creates a wordlist from it. I use the print statements as checks: print(text) shows the special characters, but once the words are added to the wordlist, they disappear.
import re

def wordlist(filename):
    f = open(filename, mode='r')
    text = f.read()
    print(text)
    wordlist = [fixCaps(w) for w in re.findall(r"[\w']+|[.,!?;]", text)]
    print(wordlist)
    f.close()
    return wordlist
How can I preserve the special characters when creating the word list? (I am running this on Windows 7 with Python 2.x.)
Example of output:
Permettez-moi d'inscrire votre nom en tête de ce livre et au-
dessus même de sa dédicace; car c'est à vous, surtout, que j'en
dois la publication. En passant par votre magnifique plaidoirie,
mon oeuvre a acquis pour moi-même comme une autorité imprévue.
Acceptez donc ici l'hommage de ma gratitude, qui, si grande
qu'elle puisse être, ne sera jamais à la hauteur de votre
éloquence et de votre dévouement.
['Permettez', 'moi', "d'inscrire", 'votre', 'nom', 'en', 't', 'te', 'de', 'ce', 'livre', 'et', 'au', 'dessus', 'm', 'me', 'de', 'sa', 'd', 'dicace', ';', 'car', "c'est", 'vous', ',', 'surtout', ',', 'que', "j'en", 'dois', 'la', 'publication', '.', 'En', 'passant', 'par', 'votre', 'magnifique', 'plaidoirie', ',', 'mon', 'oeuvre', 'a', 'acquis', 'pour', 'moi', 'm', 'me', 'comme', 'une', 'autorit', 'impr', 'vue', '.', 'Acceptez', 'donc', 'ici', "l'hommage", 'de', 'ma', 'gratitude', ',', 'qui', ',', 'si', 'grande', "qu'elle", 'puisse', 'tre', ',', 'ne', 'sera', 'jamais', 'la', 'hauteur', 'de', 'votre', 'loquence', 'et', 'de', 'votre', 'd', 'vouement', '.']
En passant par votre magnifique plaidoirie, mon oeuvre a acquis pour moi m me comme une autorit impr vue.
Thanks
The accented characters do not actually disappear; they are just not matched by your expression:
wordlist = [fixCaps(w) for w in re.findall(r"[\w']+|[.,!?;]", text)]
The escape \w matches "word characters", but what actually counts as a word character varies per regex implementation:
\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included.
(https://www.regular-expressions.info/shorthand.html)
By default, Python 2.7's \w matches only the basic limited set, but you can add flags to ask for more:
\w
    When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
(https://docs.python.org/2/library/re.html)
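The difference between the two character sets can be seen in a short sketch. It is written for Python 3, where the re.ASCII flag restricts \w to [a-zA-Z0-9_], the same set Python 2.7 uses by default; the sample string is just a fragment of the example text and is used only for illustration.

```python
import re

# Fragment of the example text from the question.
sample = "au-dessus même de sa dédicace"

pattern = r"[\w']+|[.,!?;]"

# re.ASCII limits \w to [a-zA-Z0-9_], like Python 2.7 without flags.
ascii_words = re.findall(pattern, sample, re.ASCII)

# re.UNICODE (already the default for str patterns in Python 3) lets \w
# match anything Unicode classifies as alphanumeric, including é and ê.
unicode_words = re.findall(pattern, sample, re.UNICODE)

print(ascii_words)
print(unicode_words)
```

With the ASCII-only set, each accented letter acts as a separator, which is how 'même' turns into 'm' and 'me'.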
This suggests the following code (slightly adjusted to not use a file; it's the regex that makes the difference):
import re

def wordlist(text):
    # re.UNICODE makes \w follow the Unicode character database
    regex = re.compile(r"[\w']+|[.,!?;]", re.UNICODE)
    print(text)
    wordlist = [fixCaps(w) for w in re.findall(regex, text)]
    return wordlist
and indeed now accented characters get included:
['Permettez', 'moi', "d'inscrire", 'votre', 'nom', 'en', 't\xc3\xaate', 'de', 'ce', 'livre',
'et', 'au', 'dessus', 'm\xc3\xaame', 'de', 'sa', 'd\xc3', 'dicace', ';', 'car', "c'est",
...
# etc.
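Note that 'dédicace' still comes out as 'd\xc3' and 'dicace' in this output. A likely explanation, assuming the input is UTF-8 encoded, is that the pattern is being applied to a byte string: é is stored as the two bytes \xc3\xa9, and each byte is judged on its own, so \xc3 counts as a letter while \xa9 does not. The sketch below approximates that effect in Python 3 by decoding the UTF-8 bytes as Latin-1 (one character per byte), and then shows that decoding as UTF-8 first keeps the words whole.

```python
import re

pattern = re.compile(r"[\w']+|[.,!?;]", re.UNICODE)

# UTF-8 bytes as they would be read from the file (assumed encoding).
raw = "tête de sa dédicace".encode("utf-8")

# One character per byte, mimicking a match against a byte string.
per_byte = pattern.findall(raw.decode("latin-1"))

# Decode properly first, so each accented letter is a single character.
decoded = pattern.findall(raw.decode("utf-8"))

print(per_byte)
print(decoded)
```

In Python 2 the corresponding step would be to call text.decode('utf-8') on the string read from the file before passing it to re.findall, again assuming the file really is UTF-8.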
Python 3 and onwards have better Unicode support, so you don't have to use that flag and your original code works as expected with \w:
['Permettez', 'moi', "d'inscrire", 'votre', 'nom', 'en', 'tête', 'de',
'ce', 'livre', 'et', 'au', 'dessus', 'même', 'de', 'sa', 'dédicace', ';',
...
(where an added bonus is that accented characters are not output as \xnn escapes).