How can I preserve the French special characters when transforming a text into a wordlist?

I am using a snippet of code for a Markov chain sentence generator. It works fine in English, but in French it doesn't print the special characters (é, è, etc.).

This is the part that reads a file and creates a wordlist from it. I use the print statements as checks: print(text) shows the special characters, but once the words are added to the wordlist, those characters disappear.

def wordlist(filename):
    f = open(filename, mode='r')
    text = f.read()
    print(text)
    wordlist = [fixCaps(w) for w in re.findall(r"[\w']+|[.,!?;]", text)]
    print(wordlist)
    f.close()
    return wordlist

How can I preserve the special characters when creating the word list? (I am running this on Windows 7 with Python 2.x.)

Example of output:

Permettez-moi d'inscrire votre nom en tête de ce livre et au-
dessus même de sa dédicace; car c'est à vous, surtout, que j'en
dois la publication. En passant par votre magnifique plaidoirie,
mon oeuvre a acquis pour moi-même comme une autorité imprévue.
Acceptez donc ici l'hommage de ma gratitude, qui, si grande
qu'elle puisse être, ne sera jamais à la hauteur de votre
éloquence et de votre dévouement.
['Permettez', 'moi', "d'inscrire", 'votre', 'nom', 'en', 't', 'te', 'de', 'ce', 'livre', 'et', 'au', 'dessus', 'm', 'me', 'de', 'sa', 'd', 'dicace', ';', 'car', "c'est", 'vous', ',', 'surtout', ',', 'que', "j'en", 'dois', 'la', 'publication', '.', 'En', 'passant', 'par', 'votre', 'magnifique', 'plaidoirie', ',', 'mon', 'oeuvre', 'a', 'acquis', 'pour', 'moi', 'm', 'me', 'comme', 'une', 'autorit', 'impr', 'vue', '.', 'Acceptez', 'donc', 'ici', "l'hommage", 'de', 'ma', 'gratitude', ',', 'qui', ',', 'si', 'grande', "qu'elle", 'puisse', 'tre', ',', 'ne', 'sera', 'jamais', 'la', 'hauteur', 'de', 'votre', 'loquence', 'et', 'de', 'votre', 'd', 'vouement', '.']
En passant par votre magnifique plaidoirie, mon oeuvre a acquis pour moi m me comme une autorit impr vue.

Thanks

The characters do not actually disappear; they are simply not matched by your expression:

wordlist = [fixCaps(w) for w in re.findall(r"[\w']+|[.,!?;]", text)]

The escape \w matches "word characters", but what counts as a word character varies between regex implementations:

\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included.
(https://www.regular-expressions.info/shorthand.html)

By default, Python 2.7's \w matches only this basic ASCII set, but you can add flags to ask for more:

\w
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
(https://docs.python.org/2/library/re.html)
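The difference described in the documentation can be seen on a single word. This is only a small sketch: the variable names are made up for illustration, and the explicit class [a-zA-Z0-9_] stands in for what \w means in Python 2 when no flag is given, so that the snippet behaves the same on Python 2 and Python 3.

```python
import re

word = u"t\u00eate"  # the French word "tête" as Unicode text

# ASCII-only class: what Python 2's \w matches when no flag is given.
ascii_only = re.findall(r"[a-zA-Z0-9_]+", word)

# \w with re.UNICODE: accented letters count as word characters.
unicode_aware = re.findall(r"\w+", word, re.UNICODE)

print(ascii_only)     # the word is cut in two at the "ê"
print(unicode_aware)  # the word is kept whole
```

The first result loses the "ê" exactly as in the question's output; the second keeps it.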

This suggests the following code (slightly adjusted to not use a file; it's the regex that makes the difference):

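import re

def wordlist(text):
    # re.UNICODE makes \w match letters beyond [a-zA-Z0-9_]
    regex = re.compile(r"[\w']+|[.,!?;]", re.UNICODE)
    print(text)
    # fixCaps is the helper from the question's own code
    wordlist = [fixCaps(w) for w in regex.findall(text)]
    return wordlist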

and indeed accented characters now get included:

['Permettez', 'moi', "d'inscrire", 'votre', 'nom', 'en', 't\xc3\xaate', 'de', 'ce', 'livre',
 'et', 'au', 'dessus', 'm\xc3\xaame', 'de', 'sa', 'd\xc3', 'dicace', ';', 'car', "c'est",
...
# etc.
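Note that the listing above still contains 'd\xc3', 'dicace'. The most likely reason is that text is still a byte string: in UTF-8, "é" is the two bytes \xc3\xa9, and the second byte is not a word character on its own. Decoding the bytes to Unicode text before matching avoids this. The sketch below assumes the input is UTF-8 (which the \xc3\xaa sequences in the output suggest); the variable names are illustrative only.

```python
import re

# UTF-8 bytes, as f.read() returns them for a French text file in Python 2.
raw = u"d\u00e9dicace".encode("utf-8")  # 'd\xc3\xa9dicace'

# Matching on bytes: \w is tested byte by byte, so the two-byte "é" breaks the word.
byte_tokens = re.findall(br"[\w']+", raw)

# Matching on decoded text: "é" is one letter and the word stays whole.
text = raw.decode("utf-8")
text_tokens = re.findall(r"[\w']+", text, re.UNICODE)

print(byte_tokens)
print(text_tokens)
```

On decoded text the word comes back in one piece, whereas the byte version yields two fragments.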

Python 3 has better Unicode support: str patterns are Unicode-aware by default, so you don't have to use that flag, and your original code works as expected with \w:

['Permettez', 'moi', "d'inscrire", 'votre', 'nom', 'en', 'tête', 'de',
 'ce', 'livre', 'et', 'au', 'dessus', 'même', 'de', 'sa', 'dédicace', ';',
   ...

(where an added bonus is that accented characters do not get output as \xnn).
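For the file-reading version from the question, one possible approach is to let io.open do the decoding, which behaves the same on Python 2 and Python 3. This is a sketch under assumptions, not the asker's actual code: the name read_wordlist and the default encoding of UTF-8 are hypothetical, and the fixCaps step is left out because its definition is not shown.

```python
import io
import re

# Same pattern as in the question, compiled once with Unicode-aware \w.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]", re.UNICODE)

def read_wordlist(filename, encoding="utf-8"):
    """Return the words and punctuation marks found in a text file."""
    # io.open decodes while reading, so `text` is Unicode text, not raw bytes.
    # The `with` block closes the file even if reading fails.
    with io.open(filename, mode="r", encoding=encoding) as f:
        text = f.read()
    return TOKEN_RE.findall(text)
```

If the file was saved in another encoding (for example cp1252, common on Windows), pass that name as the encoding argument instead.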


 