Python Unicode Normalize in list of words
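I am preprocessing a list of words from a file. I am struggling to remove the accents, because the Unicode normalizer only works on strings. I get the following error:
TypeError: normalize() argument 2 must be str, not list
Is there a way to remove the accents from the whole list?
Thanks a lot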
import string
import nltk
from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer
from nltk.corpus import stopwords
stopwords = stopwords.words('french')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
french_stopwords = nltk.corpus.stopwords.words('french')
from unicodedata import normalize
lemmatizer = FrenchLefffLemmatizer()
def preprocessing(affaires):
    preprocess_list = []
    for sentence in affaires:
        sentence_w_punct = "".join([i.lower() for i in sentence if i not in string.punctuation])
        tokenize_sentence = nltk.tokenize.word_tokenize(sentence_w_punct)
        words_w_stopwords = [i for i in tokenize_sentence if i not in french_stopwords]
        no_accent = ''.join(c for c in unicodedata.normalize('NFD', words_w_stopwords)
                            if unicodedata.category(c) != 'Mn')
        remove_parasites = [j for j in no_accent if j not in parasites]
        words_lemmatize = (lemmatizer.lemmatize(w) for w in remove_parasites)
        sentence_clean = ' '.join(words_lemmatize)
        preprocess_list.append(sentence_clean)
    return preprocess_list
df["nom_affaire_clean"] = preprocessing(df["nom_affaire"])
cln = df.pop("nom_affaire_clean")
df.insert(1, 'nom_affaire_clean', cln )
df
unicodedata.normalize accepts a single str, not a list, so enumerate the list and convert each word:
import unicodedata as ud
words = '''âcre âge âgé arriéré arrière bronzé collé congrès coté côte côté crêpe
crêpé cure curé dès différent diffèrent entré mémé même pâte pâté péché
pêche pécher pêcher pécheur pêcheur prête prêté relâche relâché retraité
sublimé vôtre'''.split()
for index, word in enumerate(words):
    words[index] = ''.join(c for c in ud.normalize('NFD', word) if ud.category(c) != 'Mn')
print(words)
Output:
['acre', 'age', 'age', 'arriere', 'arriere', 'bronze', 'colle', 'congres', 'cote', 'cote', 'cote', 'crepe', 'crepe', 'cure', 'cure', 'des', 'different', 'different', 'entre', 'meme', 'meme', 'pate', 'pate', 'peche', 'peche', 'pecher', 'pecher', 'pecheur', 'pecheur', 'prete', 'prete', 'relache', 'relache', 'retraite', 'sublime', 'votre']
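To use this inside the question's preprocessing function, one option is to wrap the per-word conversion in a small helper and map it over the token list, so the following steps still receive a list of words rather than one joined string. This is only a minimal sketch: the names strip_accents and strip_accents_all are hypothetical and do not come from the original post, and the sample tokens are made up for illustration.

```python
import unicodedata


def strip_accents(word):
    """Return `word` with combining marks (category 'Mn') removed."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def strip_accents_all(words):
    """Apply strip_accents to every token and return a new list."""
    return [strip_accents(w) for w in words]


# Hypothetical tokens standing in for words_w_stopwords from the question.
tokens = ["pêcheur", "âgé", "côté"]
print(strip_accents_all(tokens))  # ['pecheur', 'age', 'cote']
```

In the question's loop, the line that builds no_accent could then become something like no_accent = strip_accents_all(words_w_stopwords), assuming unicodedata is imported as a module rather than only its normalize function.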