Python Unicode normalization on a list of words

Question

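I am preprocessing a list of words from a file. I am struggling to remove the accents, because the Unicode normalizer only works on strings. I get the following error:

TypeError: normalize() argument 2 must be str, not list

Is there a way to remove the accents from the whole list?

Thanks a lot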

import string
import nltk
from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer
from nltk.corpus import stopwords
stopwords = stopwords.words('french')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
french_stopwords = nltk.corpus.stopwords.words('french')
import unicodedata
lemmatizer = FrenchLefffLemmatizer()

def preprocessing(affaires):
    preprocess_list = []
    for sentence in affaires :
        sentence_w_punct = "".join([i.lower() for i in sentence if i not in string.punctuation])
        tokenize_sentence = nltk.tokenize.word_tokenize(sentence_w_punct)
        words_w_stopwords = [i for i in tokenize_sentence if i not in french_stopwords]
        no_accent = ''.join(c for c in unicodedata.normalize('NFD', words_w_stopwords)
                  if unicodedata.category(c) != 'Mn')  
        remove_parasites = [j for j in no_accent if j not in parasites]
        words_lemmatize = (lemmatizer.lemmatize(w) for w in remove_parasites)
        sentence_clean = ' '.join(words_lemmatize)
        preprocess_list.append(sentence_clean)

    return preprocess_list

df["nom_affaire_clean"] = preprocessing(df["nom_affaire"])

cln = df.pop("nom_affaire_clean")
df.insert(1, 'nom_affaire_clean', cln )
df

Answer 1

unicodedata.normalize does not work on a list, so enumerate the list and convert each word:

import unicodedata as ud

words = '''âcre âge âgé arriéré arrière bronzé collé congrès coté côte côté crêpe
           crêpé cure curé dès différent diffèrent entré mémé même pâte pâté péché
           pêche pécher pêcher pécheur pêcheur prête prêté relâche relâché retraité
           sublimé vôtre'''.split()

for index, word in enumerate(words):
    words[index] = ''.join(c for c in ud.normalize('NFD', word) if ud.category(c) != 'Mn')

print(words)

Output:

['acre', 'age', 'age', 'arriere', 'arriere', 'bronze', 'colle', 'congres', 'cote', 'cote', 'cote', 'crepe', 'crepe', 'cure', 'cure', 'des', 'different', 'different', 'entre', 'meme', 'meme', 'pate', 'pate', 'peche', 'peche', 'pecher', 'pecher', 'pecheur', 'pecheur', 'prete', 'prete', 'relache', 'relache', 'retraite', 'sublime', 'votre']
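
The same idea can be wrapped in a small helper and applied with a list comprehension, which may fit more naturally into the preprocessing function from the question. This is only a sketch: strip_accents and the sample tokens are names made up for illustration, not part of the original code.

```python
import unicodedata as ud

def strip_accents(word):
    """Return word with combining marks (category 'Mn') removed."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = ud.normalize('NFD', word)
    return ''.join(c for c in decomposed if ud.category(c) != 'Mn')

# Hypothetical token list standing in for words_w_stopwords.
tokens = ['côté', 'pêcheur', 'différent']
no_accent = [strip_accents(t) for t in tokens]
print(no_accent)
```

In the question's function, the line that builds no_accent would then become something like no_accent = [strip_accents(w) for w in words_w_stopwords], so the result stays a list of words rather than one joined string.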
