Text Preprocessing Translation Error Python

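I am trying to translate tweet text using deep translator, but I ran into a problem. Before translating the text, I do some text preprocessing such as cleaning, removing emoji, etc. These are the functions I defined for preprocessing: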

import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from deep_translator import GoogleTranslator

def deEmojify(text): # Remove emoji and other pictographic symbols
    regrex_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range; it also covers CJK characters
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return regrex_pattern.sub(r'', text)

def cleaningText(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text) # remove mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text) # remove hashtag
    text = re.sub(r'RT[\s]', '', text) # remove RT
    text = re.sub(r"http\S+", '', text) # remove link
    text = re.sub(r"[!@#$]", '', text) # remove the characters ! @ # $
    text = re.sub(r'[0-9]+', '', text) # remove numbers

    text = text.replace('\n', ' ') # replace new line with space
    text = text.translate(str.maketrans('', '', string.punctuation)) # remove ASCII punctuation only; characters such as '…' are kept
    text = text.strip(' ') # remove spaces from both ends of the text
    return text

def casefoldingText(text): # Converting all the characters in a text into lower case
    text = text.lower() 
    return text

def tokenizingText(text): # Tokenizing or splitting a string, text into a list of tokens
    text = word_tokenize(text) 
    return text

def filteringText(text): # Remove stopwords from a list of tokens
    listStopwords = set(stopwords.words('indonesian'))
    filtered = []
    for txt in text:
        if txt not in listStopwords:
            filtered.append(txt)
    text = filtered
    return text

def stemmingText(text): # Reduce each word to its stem by stripping prefixes and suffixes
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    text = [stemmer.stem(word) for word in text]
    return text

def convert_eng(text): # Translate a list of strings to English
    text = GoogleTranslator(source='auto', target='en').translate_batch(text)
    return text

This is the translation function:

def convert_eng(text):
    text = GoogleTranslator(source='auto', target='en').translate(text)
    return text

This is an example of the expected result (Indonesian text):

text = '@jshuahaee Ketemu agnes mo lagii😍😍'
clean = cleaningText(text)
print('After cleaning ==> ', clean)
emoji = deEmojify(clean)
print('After emoji ==> ', emoji)
cf = casefoldingText(emoji)
print('After case folding ==> ', cf)
token = tokenizingText(cf)
print('After token ==> ', token)
filter= filteringText(token)
print('After filter ==> ', filter)

stem = stemmingText(filter)
print('After Stem ==> ', stem)

en = convert_eng(stem)
print('After translate ==> ', en)

Result:

After cleaning ==>  Ketemu agnes mo lagii😍😍
After emoji ==>  Ketemu agnes mo lagii
After case folding ==>  ketemu agnes mo lagii
After token ==>  ['ketemu', 'agnes', 'mo', 'lagii']
After filter ==>  ['ketemu', 'agnes', 'mo', 'lagii']
After Stem ==>  ['ketemu', 'agnes', 'mo', 'lagi']
After translate ==>  ['meet', 'agnes', 'mo', 'again']

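However, I ran into a problem when the sentence contains some dots: an error occurs when the list contains '' (an empty string) after stemming. I don't know what to call it.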

text = 'News update Meski kurang diaspirasi Shoppee yg korea minded  dalam waktu indonesa belaja di bulan November Lazada 1… '
clean = cleaningText(text)
print('After cleaning ==> ', clean)
emoji = deEmojify(clean)
print('After emoji ==> ', emoji)
cf = casefoldingText(emoji)
print('After case folding ==> ', cf)
token = tokenizingText(cf)
print('After token ==> ', token)
filter= filteringText(token)
print('After filter ==> ', filter)

stem = stemmingText(filter)
print('After Stem ==> ', stem)

en = convert_eng(stem)
print('After translate ==> ', en)

Result:

After cleaning ==>  News update Meski kurang diaspirasi Shoppee yg korea minded  dalam waktu indonesa belaja di bulan November Lazada …
After emoji ==>  News update Meski kurang diaspirasi Shoppee yg korea minded  dalam waktu indonesa belaja di bulan November Lazada …
After case folding ==>  news update meski kurang diaspirasi shoppee yg korea minded  dalam waktu indonesa belaja di bulan november lazada …
After token ==>  ['news', 'update', 'meski', 'kurang', 'diaspirasi', 'shoppee', 'yg', 'korea', 'minded', 'dalam', 'waktu', 'indonesa', 'belaja', 'di', 'bulan', 'november', 'lazada', '…']
After filter ==>  ['news', 'update', 'diaspirasi', 'shoppee', 'yg', 'korea', 'minded', 'indonesa', 'belaja', 'november', 'lazada', '…']
After Stem ==>  ['news', 'update', 'aspirasi', 'shoppee', 'yg', 'korea', 'minded', 'indonesa', 'baja', 'november', 'lazada', '']

This is the error message:

NotValidPayload                           Traceback (most recent call last)
<ipython-input-40-cb9390422d3c> in <module>
     14 print('After Stem ==> ', stem)
     15 
---> 16 en = convert_eng(stem)
     17 print('After translate ==> ', en)

<ipython-input-28-28bc36c96914> in convert_eng(text)
      8     return text
      9 def convert_eng(text):
---> 10     text = GoogleTranslator(source='auto', target='en').translate_batch(text)
     11     return text

C:\Python\lib\site-packages\deep_translator\google_trans.py in translate_batch(self, batch, **kwargs)
    195         for i, text in enumerate(batch):
    196 
--> 197             translated = self.translate(text, **kwargs)
    198             arr.append(translated)
    199         return arr

C:\Python\lib\site-packages\deep_translator\google_trans.py in translate(self, text, **kwargs)
    108         """
    109 
--> 110         if self._validate_payload(text):
    111             text = text.strip()
    112 

C:\Python\lib\site-packages\deep_translator\parent.py in _validate_payload(payload, min_chars, max_chars)
     44 
     45         if not payload or not isinstance(payload, str) or not payload.strip() or payload.isdigit():
---> 46             raise NotValidPayload(payload)
     47 
     48         # check if payload contains only symbols

NotValidPayload:  --> text must be a valid text with maximum 5000 character, otherwise it cannot be translated

My idea is to remove the '', which I think is the problem, but I don't know how to do that. Anyone, please help me.

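You need to introduce some error checking in your code and only process the expected data types. Your convert_eng function expects a non-empty string as its argument (see the if not payload or not isinstance(payload, str) or not payload.strip() or payload.isdigit(): part of the traceback), and your stem contains an empty string as the last item in the list.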
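As a rough sketch of that check, the condition from the traceback can be mirrored in a small predicate and applied before translating. The name is_translatable is made up for this example, and it only mirrors the one line visible in the traceback; the library may apply further checks that are not shown there.

```python
def is_translatable(token):
    """Return True if token passes the payload condition shown in the traceback.

    Hypothetical helper: it mirrors only the visible check, nothing more.
    """
    return (
        isinstance(token, str)      # must be a string
        and bool(token.strip())     # must not be empty or whitespace-only
        and not token.isdigit()     # must not consist of digits only
    )


# Shortened version of the stemmed list from the question.
stem = ['news', 'update', 'aspirasi', 'lazada', '']
valid_tokens = [token for token in stem if is_translatable(token)]
print(valid_tokens)  # ['news', 'update', 'aspirasi', 'lazada']
```

Only valid_tokens would then be passed on to the translator.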

Also, filteringText(text) may return [] because all of the words may turn out to be stopwords. And do not use filter as a variable name; it is a built-in.
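A small sketch of both points, assuming a made-up helper name translate_ready: because no variable shadows the built-in filter(), it can be used to drop empty strings, and an empty result is reported as None so the caller can skip translation.

```python
def translate_ready(tokens):
    """Drop empty strings; return None when nothing is left to translate.

    Hypothetical helper, not part of any library.
    """
    # filter(None, ...) keeps only truthy items, so '' is dropped.
    kept = list(filter(None, tokens))
    return kept or None


print(translate_ready(['lazada', '']))  # ['lazada']
print(translate_ready([]))              # None
print(translate_ready(['']))            # None
```

Note that a whitespace-only string such as ' ' is truthy and would survive this filter, so an explicit strip() check is still needed if such items can occur.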

So, change

filter= filteringText(token)
print('After filter ==> ', filter)

stem = stemmingText(filter)
print('After Stem ==> ', stem)

to

filter1 = filteringText(token)
print('After filter ==> ', filter1)

if filter1:
    stem = stemmingText(filter1)
    print('After Stem ==> ', stem)
    en = [] # translated items
    for i in stem: # stem is a list of strings
        if len(i.strip()) > 0:        # Skip empty or whitespace-only items
            en.append(convert_eng(i)) # Translate and append to 'en' (uses the convert_eng that calls translate on a single string)

    print('After translate ==> ', en)
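The empty string can also be avoided at the source. In the question's output the trailing '…' survives cleaningText, because string.punctuation contains only ASCII punctuation, and it then turns into '' during stemming. A minimal sketch, assuming a made-up helper strip_unicode_punctuation that would run alongside the existing cleaning steps:

```python
import string
import unicodedata


def strip_unicode_punctuation(text):
    """Remove every character whose Unicode category starts with 'P'.

    Hypothetical helper: it complements, rather than replaces, the
    string.punctuation step, because ASCII symbols such as '$' or '+'
    belong to symbol categories and are not removed here.
    """
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )


sample = 'november lazada …'
print('…' in string.punctuation)                  # False
print(strip_unicode_punctuation(sample).split())  # ['november', 'lazada']
```

With the ellipsis gone before tokenizing, no '…' token reaches the stemmer in the first place. Keeping the emptiness check before translation is still a sensible safety net.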
