Python NLTK - Tokenizing sentences into words while removing numbers

Question

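Hoping someone can help. I have a list of sentences read from a text file, and I am trying to tokenize the sentences into words while also removing the sentences that contain only a number. There is no pattern to where the numbers appear.

The sentences I have: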

[
  ['                    1'], 
  ['This is a text file,'], 
  ['to keep the words,'],
  ['                    2'],
  ['Another line of the text:'],
  ['                    3']
]

Desired output:

[
  ['This', 'is', 'a', 'text', 'file,'], 
  ['to', 'keep', 'the', 'words,'],
  ['Another', 'line', 'of', 'the', 'text:'],
]

Answer 1

Filter out the number-only lines first, then tokenize what remains:

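import re

a = [
    ['                    1'],
    ['This is a text file,'],
    ['to keep the words,'],
    ['                    2'],
    ['Another line of the text:'],
    ['                    3']
]


def is_number_only(string):
    # True when the line contains nothing but digits once whitespace is stripped
    return re.fullmatch(r'\d+', string.strip()) is not None


# Keep sentences that are neither empty nor number-only, so digits inside
# real sentences are preserved, then split each sentence on whitespace
filtered = [i[0] for i in a if i[0].strip() and not is_number_only(i[0])]
tokenized = [sentence.split() for sentence in filtered]
print(tokenized)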
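
Since the question mentions that the sentences come from a text file, here is a minimal sketch of the same idea applied directly to a file, line by line. The helper names `tokenize_lines` and `tokenize_file` and the sample text are made up for illustration and are not part of the original question or answer. It assumes a line should be dropped only when it consists entirely of digits, so a sentence that merely contains a number is kept.

```python
import os
import tempfile


def tokenize_lines(lines):
    """Split each line on whitespace, skipping blank and number-only lines."""
    result = []
    for line in lines:
        stripped = line.strip()
        # Skip empty lines and lines made up solely of digits
        if not stripped or stripped.isdigit():
            continue
        result.append(stripped.split())
    return result


def tokenize_file(path):
    """Read a text file (hypothetical path) and tokenize it line by line."""
    with open(path, encoding="utf-8") as handle:
        return tokenize_lines(handle)


# Build a small throwaway file so the sketch runs on its own (sample data is invented)
sample = (
    "                    1\n"
    "This is a text file,\n"
    "to keep the words,\n"
    "                    2\n"
    "Room 101 is free\n"
    "                    3\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

tokens = tokenize_file(tmp.name)
os.remove(tmp.name)
print(tokens)
```

Plain `str.split()` keeps punctuation attached to the word ('file,'), which matches the desired output in the question. NLTK's `word_tokenize` would instead emit trailing punctuation as separate tokens, so it is probably not what is wanted here unless that behavior is acceptable.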

Solution 1
0 2022-02-04 05:19:36