Python NLTK - Tokenize sentences into words while removing numbers

Hoping someone can assist with this. I have a list of sentences read from a text file, and I am trying to tokenize the sentences into words while also removing the sentences that contain only numbers. There is no pattern to where the numbers appear.

The sentences I have:

[
  ['                    1'], 
  ['This is a text file,'], 
  ['to keep the words,'],
  ['                    2'],
  ['Another line of the text:'],
  ['                    3']
]

Desired output:

[
  ['This', 'is', 'a', 'text', 'file,'], 
  ['to', 'keep', 'the', 'words,'],
  ['Another', 'line', 'of', 'the', 'text:'],
]

After some preprocessing you can apply the tokenizing step: strip the digits from each line, discard the lines that are left empty, then split the rest on whitespace.

import re

a = [
    ['                    1'],
    ['This is a text file,'],
    ['to keep the words,'],
    ['                    2'],
    ['Another line of the text:'],
    ['                    3']
]


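def replace_digit(string):
    # Remove every digit, then trim the surrounding whitespace.
    return re.sub(r'\d', '', string).strip()


# Strip digits from the single string held in each inner list.
process = [replace_digit(i[0]) for i in a]
# Drop the entries that are empty once their digits are gone.
filtered = filter(None, process)
# Split each remaining sentence on whitespace.
tokenize = map(str.split, filtered)
print(list(tokenize))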
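
One caveat with the snippet above: `re.sub(r'\d', '', ...)` also deletes digits that sit inside ordinary sentences, while the question only asks to drop lines made up entirely of numbers. A minimal alternative sketch, assuming each inner list holds exactly one string as in the question, is to test the stripped text with `str.isdigit()` instead. The function name `tokenize_non_numeric` and the extra `'Chapter 4 begins here'` row are hypothetical, added only to show the difference.

```python
sentences = [
    ['                    1'],
    ['This is a text file,'],
    ['to keep the words,'],
    ['                    2'],
    ['Another line of the text:'],
    ['                    3'],
    ['Chapter 4 begins here'],  # hypothetical extra row, not in the question
]


def tokenize_non_numeric(rows):
    """Split each row's text on whitespace, skipping blank or digits-only rows."""
    result = []
    for row in rows:
        text = row[0].strip()
        # Skip rows that are empty or consist only of digits.
        if not text or text.isdigit():
            continue
        result.append(text.split())
    return result


tokens = tokenize_non_numeric(sentences)
print(tokens)
```

Splitting on whitespace keeps punctuation attached to the word (`'file,'`), which matches the desired output in the question. NLTK's `word_tokenize` would instead return the comma as a separate token, so it is only the better fit if that behaviour is what you want.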
