NLTK punkt sentence tokenizer splitting on numeric bullets
I am using NLTK's PunktSentenceTokenizer to split paragraphs into sentences. I have paragraphs like the following:
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
Output: ['1.', 'Candidate is very poor in mathematics.', '2.', 'Interpersonal skills are good.', '3.', 'Very enthusiastic about social work']
I tried to add sentence starters using the code below, but that did not work either:
from nltk.tokenize.punkt import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()
tokenizer._params.sent_starters.add('1.')
I would really appreciate it if anybody could point me in the right direction.
Thanks in advance :)
Regular expressions can provide a solution to this type of problem, as illustrated by the code below:
import re

paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"

# A candidate sentence ends at a period or at the end of the string:
reSentenceEnd = re.compile(r"\.|$")
reAtLeastTwoLetters = re.compile("[a-zA-Z]{2}")

previousMatch = 0
sentenceStart = 0
end = len(paragraphs)

while True:
    candidateSentenceEnd = reSentenceEnd.search(paragraphs, previousMatch)
    # A sentence must contain at least two consecutive letters:
    if reAtLeastTwoLetters.search(paragraphs[sentenceStart:candidateSentenceEnd.end()]):
        print(paragraphs[sentenceStart:candidateSentenceEnd.end()])
        sentenceStart = candidateSentenceEnd.end()
    if candidateSentenceEnd.end() == end:
        break
    previousMatch = candidateSentenceEnd.start() + 1
The output is:
- Candidate is very poor in mathematics.
- Interpersonal skills are good.
- Very enthusiastic about social work
Many tokenizers, including NLTK's and spaCy's, can work with regular expressions. Adapting this code to their frameworks might not be trivial, though.
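As a side note: since the bullets in this particular input always follow a fixed "number, dot, space" shape, a shorter sketch is to split directly on that pattern with re.split. This is an assumption tailored to the example input (it also discards the bullet numbers, which may or may not be what you want):

```python
import re

paragraphs = ("1. Candidate is very poor in mathematics. "
              "2. Interpersonal skills are good. "
              "3. Very enthusiastic about social work")

# Split wherever a numeric bullet such as "2. " begins, then drop the
# empty string produced before the very first bullet.
sentences = [s for s in re.split(r"\s*\d+\.\s+", paragraphs) if s]
print(sentences)
```

This prints ['Candidate is very poor in mathematics.', 'Interpersonal skills are good.', 'Very enthusiastic about social work']. It would misbehave on text where a digit followed by a period appears mid-sentence (e.g. "version 2. 0"), so treat it as a quick workaround rather than a general tokenizer.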