
NLTK punkt sentence tokenizer splitting on numeric bullets

I am using nltk PunktSentenceTokenizer for splitting paragraphs into sentences. I have paragraphs as follows:

paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"

Output: ['1.', 'Candidate is very poor in mathematics.', '2.', 'Interpersonal skills are good.', '3.', 'Very enthusiastic about social work']

I tried to add sentence starters using the code below, but that didn't work out either.

from nltk.tokenize.punkt import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()
tokenizer._params.sent_starters.add('1.')

I would really appreciate it if anybody could point me in the right direction.

Thanks in advance :)

The use of regular expressions can provide a solution to this type of problem, as illustrated by the code below:

paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"

import re
reSentenceEnd = re.compile(r"\.|$")           # a period, or the end of the string
reAtLeastTwoLetters = re.compile(r"[a-zA-Z]{2}")

previousMatch = 0
sentenceStart = 0
end = len(paragraphs)
while True:
    candidateSentenceEnd = reSentenceEnd.search(paragraphs, previousMatch)

    # A sentence must contain at least two consecutive letters,
    # so a bare bullet like "1." is never emitted on its own:
    if reAtLeastTwoLetters.search(paragraphs[sentenceStart:candidateSentenceEnd.end()]):
        print(paragraphs[sentenceStart:candidateSentenceEnd.end()])
        sentenceStart = candidateSentenceEnd.end()

    if candidateSentenceEnd.end() == end:
        break
    previousMatch = candidateSentenceEnd.start() + 1
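For input shaped exactly like the example, the same split can also be written more compactly with re.split and a lookahead. This is only a sketch: it assumes every bullet is a number followed by a period and a space, so the lookahead can anchor each split just before a bullet without consuming it.

```python
import re

paragraphs = ("1. Candidate is very poor in mathematics. "
              "2. Interpersonal skills are good. "
              "3. Very enthusiastic about social work")

# Split on whitespace that is immediately followed by a numeric bullet
# such as "2. "; the lookahead keeps the bullet attached to its sentence.
sentences = re.split(r'\s+(?=\d+\.\s)', paragraphs)
for s in sentences:
    print(s)
```

Because the bullet pattern is matched only inside a lookahead, the bullets themselves survive the split and stay glued to their sentences.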



The output is:

  1. Candidate is very poor in mathematics.
  2. Interpersonal skills are good.
  3. Very enthusiastic about social work

Many tokenizers (including those in nltk and spaCy) can work with regular expressions. Adapting this code to their frameworks might not be trivial, though.
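One way to sidestep that adaptation entirely is to keep the Punkt tokenizer as-is and merge any bare numeric bullet with the sentence that follows it in a post-processing pass. A sketch of that idea, run here directly on the token list the question reports as output (the helper name `merge_numeric_bullets` is my own):

```python
import re

# The pieces PunktSentenceTokenizer produced for the paragraph in the question:
pieces = ['1.', 'Candidate is very poor in mathematics.',
          '2.', 'Interpersonal skills are good.',
          '3.', 'Very enthusiastic about social work']

def merge_numeric_bullets(pieces):
    """Glue a bare bullet like '2.' onto the piece that follows it."""
    merged = []
    for piece in pieces:
        if merged and re.fullmatch(r'\d+\.', merged[-1]):
            merged[-1] += ' ' + piece   # previous piece was a bullet: attach
        else:
            merged.append(piece)
    return merged

for sentence in merge_numeric_bullets(pieces):
    print(sentence)
```

Since this runs after tokenization, it leaves Punkt's behaviour on ordinary sentences untouched and only repairs the numeric-bullet case.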
