
ner, spacy, sentence segmentation

I want to split these sentences so that I can process them with spaCy:

Finally, on 1595 July 22 at 2h 40m am, when the sun was at 7° 59' 52" Leo, 101,487 distant from earth, Mars's mean longitude 11s 14° 9' 5", and anomaly 164° 48' 55", and consequent eccentric position from the vicarious hypothesis 17° 16' 36" Pisces: the apparent position of Mars, from the most select observations, was 4° 11' 10" Taurus, lat. 2° 30' S ^37. Thus we twice have Mars in the most opportune position, in quadrature with the sun, while the positions of earth and Mars are also distant by a quadrant.\n

I want the result to be like this:

[
Finally, on 1595 July 22 at 2h 40m am, when the sun was at 7° 59' 52" Leo, 101,487 distant from earth, Mars's mean longitude 11s 14° 9' 5", and anomaly 164° 48' 55", and consequent eccentric position from the vicarious hypothesis 17° 16' 36" Pisces: the apparent position of Mars, from the most select observations, was 4° 11' 10" Taurus, lat. 2° 30' S ^37. ,

  Thus we twice have Mars in the most opportune position, in quadrature with the sun, while the positions of earth and Mars are also distant by a quadrant.\n ]

That is, two sentences: the first one should end after "lat. 2° 30' S ^37." But because "lat." ends with a dot, the segmenter breaks the sentence right after "lat.".

I have not found a solution so far. I have tried these two approaches:

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        # Note the trailing comma: in the original, `in ("lat.")` tested
        # substring membership in the string "lat.", not tuple membership.
        # Also check "lat", since spaCy may tokenize the dot separately.
        if token.text in ("lat.", "lat"):
            # print("Detected:", token.text)
            # The unwanted break happens AFTER the abbreviation, so the
            # token that must not start a sentence is the NEXT one.
            doc[token.i + 1].is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
nlp.pipeline

and

a.split('.')

I think there are some small mistakes in the first code.

Neither of the two methods above splits the sentences as desired!

Generally, what do you recommend for segmenting a paragraph into sentences, especially when we have abbreviation cases like

lat. 

I have used this and it works:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

def SentenceSegmentation(Para):
    punkt_param = PunktParameters()
    # Any abbreviations, written WITHOUT the trailing dot:
    # lat -> latitude, ch -> chapter
    abbreviation = ['lat', 'ch']
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.train(Para)
    return tokenizer.tokenize(Para)
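For comparison, the same idea can be sketched without any NLP library. The helper below (my own illustration, not from the answer above; the name `split_sentences` and the abbreviation set are assumptions) splits on sentence-ending punctuation followed by whitespace and a capital letter, then glues a piece back onto the previous one whenever that piece was cut off right after a known abbreviation:

```python
import re

# Known abbreviations that end in a dot but do not end a sentence.
ABBREVIATIONS = {"lat.", "ch.", "e.g.", "i.e."}

def split_sentences(text):
    # Candidate breaks: ., !, or ? followed by whitespace and a capital.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    sentences = []
    for part in parts:
        # If the previous chunk ended with an abbreviation, the split was
        # spurious: glue this chunk back instead of starting a new sentence.
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences
```

This is only a rough sketch (it would miss breaks before lowercase letters or digits, for example), but it shows the core trick both approaches rely on: propose breaks first, then veto the ones that follow an abbreviation.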
