
How to perform paragraph boundary detection in NLP frameworks?

Original post: 2013-11-19 11:04:18 | Tags: nlp, text-processing, stanford-nlp, opennlp, apache-stanbol

I am working on extracting the names of people from various ads appearing in English newspapers.

However, I have noticed that I need to identify the boundaries of each ad before extracting the names in it, since I only need the first name that occurs in each ad. I started with Stanford NLP and was able to extract names, but I am stuck on identifying the paragraph boundaries.

Is there any way of identifying paragraph boundaries?

1 Answer

This is a difficult problem; we are facing the same one in one of our projects. There are some theoretical papers that help define the scope of the problem and potential solutions in detail. I'll include them below.

We're still in the R&D phase, so we don't have many answers yet, but we are willing to share what we have and what we find as time goes on.

Here is one such paper:

Automatic Paragraph Identification: A Study across Languages and Domains

Here is the GitHub link for the icsiboost code they use:

Open-source implementation of Boostexter (AdaBoost-based classifier)
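
To make the idea concrete, here is a rough sketch of how paragraph (or ad) boundary detection can be framed as a per-sentence binary classification problem and solved with AdaBoost over decision stumps. This is not the paper's feature set and not the API of the Boostexter implementation mentioned above: the features, the function names (`features`, `adaboost`, `segment`), and the toy training data are all made up for illustration, and a real system would need far richer features and labelled data from your own newspapers.

```python
import math

def features(sentence, prev_sentence):
    """Hypothetical surface features; a real system would use many more."""
    words = sentence.split()
    return [
        float(len(words)),                                # length of this sentence
        float(len(prev_sentence.split())),                # length of the previous one
        1.0 if words and words[0].isupper() else 0.0,     # starts with an ALL-CAPS word
        1.0 if prev_sentence.rstrip().endswith((".", "!", "?")) else 0.0,
    ]

def stump_predict(x, j, thr, pol):
    """Decision stump: vote `pol` if feature j is >= thr, otherwise -pol."""
    return pol if x[j] >= thr else -pol

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None  # (error, feature index, threshold, polarity)
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    """Discrete AdaBoost; labels must be +1 (starts a paragraph) or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Raise the weight of misclassified examples, lower the rest, renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, j, thr, pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(a * stump_predict(x, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1

def segment(model, sentences):
    """Group sentences into paragraphs using the classifier's boundary votes."""
    paragraphs, prev = [], ""
    for s in sentences:
        if not paragraphs or predict(model, features(s, prev)) == 1:
            paragraphs.append([])
        paragraphs[-1].append(s)
        prev = s
    return paragraphs

# Toy, invented training data: +1 means "this sentence starts a new ad".
LABELLED = [
    ("WANTED experienced driver for a family in Pune.", 1),
    ("Contact Ramesh Kumar on the number given below today.", -1),
    ("Salary is negotiable.", -1),
    ("FOR SALE two bedroom flat near the station.", 1),
    ("Call Anita Desai after 6 pm.", -1),
    ("LOST brown leather wallet at the central market.", 1),
    ("Reward offered to the finder.", -1),
]
sentences = [s for s, _ in LABELLED]
labels = [l for _, l in LABELLED]
X = [features(s, sentences[i - 1] if i else "") for i, s in enumerate(sentences)]

model = adaboost(X, labels)
paragraphs = segment(model, sentences)
for p in paragraphs:
    print(p)
```

On this toy data the classifier fits the training labels and `segment` splits the seven sentences back into the three ads. Once you have the segments, you can run your existing name extractor on each one and keep only the first name it returns. Whether such simple features carry over to real newspaper text is something you would have to measure on held-out data.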