
How to extract entities from HTML using natural language processing or other techniques

Original post 2013-11-21 17:55:41 6 1 machine-learning/ nlp/ named-entity-extraction

I am trying to parse entities from web pages that contain a time, a place, and a name. I read a little about natural language processing and entity extraction, but I am not sure whether I am heading down the wrong path, so I am asking here.

I haven't started implementing anything yet, so it is fine if certain open source libraries are only available for a specific programming language.

Often the data will not be found in sentences, but instead in HTML structures such as lists, e.g.:

2013-02-01 - Name of Event - Arena Name

The structure of the web pages will be vastly different (some might use lists, some might put the data in a table, etc.).

What topics can I research to learn more about how to achieve this? Are there any open source libraries that take the structure of the HTML into account when doing entity extraction? Would extracting these (name, time, place) entities from HTML be better (or even possible) with machine vision, where the CSS styling might make it easier to differentiate the important parts (name, time, location) of the unstructured text?

I think any guidance on topics or open source projects that I can research would help.

1 answer

Many programming languages have libraries that generate canonical date stamps from various formats (e.g. SimpleDateFormat in Java). As you say, the structure of the web pages will be vastly different, but dates are expressed in only a small number of variations, so writing regular expressions for a few (say, half a dozen) formats will enable extraction of dates from most, if not all, HTML pages.
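As a rough illustration of the regular-expression idea, here is a minimal Python sketch. The three patterns, the `DATE_PATTERNS` name, and the `extract_dates` helper are assumptions chosen for the example, not a complete list of formats; slash-separated dates are assumed to be month/day/year, and month names are assumed to be English.

```python
import re
from datetime import datetime

# Hypothetical pattern list: each regex is paired with the strptime
# format used to turn a match into a canonical date.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),        # 2013-02-01
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),    # 2/1/2013 (assumed US order)
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),  # February 1, 2013
]


def extract_dates(text):
    """Return the dates found in text as ISO strings, in order of appearance."""
    found = []
    for regex, fmt in DATE_PATTERNS:
        for match in regex.finditer(text):
            try:
                parsed = datetime.strptime(match.group(0), fmt)
            except ValueError:
                # Looks like a date but is not a valid one (e.g. month 13).
                continue
            found.append((match.start(), parsed.date().isoformat()))
    return [iso for _, iso in sorted(found)]


if __name__ == "__main__":
    print(extract_dates("2013-02-01 - Name of Event - Arena Name"))
```

Running the regexes over the visible text of a page, rather than the raw markup, avoids matching date-like strings inside attributes or scripts.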

Extraction of places and names is harder, however. This is where natural language processing has to come in. What you are looking for is a Named Entity Recognition (NER) system. One of the best open source NER systems is the Stanford NER. Before using it, you should check out their online demo. The demo has three classifiers (for English) that you can choose from. For most of my tasks, I find their english.all.3class.distsim classifier to be quite accurate.

Note that an NER system performs well when the places and names you want to extract occur in sentences. If they occur in HTML labels, this approach is probably not going to be very helpful.
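For delimiter-style entries like the `2013-02-01 - Name of Event - Arena Name` example in the question, a simple structural rule may be enough without NER. The sketch below uses Python's standard `html.parser` module; the `ItemTextParser` class, the `parse_event_items` helper, and the assumption that fields are separated by " - " in time, name, place order are all hypothetical choices for illustration. It only handles entries that sit inside a single `<li>` or `<td>`; tables with one field per cell would need a different rule.

```python
import re
from html.parser import HTMLParser


class ItemTextParser(HTMLParser):
    """Collect the whitespace-normalized text of each <li> and <td> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # None while outside an <li>/<td>

    def handle_starttag(self, tag, attrs):
        if tag in ("li", "td"):
            self._flush()  # tolerate a missing closing tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("li", "td"):
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append(text)
        self._buffer = None


ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def parse_event_items(html):
    """Return a list of {time, name, place} dicts for 'date - name - place' items."""
    parser = ItemTextParser()
    parser.feed(html)
    parser.close()
    events = []
    for item in parser.items:
        parts = [part.strip() for part in item.split(" - ")]
        # Assumed layout: exactly three fields, the first an ISO date.
        if len(parts) == 3 and ISO_DATE.match(parts[0]):
            events.append({"time": parts[0], "name": parts[1], "place": parts[2]})
    return events


if __name__ == "__main__":
    page = "<ul><li>2013-02-01 - Name of Event - Arena Name</li><li>About us</li></ul>"
    print(parse_event_items(page))
```

Items that do not fit the assumed layout are skipped rather than guessed at, so pages with other layouts would need their own rules or a fallback to NER on the extracted text.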