
How to extract (recognize) a book title from an article?

Is there a good method for extracting (recognizing) book titles from an article using nltk or some other tool?

I can already recognize author names using nltk, so my idea is to get a list of book titles with their authors from some external source. When I recognize an author's name, I could look up that author's books in the external source and search for them in the text.

However, I'm not convinced by this solution: it needs an external source that covers all books, which I don't have, and it feels a bit like brute force.

Can you point me to topics that would help with this problem?
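For reference, the author-based lookup described in the question might be sketched roughly as follows. This is only an illustration under stated assumptions: `CATALOGUE` is a hypothetical stand-in for the external source, `find_titles_by_author` is a made-up helper name, and the list of recognized authors is assumed to come from an earlier step (for example nltk's named-entity recognition).

```python
import re

# Hypothetical stand-in for the external source (author -> known titles).
CATALOGUE = {
    "George Orwell": ["Animal Farm", "1984", "Homage to Catalonia"],
    "Jane Austen": ["Emma", "Pride and Prejudice"],
}


def find_titles_by_author(text, authors, catalogue):
    """Return (author, title) pairs for catalogue titles that occur in text.

    `authors` is assumed to be the list of names already recognized in the
    article by a separate step.
    """
    found = []
    for author in authors:
        for title in catalogue.get(author, []):
            # Lookarounds keep a title from matching inside a longer word.
            pattern = r"(?<!\w)" + re.escape(title) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append((author, title))
    return found
```

Only the titles of authors already found in the article are searched, which keeps the lookup small, but the result is only as complete as the catalogue behind it.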

Given sufficient training data, there is a wonderful Python library for this kind of task: https://github.com/snipsco/snips-nlu

What you might want to do is collect examples from as many articles containing book titles as you can and follow the documentation in that repository. You should then be able to extract book titles from articles, assuming they follow a pattern similar to your example data.

I'm not 100% sure this is a task for machine learning, however. There may be an easier way, such as looking for words or phrases that are in quotes, italicized, etc. Humans don't necessarily know that a bunch of words is the title of a book, so we invented punctuation to make that explicit. It seems to me your solution should make use of that syntax if possible.
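A minimal sketch of that punctuation-based idea, using only Python's standard `re` module. The pattern list, the length limits, and the "starts with an uppercase letter or digit" filter are my own assumptions, not an established recipe; `candidate_titles` is a hypothetical helper name, and the markup it handles (double quotes, Markdown asterisks, HTML `<i>`/`<em>` tags) would need adjusting to whatever your articles actually use.

```python
import re

# Assumed markup conventions for titles: straight or curly double quotes,
# Markdown-style *italics*, and HTML <i>/<em> tags.
TITLE_PATTERNS = [
    r'"([^"\n]{2,100})"',
    r'“([^”\n]{2,100})”',
    r'(?<!\*)\*([^*\n]{2,100})\*(?!\*)',
    r'<(?:i|em)>([^<]{2,100})</(?:i|em)>',
]


def candidate_titles(text):
    """Return quoted or italicized spans that look like titles, in text order."""
    hits = []
    for pattern in TITLE_PATTERNS:
        for match in re.finditer(pattern, text):
            span = match.group(1).strip()
            # Heuristic filter: keep spans starting with an uppercase letter
            # or a digit, which drops most ordinary quoted words.
            if span and (span[0].isupper() or span[0].isdigit()):
                hits.append((match.start(), span))
    return [span for _, span in sorted(hits)]
```

This returns candidates only: quoted speech and article titles will slip through, so the output would still need checking, for instance against nearby author names.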
