
Extracting only meaningful text from webpages

I am getting a list of URLs and scraping them using NLTK. My end result is a list of all the words on the webpage. The trouble is that I am only looking for keywords and phrases that are not the usual English "sugar" words such as "as", "and", "like", "to", "am", "for", etc. I know I could construct a file of all common English words and simply remove them from my scraped token list, but is there a built-in feature in some library that does this automatically?

I am essentially looking for useful words on a page that are not fluff and that can give some context to what the page is about, almost like the tags on Stack Overflow or the tags Google uses for SEO.

I think what you are looking for is stopwords.words from nltk.corpus:

>>> from nltk.corpus import stopwords
>>> sw = set(stopwords.words('english'))
>>> sentence = "a long sentence that contains a for instance"
>>> [w for w in sentence.split() if w not in sw]
['long', 'sentence', 'contains', 'instance']
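Note that NLTK's stopword list is all lowercase, so on scraped text you would normally tokenize and lowercase first. A minimal sketch continuing the session above (word_tokenize requires the NLTK punkt data to be downloaded):

>>> from nltk.tokenize import word_tokenize
>>> text = "A long sentence that contains a FOR instance"
>>> [w for w in word_tokenize(text.lower()) if w.isalpha() and w not in sw]
['long', 'sentence', 'contains', 'instance']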

Edit: searching for "stopword" turns up possible duplicates: Stopword removal with NLTK and How to remove stop words using nltk or python. See the answers to those questions, and consider Effects of Stemming on the term frequency? too.

While you can get robust lists of stop words from NLTK (and elsewhere), you can easily build your own lists according to the kind of data (register) you process. Most of the words you do not want are so-called grammatical words: they are extremely frequent, so you can catch them easily by sorting a frequency list in descending order and discarding the n top items.

In my experience, the first 100 ranks of any moderately large corpus (>10k tokens of running text) hardly contain any content words.
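As a concrete illustration, here is a minimal sketch of that approach; tokens (your scraped word list) and the cutoff n=100 are assumptions you would tune against your own data:

from collections import Counter

def top_rank_stopwords(tokens, n=100):
    # Rank words by raw frequency; the top ranks are almost
    # exclusively grammatical (function) words.
    freq = Counter(t.lower() for t in tokens)
    return {w for w, _ in freq.most_common(n)}

custom_sw = top_rank_stopwords(tokens)  # tokens: your scraped word list
content = [t for t in tokens if t.lower() not in custom_sw]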

It seems that you are interested in extracting keywords, however. For this task, pure frequency signatures are not very useful. You will need to transform the frequencies into some other value with respect to a reference corpus: this is called weighting, and there are many different ways to achieve it. Tf-idf has been the industry standard since 1972.
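As a rough sketch of the idea (not a production implementation), tf-idf can be computed in a few lines; docs, a list of token lists, is an assumed input, and in practice you would more likely reach for scikit-learn's TfidfVectorizer or gensim:

import math
from collections import Counter

def tfidf_weights(docs):
    # docs: a list of documents, each given as a list of tokens.
    n = len(docs)
    df = Counter()  # number of documents each term occurs in
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency * inverse document frequency;
        # a term occurring in every document weighs zero
        weighted.append({t: (c / len(doc)) * math.log(n / df[t])
                         for t, c in tf.items()})
    return weighted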

If you are going to spend time on these tasks, get an introductory handbook on corpus linguistics or computational linguistics.

You can look through available linguistic corpora for data on word frequencies (along with other annotations).

You can start from the links on Wikipedia: http://en.wikipedia.org/wiki/Corpus_linguistics#External_links

You can probably find more information at https://linguistics.stackexchange.com/
