
What is a simple way to generate keywords from a text?

I suppose I could take a text and remove high-frequency English words from it. By keywords, I mean the words that best characterize the content of the text (tags). It doesn't have to be perfect; a good approximation would suit my needs.

Has anyone done anything like that? Do you know of a Perl or Python library that does it?

Lingua::EN::Tagger is exactly what I asked for; however, I need a library that can work for French text too.

The name for those "high frequency English words" is stop words, and there are many lists available. I'm not aware of any Python or Perl libraries, but you could encode your stop word list in a binary tree or hash (or you could use Python's frozenset), then, as you read each word from the input text, check whether it is in your stop list and filter it out.

Note that after you remove the stop words you'll need to do some stemming to normalize the resulting text (remove plurals, -ings, -eds), then remove all the duplicate "keywords".
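A minimal Python sketch of that pipeline, using a tiny hand-rolled stop list and a crude suffix stripper in place of a real stemmer (both are placeholders for illustration only, as are the names STOP_WORDS, naive_stem, and keywords):

# Tiny placeholder stop list; real lists contain hundreds of words.
STOP_WORDS = frozenset(
    "a an and are as at be by for from in is it of on that the to was were with".split()
)

def naive_stem(word):
    # Very crude normalization standing in for a real stemmer (e.g. Porter):
    # strip at most one common English suffix.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def keywords(text):
    words = (w.strip('.,"!?;:()').lower() for w in text.split())
    kept = (naive_stem(w) for w in words if w and w not in STOP_WORDS)
    return set(kept)  # the set removes duplicate "keywords"

print(keywords("The dogs were barking at the coffeehouse"))
# -> dog, bark, coffeehouse (in arbitrary set order)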

You could try using the Perl module Lingua::EN::Tagger for a quick and easy solution.

A more complicated module, Lingua::EN::Semtags::Engine, uses Lingua::EN::Tagger with a WordNet database to get more structured output. Both are pretty easy to use; just check out the documentation on CPAN or use perldoc after you install the module.

To find the most frequently used words in a text, do something like this:

#!/usr/bin/perl -w

use strict;
use warnings 'all';

# Read the text:
open my $ifh, '<', 'text.txt'
  or die "Cannot open file: $!";
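# Slurp mode: undefining the record separator makes the next read return the whole file.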
local $/;
my $text = <$ifh>;

# Find all the words, and count how many times they appear:
my %words = ( );
map { $words{$_}++ }
  grep { length > 1 && $_ =~ m/^[\@a-z-']+$/i }
    map { s/[",\.]//g; $_ }
      split /\s/, $text;

print "Words, sorted by frequency:\n";
my (@data_line);
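# Declare a report format: a left-justified, truncated word column next to a right-justified count.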
format FMT = 
@<<<<<<<<<<<<<<<<<<<<<<...     @########
@data_line
.
local $~ = 'FMT';

# Sort them by frequency:
map { @data_line = ($_, $words{$_}); write(); }
  sort { $words{$b} <=> $words{$a} }
    grep { $words{$_} > 2 }
      keys(%words);

Example output looks like this:

john@ubuntu-pc1:~/Desktop$ perl frequency.pl 
Words, sorted by frequency:
for                                   32
Jan                                   27
am                                    26
of                                    21
your                                  21
to                                    18
in                                    17
the                                   17
Get                                   13
you                                   13
OTRS                                  11
today                                 11
PSM                                   10
Card                                  10
me                                     9
on                                     9
and                                    9
Offline                                9
with                                   9
Invited                                9
Black                                  8
get                                    8
Web                                    7
Starred                                7
All                                    7
View                                   7
Obama                                  7

In Perl there's Lingua::EN::Keywords.

The simplest way to do what you want is this...

>>> text = "this is some of the sample text"
>>> words = [word for word in set(text.split(" ")) if len(word) > 3]
>>> words
['this', 'some', 'sample', 'text']

I don't know of any standard module that does this, but it wouldn't be hard to replace the limit on three-letter words with a lookup into a set of common English words.
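For instance, the same comprehension with a small stand-in set of common words swapped in for the length check (the common set here is just a placeholder; a real stop-word list would be much larger):

>>> common = frozenset(["this", "is", "some", "of", "the"])
>>> text = "this is some of the sample text"
>>> sorted(word for word in set(text.split(" ")) if word not in common)
['sample', 'text']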

One-liner solution (words longer than two characters that occurred more than twice):

perl -ne'$h{$1}++while m/\b(\w{3,})\b/g}{printf"%-20s %5d\n",$_,$h{$_}for sort{$h{$b}<=>$h{$a}}grep{$h{$_}>2}keys%h'

EDIT: If you want words with the same frequency sorted alphabetically, you can use this enhanced version:

perl -ne'$h{$1}++while m/\b(\w{3,})\b/g}{printf"%-20s %5d\n",$_,$h{$_}for sort{$h{$b}<=>$h{$a}or$a cmp$b}grep{$h{$_}>2}keys%h'
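For readers more comfortable with Python, a rough equivalent of the enhanced one-liner might look like this (my own translation, not part of the original answer):

import re
import sys
from collections import Counter

# Count every word of three or more characters on standard input.
counts = Counter(re.findall(r"\b\w{3,}\b", sys.stdin.read()))

# Keep words seen more than twice, sorted by descending frequency,
# then alphabetically for ties.
for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    if n > 2:
        print(f"{word:<20} {n:5d}")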

TF-IDF (Term Frequency - Inverse Document Frequency) is designed for this.

Basically it asks: which words are frequent in this document, compared to all documents?

It will give a lower score to words that appear in all documents, and a higher score to words that appear frequently in a given document.
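As a sketch, one common textbook formulation (a generic variant, not necessarily the exact formula in the worksheet linked below) scores a term t in document d as tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t:

import math
from collections import Counter

def tf_idf(documents):
    # documents: a list of token lists; returns one {term: score} dict per document.
    n_docs = len(documents)
    df = Counter()                 # number of documents containing each term
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        # Terms appearing in every document get log(N/N) = 0.
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [["dog", "barks", "dog"], ["cat", "meows"], ["dog", "and", "cat"]]
print(tf_idf(docs)[0])  # "barks", unique to this document, scores highest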

You can see a worksheet of the calculations here:

https://docs.google.com/spreadsheet/ccc?key=0AreO9JhY28gcdFMtUFJrc0dRdkpiUWlhNHVGS1h5Y2c&usp=sharing

(switch to the TFIDF tab at the bottom)

Here is a Python library:

https://github.com/hrs/python-tf-idf

I think the most accurate way that still maintains a semblance of simplicity would be to count the word frequencies in your source, then weight them according to their frequencies in common English (or whatever other language) usage.

Words that appear less frequently in common usage, like "coffeehouse", are more likely to be keywords than words that appear more often, like "dog". Still, if your source mentions "dog" 500 times and "coffeehouse" twice, it's more likely that "dog" is a keyword, even though it's a common word.

Deciding on the weighting scheme would be the difficult part.
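As one illustration of such a scheme (my own example with a made-up background-frequency table and a hypothetical keyword_scores helper, not something from the answer), you could divide each word's count in the source by its relative frequency in general usage:

# Hypothetical background frequencies (occurrences per million words of
# general English); a real table would come from a large reference corpus.
BACKGROUND = {"dog": 100.0, "coffeehouse": 0.5, "the": 60000.0}

def keyword_scores(doc_counts, background, unseen=0.5):
    # Rare in general usage => higher weight; frequent in the source => higher score.
    return {w: n / background.get(w, unseen) for w, n in doc_counts.items()}

print(keyword_scores({"dog": 500, "coffeehouse": 2, "the": 900}, BACKGROUND))
# dog: 5.0, coffeehouse: 4.0, the: 0.015 -- "dog" wins on sheer repetition

How rare words trade off against raw counts (the divisor, the default for unseen words) is exactly the part that is hard to get right.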
