
Lucene: Mining email addresses, names, and identifiers from an index

Original post: 2013-11-12 20:38:15. Tags: java, regex, lucene

I have a Lucene index with approximately 1 million documents. From these documents, I want to mine:

1. email addresses
2. signatures - ( [whitespace]/s/[whitespace]john doe[whitespace] )
3. specific identifiers from each of the documents (that follow the regex pattern "\\s[0-9]{3}[a-zA-Z0-9]{6}\\s").

I understand that ideally, using Solr at index build time, this is much easier, but how can one do this from an already built Lucene index?

I am using Java. For the email address search, I tried .setAllowLeadingWildcard(true) and then searched for @ to find all email addresses, but I got zero results. If I search for @ in Luke, I also get zero results. If I search for @hotmail.com in Luke, I get a bunch of results with valid email addresses such as aaaaa@hotmail.com.

The index was created using StandardAnalyzer. Not sure if it matters, but I believe the text is in UTF-8.

Any helpful suggestions or pointers would be great! Note that this is not for a front end, so the query doesn't have to be near real time.

1 answer

Analysis does matter, yes. The standard analyzer treats whitespace and punctuation, such as @, as places to split the input into tokens. As such, you wouldn't expect to see any of them actually present in the indexed data.

You can use Lucene's regex query, particularly for the third case. A PhraseQuery seems appropriate for the second, I think, though I'm more than slightly confused about what you are trying to accomplish there.
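To illustrate why the third case maps well onto a per-term regex: the index holds individual tokens, so the whitespace delimiters from the original pattern never appear inside a term, and StandardAnalyzer lowercases tokens. The sketch below uses plain java.util.regex as a stand-in for that whole-term matching. It is not Lucene's API; the class name TermPatternSketch, its method, and the sample terms are hypothetical, and it assumes each identifier was indexed as a single lowercased token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TermPatternSketch {
    // The original pattern without its \\s delimiters, restricted to
    // lowercase letters on the assumption that tokens were lowercased.
    private static final Pattern IDENTIFIER =
            Pattern.compile("[0-9]{3}[a-z0-9]{6}");

    /** Returns the terms that fit the identifier pattern in full. */
    public static List<String> matchingTerms(List<String> terms) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            // matches() succeeds only if the whole term fits the pattern.
            if (IDENTIFIER.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical terms, standing in for entries of the term dictionary.
        List<String> sampleTerms =
                Arrays.asList("123abcdef", "hotmail.com", "12abcdefg", "999000xyz1");
        System.out.println(matchingTerms(sampleTerms));
    }
}
```

Only terms of exactly three digits followed by six alphanumerics pass; longer or shorter tokens are rejected because the match must cover the whole term.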

Generally, you might want to use a different analyzer for an email field, so that each address is indexed as a single token. You should get reasonable results searching for a particular e-mail address, since, though the analyzer would remove the punctuation, searching for the three (usually) tokens of an email consecutively in a phrase would be expected to get good matches. However, a regex search like \\w*@\\w*\\.\\w* won't be particularly effective, since the punctuation won't actually be indexed and searchable, and a regex search doesn't span multiple terms in the index. Apart from searching for a known set of e-mail domains, or something of that nature, you would want to re-index using analysis more in line with how you need to search, in order to do what you are asking.
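If the original text is still available outside the analyzed terms (for example as stored field values or the source files; whether that holds for this index is an assumption), the addresses can be pulled out with an ordinary regex over the raw text, either as a one-off mining pass or as the step that feeds a dedicated email field when re-indexing. A minimal sketch, with a hypothetical class name and a deliberately simple address pattern that will miss some valid addresses:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailMinerSketch {
    // Simplified address pattern (an assumption, not a full RFC 5322 matcher):
    // local part, '@', domain labels, and a final label of 2+ letters.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns every substring of the raw text that looks like an address. */
    public static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical raw document text; the '@' is intact here because
        // no analyzer has tokenized it.
        String raw = "Write to aaaaa@hotmail.com or bob.smith@example.org today";
        System.out.println(findEmails(raw));
    }
}
```

This works on raw text precisely because the punctuation has not been stripped yet, which is the information the analyzed index no longer carries.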