
Lucene: Mining email addresses, names, and identifiers from an index

I have a Lucene index with approximately 1 million documents. From these documents, I want to mine:

  1. email addresses
  2. signatures - ( [whitespace]/s/[whitespace]john doe[whitespace] )
  3. specific identifiers from each of the documents (that follow the regex pattern "\\s[0-9]{3}[a-zA-Z0-9]{6}\\s" ).

I understand that ideally this would be done with Solr at index build time, where it's much easier, but how can one do this from an already built Lucene index?

I am using Java. For the email address search, I tried .setAllowLeadingWildcard(true) and then searched for @ to find all email addresses, but I got zero results. If I search for @ in Luke, I also get zero results. If I search for @hotmail.com in Luke, I get a bunch of results with valid email addresses such as aaaaa@hotmail.com.

The index was created using StandardAnalyzer. Not sure if it matters, but I believe the text is in UTF-8.

Any helpful suggestions or pointers would be great! Note that this is not for a front end, so the query doesn't have to be near real time.

Analysis does matter, yes. The standard analyzer treats whitespace and punctuation, such as @, as places to split the input into tokens. As such, you wouldn't expect to see any of those characters actually present in the indexed data.

You can use Lucene's regex query, particularly for the third case. A PhraseQuery seems appropriate for the second, I think, though I'm more than slightly confused about what you are trying to accomplish there.
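To illustrate the third case: a regex query is matched against individual indexed terms, so the pattern has to describe one whole term. The sketch below does not use Lucene at all; it imitates analysis with a crude lowercase-and-split step (an assumption, not StandardAnalyzer's real rules) and then applies a term-level version of the question's pattern. The class name `IdentifierTermFilter` and its methods are hypothetical. The `\s` anchors are dropped because whitespace never ends up inside a term, and the character class is lowercase only on the assumption that the analyzer lowercased the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class IdentifierTermFilter {
    // Term-level form of the question's pattern: no \s anchors, lowercase only.
    static final Pattern TERM_PATTERN = Pattern.compile("[0-9]{3}[a-z0-9]{6}");

    // Crude stand-in for analysis (assumption): lowercase the text, then split
    // on every run of characters that are neither letters nor digits.
    static List<String> toTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Keep only the terms that match the identifier pattern in full.
    static List<String> identifiers(String text) {
        List<String> hits = new ArrayList<>();
        for (String term : toTerms(text)) {
            if (TERM_PATTERN.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "12ab" is too short and "9876543210" is too long for a full match.
        System.out.println(identifiers("Ref 123AbC456 filed, see also 12ab and 9876543210."));
    }
}
```

Note that the match comes back lowercased (`123abc456`), which is also what you would expect to recover from a lowercased index rather than the original casing.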

Generally, you might want to use a different analyzer for an email field, so that each address is indexed as a single token. With the current index you should still get reasonable results searching for a particular e-mail address: although the analyzer removes the punctuation, searching for the three (usually) tokens of an e-mail address consecutively in a phrase would be expected to get good matches. However, a regex search like \\w*@\\w*\\.\\w* won't be particularly effective, since the punctuation isn't actually indexed and searchable, and a regex search doesn't span multiple terms in the index. Apart from searching for a known set of e-mail domains, or something of that nature, you would want to re-index using analysis more in line with how you need to search, in order to do what you are asking.
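The difference between the two approaches can be sketched in plain Java, again without Lucene. The tokenizer here is a simplified stand-in for analysis (an assumption; the real analyzer's splitting rules differ in detail), and `EmailRegexDemo` with its methods is a hypothetical name. It shows that no single term can match the e-mail regex once the punctuation is gone, that the same regex does match the original text, and that a known address can still be found as consecutive terms, which is roughly what a phrase search does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailRegexDemo {
    // The answer's example pattern, written as a Java string literal.
    static final Pattern EMAIL = Pattern.compile("\\w*@\\w*\\.\\w*");

    // Simplified stand-in for analysis (assumption): lowercase the text, then
    // split on every run of characters that are neither letters nor digits.
    static List<String> toTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Terms that match the e-mail pattern in full (term-by-term matching).
    static List<String> matchingTerms(String text) {
        List<String> hits = new ArrayList<>();
        for (String term : toTerms(text)) {
            if (EMAIL.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    // Substrings of the original, unanalyzed text that match the pattern.
    static List<String> matchesInRawText(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // True if the phrase terms occur consecutively, in order, in the document terms.
    static boolean containsPhrase(List<String> docTerms, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(docTerms, phrase) >= 0;
    }

    public static void main(String[] args) {
        String text = "Contact aaaaa@hotmail.com for details.";
        System.out.println(matchingTerms(text));      // no term contains '@' or '.'
        System.out.println(matchesInRawText(text));   // the regex works on the raw text
        System.out.println(containsPhrase(toTerms(text), toTerms("aaaaa@hotmail.com")));
    }
}
```

The phrase check only helps when the address is already known. Discovering unknown addresses needs text in which the punctuation survives, which is why re-indexing with a more suitable analysis is suggested above.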
