
Lucene: Mining email addresses, names, and identifiers from an index

I have a Lucene index with approximately 1 million documents. From these documents, I want to mine:

  1. email addresses
  2. signatures - ( [whitespace]/s/[whitespace]john doe[whitespace] )
  3. specific identifiers from each of the documents (that follow a regex pattern "\\s[0-9]{3}[a-zA-Z0-9]{6}\\s" ).

I understand that this is ideally done at index build time, for example with Solr, where it is much easier. But how can one do it from an already built Lucene index?

I am using Java. For the email address search, I called .setAllowLeadingWildcard(true) and then searched for @ to find all email addresses, but I got zero results. If I search for @ in Luke, I also get zero results. If I search for @hotmail.com in Luke, I get a bunch of results with valid email addresses such as aaaaa@hotmail.com.

The index was created using StandardAnalyzer. Not sure if it matters, but I believe the text is in UTF-8.

Any helpful suggestions or pointers would be great! Note that this is not for a front end, so queries don't have to be near real-time.

Analysis does matter, yes. The standard analyzer treats whitespace and punctuation, such as @, as places to split the input into tokens. As such, you wouldn't expect to see any of those characters actually present in the indexed data.
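To make the effect concrete, here is a toy illustration in plain Java. It does not use Lucene at all: the `tokenize` method is a hypothetical, heavily simplified stand-in that lowercases the text and splits on whitespace and @, which is only an approximation of what a real analyzer does. Under that assumption, a lookup for the bare @ finds nothing, while the domain part is still present as a term, which would be consistent with what the question reports from Luke.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyTokenIndex {
    // Rough stand-in for analysis: lowercase, then split on whitespace and '@'.
    // This is NOT Lucene's StandardTokenizer, only an illustration of the effect.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[\\s@]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The "index" here is just the set of distinct tokens.
        Set<String> terms = new HashSet<>(tokenize("Contact aaaaa@hotmail.com today"));
        System.out.println(terms.contains("@"));           // false: the separator was never indexed
        System.out.println(terms.contains("hotmail.com")); // true: the domain survives as a term
    }
}
```

To see what the real index contains, inspecting the terms of the field in Luke is the more reliable check.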

You can use Lucene's regex query, particularly for the third case. A PhraseQuery seems appropriate for the second, I think, though I'm more than slightly confused about what you are trying to accomplish there.
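For the third case, note that a regex query is matched against whole individual terms, so the pattern from the question has to be adapted first. The sketch below uses plain `java.util.regex` on a hand-written list of terms rather than Lucene itself, and it assumes the analyzer lowercased the text and stripped the surrounding whitespace. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TermRegexFilter {
    // Adapted from the question's pattern: no \s (terms hold no whitespace)
    // and no A-Z (assumes the analyzer lowercased the text).
    static final Pattern ID_TERM = Pattern.compile("[0-9]{3}[a-z0-9]{6}");

    // Keep only the terms that match the pattern in full.
    static List<String> matchingTerms(List<String> terms) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (ID_TERM.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical terms, standing in for what a field's term dictionary might hold.
        List<String> terms = Arrays.asList("123abc4de", "hello", "12ab", "999zzzzzz", "1234abcdefg");
        System.out.println(matchingTerms(terms)); // [123abc4de, 999zzzzzz]
    }
}
```

With Lucene itself, the same adapted pattern would go into a `RegexpQuery` on the relevant field. Lucene's regex syntax is its own dialect and differs from `java.util.regex` in places, so the pattern should be checked against the documentation for the version in use.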

Generally, you might want to use a different analyzer for an email field, so that each address is indexed as a single token. With the current index, you should still get reasonable results when searching for a particular e-mail address: although the analyzer removes the punctuation, searching for the tokens of the address consecutively in a phrase would be expected to give good matches. However, a regex search like \\w*@\\w*\\.\\w* won't be particularly effective, since the punctuation isn't actually indexed and searchable, and a regex search doesn't span multiple terms in the index. Apart from searching for a known set of e-mail domains, or something of that nature, you would want to re-index using analysis more in line with how you need to search in order to do what you are asking.
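If the original text is still available, either as a stored field in the index or from the source documents, one possible route is to pull the addresses out of the raw text with an ordinary Java regex, and then either use them directly or feed them into a dedicated field when re-indexing. The following is a minimal sketch under that assumption. The pattern is deliberately simple and will miss some valid addresses, and the `stored` string is an invented stand-in for a document's text.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailMiner {
    // Deliberately simple address pattern; it does not cover every valid address.
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Collect distinct addresses in order of first appearance, lowercased.
    static Set<String> extractEmails(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }

    public static void main(String[] args) {
        // Invented sample text standing in for one document's stored content.
        String stored = "Reach aaaaa@hotmail.com or Bob.Smith@example.org. Not an address: foo@bar";
        System.out.println(extractEmails(stored)); // [aaaaa@hotmail.com, bob.smith@example.org]
    }
}
```

Since the job does not need to be near real-time, running something like this once over all documents may be acceptable. Whether the text can be recovered at all depends on whether the field was stored when the index was built.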
