
Lucene index HTML File with Identifiers

First of all, I want to apologize for my poor English. I have an HTML file named cacm.html containing many documents, and each document is structured like this:

.I indicates article identifier
.T indicates article title
.A indicates article authors
.W indicates article resume
.X indicates article references

Here's an example of an article:

.I 20
.T
Accelerating Convergence of Iterative Processes
.W
A technique is discussed which, when applied
to an iterative procedure for the solution of
an equation, accelerates the rate of convergence if
the iteration converges and induces convergence if
the iteration diverges.  An illustrative example is given.
.B
CACM June, 1958
.A
Wegstein, J. H.
.N
CA580602 JB March 22, 1978  9:09 PM
.X
20  5   20
20  5   20
20  5   20

I wrote this code:

    //IMPORTS

public class in {
    public static void main(String[] args) throws IOException{  
    Path p = Paths.get("C:\\Users\\pc\\Desktop\\indexationeclipc", args);       
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory directory = FSDirectory.open(p);
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);

    BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\pc\\Desktop\\index\\cacm.htm"));

    boolean lire=false;

    String[] fields = new String[5];
    for (int i = 0; i < fields.length; i++) {
        fields[i] = "";
    }
    int fieldno = 0;

    String line=br.readLine();
    while(line!=null) {
        if(line.startsWith(".I")) {


            String[] parts = line.split(" ");
            fields[0] = parts[1];
            fieldno = 0;



            if (!fields[0].equals("")) {
               Document doc=new Document();
               Field I= new StringField("I",fields[0],Field.Store.YES);
               doc.add(I);

               Field T= new StringField("T",fields[1],Field.Store.YES);
               doc.add(T);

               Field A= new StringField("A",fields[2],Field.Store.YES);
               doc.add(A);

               Field W= new TextField("W",fields[3],Field.Store.YES);
               doc.add(W);

               Field X= new TextField("X",fields[4],Field.Store.YES);
               doc.add(X);

               iwriter.addDocument(doc);

            }

             for (int i = 0; i < fields.length; i++) {
                 fields[i] = "";
             }


        }


        else if(line.startsWith(".T")) {
            lire=true;
            fieldno = 1;

        }

        else if(line.startsWith(".A")) {
            lire=true;
            fieldno = 2;
        }

        else if(line.startsWith(".W")) {
            lire=true;
            fieldno = 3;
        }

        else if(line.startsWith(".X")) {
            lire=true;
            fieldno = 4;
        }

        else if(line.startsWith(".")) {
            lire=false;
        }

        if((fieldno > 0) && (fieldno < 5)) {
            if(lire==true) {
            if (line.length() > 2) {
                fields[fieldno] += " " + line;
            }}
        }

        line = br.readLine();       

    }


    br.close();
    iwriter.close();


    }       
}

But the indexing doesn't finish properly: it stops before all the terms are indexed, the same word gets indexed something like a thousand times, and sometimes it indexes WHOLE phrases instead of single terms :( x))

This does not seem to be an indexing or Lucene problem. I think the problem is in how you read and split the data.
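
For instance, one way to restructure the reading loop (only a sketch under my own assumptions: the class name CacmIndexer and the helper addArticle are mine, and HTML markup in the file is not handled here) is to buffer the fields of the current article and create the Document only when the next .I marker, or the end of the file, is reached. That way the identifier ends up in the same document as the title, authors, abstract and references that follow it, and the last article is not lost:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.Arrays;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class CacmIndexer {

        // Flush the buffered fields of one article as a single Lucene Document.
        static void addArticle(IndexWriter writer, String[] fields) throws IOException {
            if (fields[0].isEmpty()) {
                return;                                        // nothing buffered yet
            }
            Document doc = new Document();
            doc.add(new StringField("I", fields[0], Field.Store.YES)); // identifier: kept as one token
            doc.add(new TextField("T", fields[1], Field.Store.YES));   // TextField so the title is tokenized
            doc.add(new TextField("A", fields[2], Field.Store.YES));
            doc.add(new TextField("W", fields[3], Field.Store.YES));
            doc.add(new TextField("X", fields[4], Field.Store.YES));
            writer.addDocument(doc);
        }

        public static void main(String[] args) throws IOException {
            try (IndexWriter iwriter = new IndexWriter(
                         FSDirectory.open(Paths.get("C:\\Users\\pc\\Desktop\\indexationeclipc")),
                         new IndexWriterConfig(new StandardAnalyzer()));
                 BufferedReader br = new BufferedReader(
                         new FileReader("C:\\Users\\pc\\Desktop\\index\\cacm.htm"))) {

                String[] fields = {"", "", "", "", ""};
                int fieldno = -1;                              // -1 = currently ignoring lines (.B, .N, ...)

                for (String line = br.readLine(); line != null; line = br.readLine()) {
                    if (line.startsWith(".I")) {
                        addArticle(iwriter, fields);           // flush the PREVIOUS article first
                        Arrays.fill(fields, "");
                        fields[0] = line.split("\\s+")[1];     // then remember the new identifier
                        fieldno = -1;
                    } else if (line.startsWith(".T")) { fieldno = 1; }
                    else if (line.startsWith(".A"))   { fieldno = 2; }
                    else if (line.startsWith(".W"))   { fieldno = 3; }
                    else if (line.startsWith(".X"))   { fieldno = 4; }
                    else if (line.startsWith("."))    { fieldno = -1; }  // .B, .N, etc. are skipped
                    else if (fieldno > 0)             { fields[fieldno] += " " + line; }
                }
                addArticle(iwriter, fields);                   // don't forget the last article
            }
        }
    }

Note that in your version the new identifier is written into a document together with the fields of the previous article, and fields indexed as StringField are kept as one single token, which would explain the "whole phrases" you see in the index.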

I would recommend using Apache Tika for HTML data extraction. It's pretty useful. See https://tika.apache.org/
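
For illustration, here is a minimal sketch of pulling the plain text out of the HTML file with Tika's facade class (the path is the one from the question; splitting the extracted text into articles is still up to you):

    import java.io.File;
    import java.io.IOException;

    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public class TikaExtract {
        public static void main(String[] args) throws IOException, TikaException {
            // Detect the file type and extract its plain text; HTML markup is stripped by Tika.
            Tika tika = new Tika();
            String text = tika.parseToString(new File("C:\\Users\\pc\\Desktop\\index\\cacm.htm"));
            System.out.println(text);
        }
    }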
