
Lucene index HTML File with Identifiers

First of all, I want to apologize for my poor English. I have an HTML file named cacm.html containing many documents, and each document is structured like this:

.I indicates article identifier
.T indicates article title
.A indicates article authors
.W indicates article resume
.X indicates article references

Here's an example of an article:

.I 20
.T
Accelerating Convergence of Iterative Processes
.W
A technique is discussed which, when applied
to an iterative procedure for the solution of
an equation, accelerates the rate of convergence if
the iteration converges and induces convergence if
the iteration diverges.  An illustrative example is given.
.B
CACM June, 1958
.A
Wegstein, J. H.
.N
CA580602 JB March 22, 1978  9:09 PM
.X
20  5   20
20  5   20
20  5   20

I wrote this code:

    //IMPORTS

public class in {
    public static void main(String[] args) throws IOException{  
    Path p = Paths.get("C:\\Users\\pc\\Desktop\\indexationeclipc", args);       
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory directory = FSDirectory.open(p);
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);

    BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\pc\\Desktop\\index\\cacm.htm"));

    boolean lire=false;

    String[] fields = new String[5];
    for (int i = 0; i < fields.length; i++) {
        fields[i] = "";
    }
    int fieldno = 0;

    String line=br.readLine();
    while(line!=null) {
        if(line.startsWith(".I")) {


            String[] parts = line.split(" ");
            fields[0] = parts[1];
            fieldno = 0;



            if (!fields[0].equals("")) {
               Document doc=new Document();
               Field I= new StringField("I",fields[0],Field.Store.YES);
               doc.add(I);

               Field T= new StringField("T",fields[1],Field.Store.YES);
               doc.add(T);

               Field A= new StringField("A",fields[2],Field.Store.YES);
               doc.add(A);

               Field W= new TextField("W",fields[3],Field.Store.YES);
               doc.add(W);

               Field X= new TextField("X",fields[4],Field.Store.YES);
               doc.add(X);

               iwriter.addDocument(doc);

            }

             for (int i = 0; i < fields.length; i++) {
                 fields[i] = "";
             }


        }


        else if(line.startsWith(".T")) {
            lire=true;
            fieldno = 1;

        }

        else if(line.startsWith(".A")) {
            lire=true;
            fieldno = 2;
        }

        else if(line.startsWith(".W")) {
            lire=true;
            fieldno = 3;
        }

        else if(line.startsWith(".X")) {
            lire=true;
            fieldno = 4;
        }

        else if(line.startsWith(".")) {
            lire=false;
        }

        if((fieldno > 0) && (fieldno < 5)) {
            if(lire==true) {
            if (line.length() > 2) {
                fields[fieldno] += " " + line;
            }}
        }

        line = br.readLine();       

    }


    br.close();
    iwriter.close();


    }       
}

But the indexing doesn't finish properly: it stops before all the terms are indexed, the same word gets indexed something like a thousand times, and sometimes it indexes WHOLE phrases instead of single terms :( x))

This does not seem to be an indexing or Lucene problem. I think the problem is in how you read and split the data.
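
For instance, one way to restructure the reading loop (only a sketch under my own assumptions: the class name CacmIndexer and the helper addArticle are mine, and HTML markup in the file is not handled here) is to buffer the fields of the current article and create the Document only when the next .I marker, or the end of the file, is reached. That way the identifier ends up in the same document as the title, authors, abstract and references that follow it, and the last article is not lost:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.Arrays;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class CacmIndexer {

        // Flush the buffered fields of one article as a single Lucene Document.
        static void addArticle(IndexWriter writer, String[] fields) throws IOException {
            if (fields[0].isEmpty()) {
                return;                                        // nothing buffered yet
            }
            Document doc = new Document();
            doc.add(new StringField("I", fields[0], Field.Store.YES)); // identifier: kept as one token
            doc.add(new TextField("T", fields[1], Field.Store.YES));   // TextField so the title is tokenized
            doc.add(new TextField("A", fields[2], Field.Store.YES));
            doc.add(new TextField("W", fields[3], Field.Store.YES));
            doc.add(new TextField("X", fields[4], Field.Store.YES));
            writer.addDocument(doc);
        }

        public static void main(String[] args) throws IOException {
            try (IndexWriter iwriter = new IndexWriter(
                         FSDirectory.open(Paths.get("C:\\Users\\pc\\Desktop\\indexationeclipc")),
                         new IndexWriterConfig(new StandardAnalyzer()));
                 BufferedReader br = new BufferedReader(
                         new FileReader("C:\\Users\\pc\\Desktop\\index\\cacm.htm"))) {

                String[] fields = {"", "", "", "", ""};
                int fieldno = -1;                              // -1 = currently ignoring lines (.B, .N, ...)

                for (String line = br.readLine(); line != null; line = br.readLine()) {
                    if (line.startsWith(".I")) {
                        addArticle(iwriter, fields);           // flush the PREVIOUS article first
                        Arrays.fill(fields, "");
                        fields[0] = line.split("\\s+")[1];     // then remember the new identifier
                        fieldno = -1;
                    } else if (line.startsWith(".T")) { fieldno = 1; }
                    else if (line.startsWith(".A"))   { fieldno = 2; }
                    else if (line.startsWith(".W"))   { fieldno = 3; }
                    else if (line.startsWith(".X"))   { fieldno = 4; }
                    else if (line.startsWith("."))    { fieldno = -1; }  // .B, .N, etc. are skipped
                    else if (fieldno > 0)             { fields[fieldno] += " " + line; }
                }
                addArticle(iwriter, fields);                   // don't forget the last article
            }
        }
    }

Note that in your version the new identifier is written into a document together with the fields of the previous article, and fields indexed as StringField are kept as one single token, which would explain the "whole phrases" you see in the index.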

I would recommend using Apache Tika for HTML data extraction. It's pretty useful. See https://tika.apache.org/
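
For illustration, here is a minimal sketch of pulling the plain text out of the HTML file with Tika's facade class (the path is the one from the question; splitting the extracted text into articles is still up to you):

    import java.io.File;
    import java.io.IOException;

    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public class TikaExtract {
        public static void main(String[] args) throws IOException, TikaException {
            // Detect the file type and extract its plain text; HTML markup is stripped by Tika.
            Tika tika = new Tika();
            String text = tika.parseToString(new File("C:\\Users\\pc\\Desktop\\index\\cacm.htm"));
            System.out.println(text);
        }
    }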
