
How to search Russian texts in a Lucene index?

I cannot understand where I went wrong. In my code, "/home/test/03m8894---20070213134234.txt" is a file with English text and "/home/test/01---20061121103506.txt" is a file with Russian text. Both files are encoded in UTF-8. The program prints 1 and then 0, i.e. it finds only the English text and ignores the Russian text. However, if you do

            for (int m = 0; m < totalDocs; m++) {

                Document thisDoc = reader.document(m);
                System.out.print(thisDoc.get("partnum"));

the partnum text field is printed correctly, with no encoding errors in the on-screen output.

        RAMDirectory directory = new RAMDirectory();

        IndexWriter writer =
                //new IndexWriter(directory, new SimpleAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
                new IndexWriter(directory, new RussianAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
        File f1[] = {new File("/home/test/03m8894---20070213134234.txt"), new File("/home/test/01---20061121103506.txt")};

        String strLine1 = "";
        for (int x = 0; x < f1.length; x++) {
            Document doc = new Document();
            int length = (int) f1[x].length();
            if (length != 0) {
                char[] cbuf = new char[length];
                InputStreamReader isr = new InputStreamReader(new FileInputStream(f1[x]));
                final int read = isr.read(cbuf);
                strLine1 = new String(cbuf, 0, read);
                isr.close();
                doc.add(new Field("partnum", strLine1, Field.Store.YES, Field.Index.NOT_ANALYZED));
                //doc.add(new Field("description", "Illidium Space Modulator", Field.Store.YES, Field.Index.ANALYZED));
                writer.addDocument(doc);
            }

        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(directory);
        IndexReader reader = searcher.getIndexReader();
        int totalDocs = reader.numDocs();



        for (int m = 0; m < totalDocs; m++) {

            Document thisDoc = reader.document(m);

            String tmp_str = thisDoc.get("partnum");

            Query query = new TermQuery(new Term("partnum", tmp_str));

            TopDocs rs = searcher.search(query, null, 10);
            System.out.println(rs.totalHits);
        }

You said the files are UTF-8 encoded, but you use:


InputStreamReader isr = new InputStreamReader(new FileInputStream(f1[x]));

This relies on the platform default encoding, which is likely not UTF-8. Try:


InputStreamReader isr = new InputStreamReader(new FileInputStream(f1[x]), "UTF-8");
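For illustration, here is a minimal, self-contained sketch of reading a whole file with an explicit charset, so the result does not depend on the platform default. The class name Utf8FileReader and the method readUtf8 are hypothetical names chosen for this example and do not appear in the question's code; the sketch assumes Java 7 or later for java.nio.file.Files and StandardCharsets.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of the original code.
public class Utf8FileReader {

    // Reads the entire file and decodes its bytes as UTF-8,
    // regardless of the JVM's default charset.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a short Russian string to a temporary file as UTF-8.
        // Unicode escapes keep this source file independent of compiler encoding.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442"; // Russian for "hello"
        Path tmp = Files.createTempFile("russian", ".txt");
        try {
            Files.write(tmp, original.getBytes(StandardCharsets.UTF_8));
            String text = readUtf8(tmp);
            System.out.println(text.equals(original)); // true if decoding round-trips
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Reading all bytes and decoding them in one step also avoids relying on a single read(char[]) call to fill the buffer, which a Reader is not required to do.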

This does not matter, since all the files are in UTF-8, and both the default system encoding and the file encoding are UTF-8...
