Lucene indexing of HTML files

Dear users, I am working with Apache Lucene for indexing and searching. I have to index HTML files stored on the computer's local disk, on both the file name and the contents of each file. I am able to store the file names in the Lucene index, but not the HTML file contents. The index should cover not only the text but the entire page, including images, links, and URLs. How can I access the contents from the indexed files? For indexing I am using the following code:

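    File indexDir = new File(indexpath);
    File dataDir = new File(datapath);
    String suffix = ".htm";
    // Part of this constructor call was cut off in the post. The directory,
    // create flag and field-length limit below are assumed (Lucene 3.0 style).
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexDirectory(indexWriter, dataDir, suffix);

    numIndexed = indexWriter.maxDoc();
    indexWriter.close(); // commit pending documents and release the index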

private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException {
    try {
        // Walk the directory tree and index every regular file found.
        for (File f : dataDir.listFiles()) {
            if (f.isDirectory()) {
                indexDirectory(indexWriter, f, suffix);
            } else {
                indexFileWithIndexWriter(indexWriter, f, suffix);
            }
        }
    } catch (Exception ex) {
        System.out.println("exception 2 is" + ex);
    }
}

private void indexFileWithIndexWriter(IndexWriter indexWriter, File f,
    String suffix) throws IOException {
    try {
        // Skip anything that is not a readable regular file.
        if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
            return;
        }
        // Skip files that do not have the requested suffix.
        if (suffix != null && !f.getName().endsWith(suffix)) {
            return;
        }
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        // java.io.File has no getFileName(); getName() returns the file name.
        doc.add(new Field("filename", f.getName(),
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    } catch (Exception ex) {
        System.out.println("exception 4 is" + ex);
    }
}

Thanks in advance.

This line of code is the reason why your contents are not being stored:

doc.add(new Field("contents", new FileReader(f)));

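The Field constructor that takes a Reader tokenizes and indexes the text but DOES NOT STORE it. The contents can be searched, but they cannot be read back from the index.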
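
If the text has to be retrievable, one option is to read the Reader into a String first and then use the Field constructor that takes a String together with Field.Store.YES. The sketch below covers only the reading step and uses nothing but the JDK. The class name ReaderText and the method readAll are hypothetical, and the Lucene call is shown as a comment because it assumes the same Lucene 3.x style Field API used elsewhere in this thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderText {

    // Reads everything from the reader into a String and closes the reader.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In the indexing code the argument would be new FileReader(f).
        String contents = readAll(new StringReader("<html><body>Hello</body></html>"));
        // Assumed Lucene 3.x call, not executed here:
        // doc.add(new Field("contents", contents, Field.Store.YES, Field.Index.ANALYZED));
        System.out.println(contents);
    }
}
```

Storing the raw page this way keeps the markup as well, so the original HTML can be read back from a search hit with doc.get("contents"), at the cost of a larger index.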

If you are trying to index HTML files, try using JTidy. It will make the process much easier.

Sample code:

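import java.io.InputStream;

import org.apache.lucene.document.Field;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.w3c.tidy.Tidy;

public class JTidyHTMLHandler {

    // DocumentHandlerException is not defined in this snippet; it is not a Lucene or JTidy class.
    public org.apache.lucene.document.Document getDocument(InputStream is) throws DocumentHandlerException {
        Tidy tidy = new Tidy();
        // Parse the (possibly malformed) HTML into a W3C DOM tree.
        org.w3c.dom.Document root = tidy.parseDOM(is, null);
        Element rawDoc = root.getDocumentElement();

        org.apache.lucene.document.Document doc =
                new org.apache.lucene.document.Document();

        String body = getBody(rawDoc);

        if ((body != null) && (!body.equals(""))) {
            // Field.Store.NO indexes the text without storing it; use
            // Field.Store.YES if the text must be readable from search results.
            doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED));
        }

        return doc;
    }

    // Returns the text of the first <title> element, or "" if there is none.
    protected String getTitle(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }

        String title = "";

        NodeList children = rawDoc.getElementsByTagName("title");
        if (children.getLength() > 0) {
            Element titleElement = ((Element) children.item(0));
            Text text = (Text) titleElement.getFirstChild();
            if (text != null) {
                title = text.getData();
            }
        }
        return title;
    }

    // Returns the text of the first <body> element, or "" if there is none.
    protected String getBody(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }

        String body = "";
        NodeList children = rawDoc.getElementsByTagName("body");
        if (children.getLength() > 0) {
            body = getText(children.item(0));
        }
        return body;
    }

    // Recursively concatenates the text nodes below the given node.
    protected String getText(Node node) {
        NodeList children = node.getChildNodes();
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.ELEMENT_NODE:
                    // Descend into nested elements and separate their text with a space.
                    sb.append(getText(child));
                    sb.append(" ");
                    break;
                case Node.TEXT_NODE:
                    sb.append(((Text) child).getData());
                    break;
            }
        }
        return sb.toString();
    }
}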
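
The handler above indexes only the body text, while the question also asks about links and image URLs. Because JTidy returns a standard org.w3c.dom tree, those values can be collected with getElementsByTagName and getAttribute. The following is a rough sketch under some assumptions: it uses the JDK's own XML parser on a well-formed snippet as a stand-in for tidy.parseDOM so that it runs without extra jars, and the class name LinkExtractor and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class LinkExtractor {

    // Collects the non-empty values of one attribute from every descendant with the given tag name.
    static List<String> attributeValues(Element root, String tag, String attribute) {
        List<String> values = new ArrayList<String>();
        NodeList nodes = root.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            // getAttribute returns "" when the attribute is absent.
            String value = ((Element) nodes.item(i)).getAttribute(attribute);
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    // Parses well-formed XHTML with the JDK parser (stand-in for tidy.parseDOM).
    static Element parse(String xhtml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return dom.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parse(
                "<html><body><a href=\"a.htm\">A</a><img src=\"x.png\"/></body></html>");
        System.out.println(attributeValues(root, "a", "href"));
        System.out.println(attributeValues(root, "img", "src"));
    }
}
```

Each collected value could then be added to the Lucene document as its own stored field, for example a "link" field per href, so that the URLs come back with the search hit.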

To get an InputStream from a URL:

URL url = new URL(htmlURLlocation);
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();

To get an InputStream from a file:

InputStream stream = new FileInputStream(new File(htmlFile));
