lucene indexing of html files
I am using Apache Lucene for indexing and searching. I have to index HTML files stored on my computer's local disk, and I need to index both the file name and the contents of each HTML file. I am able to store the file name in the Lucene index, but not the HTML file contents. The index should cover not just the text but the whole page, including image links and URLs, and I also need to be able to access that content from the index. I am using the following code:
File indexDir = new File(indexpath);
File dataDir = new File(datapath);
String suffix = ".htm";

IndexWriter indexWriter = new IndexWriter(
        FSDirectory.open(indexDir),
        new SimpleAnalyzer(),
        true,
        IndexWriter.MaxFieldLength.LIMITED);
indexWriter.setUseCompoundFile(false);

indexDirectory(indexWriter, dataDir, suffix);

numIndexed = indexWriter.maxDoc();
indexWriter.optimize();
indexWriter.close();
private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix)
        throws IOException {
    try {
        for (File f : dataDir.listFiles()) {
            if (f.isDirectory()) {
                indexDirectory(indexWriter, f, suffix);
            } else {
                indexFileWithIndexWriter(indexWriter, f, suffix);
            }
        }
    } catch (Exception ex) {
        System.out.println("exception 2 is " + ex);
    }
}
private void indexFileWithIndexWriter(IndexWriter indexWriter, File f,
        String suffix) throws IOException {
    try {
        if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
            return;
        }
        if (suffix != null && !f.getName().endsWith(suffix)) {
            return;
        }
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        // java.io.File has no getFileName(); getName() is the correct call
        doc.add(new Field("filename", f.getName(),
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    } catch (Exception ex) {
        System.out.println("exception 4 is " + ex);
    }
}
Thanks in advance.
This line of code is the reason your content is not being stored:

doc.add(new Field("contents", new FileReader(f)));

The Field(String, Reader) constructor tokenizes and indexes the text from the Reader, but it never stores the content in the index.
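To make the content retrievable, the text has to be read into a String first and added with Field.Store.YES. A minimal sketch of the reading step, using only the standard library (the Lucene call is shown as a comment, since it needs the Lucene jar on the classpath; the file name here is illustrative):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class StoredFieldExample {

    // Read the whole file into a String. Unlike a Reader-based Field,
    // a String can be added with Field.Store.YES so it is kept in the index.
    static String readFile(File f) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(f));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Illustrative temp file standing in for one of the .htm files
        File f = File.createTempFile("page", ".htm");
        PrintWriter out = new PrintWriter(f);
        out.print("<html><body>hello</body></html>");
        out.close();

        String contents = readFile(f);
        // With the text in hand, a stored *and* indexed field becomes possible:
        //   doc.add(new Field("contents", contents,
        //           Field.Store.YES, Field.Index.ANALYZED));
        System.out.println(contents.trim());
        f.delete();
    }
}
```

Storing the raw HTML this way also keeps the image links and URLs you mentioned, since nothing is stripped before storage.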
If you are trying to index HTML files, take a look at JTidy. It will make the process much easier.
Sample code:
public class JTidyHTMLHandler {

    public org.apache.lucene.document.Document getDocument(InputStream is)
            throws DocumentHandlerException {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        org.w3c.dom.Document root = tidy.parseDOM(is, null);
        Element rawDoc = root.getDocumentElement();

        org.apache.lucene.document.Document doc =
                new org.apache.lucene.document.Document();

        String body = getBody(rawDoc);
        if ((body != null) && (!body.equals(""))) {
            // Field.Store.NO: the body is searchable but not kept in the index;
            // switch to Field.Store.YES if you need to retrieve the text later.
            doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED));
        }

        String title = getTitle(rawDoc);
        if ((title != null) && (!title.equals(""))) {
            // getTitle() below is otherwise unused; storing the <title> makes it retrievable
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        }

        return doc;
    }
    protected String getTitle(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }
        String title = "";
        NodeList children = rawDoc.getElementsByTagName("title");
        if (children.getLength() > 0) {
            Element titleElement = ((Element) children.item(0));
            Text text = (Text) titleElement.getFirstChild();
            if (text != null) {
                title = text.getData();
            }
        }
        return title;
    }

    protected String getBody(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }
        String body = "";
        NodeList children = rawDoc.getElementsByTagName("body");
        if (children.getLength() > 0) {
            body = getText(children.item(0));
        }
        return body;
    }

    protected String getText(Node node) {
        NodeList children = node.getChildNodes();
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.ELEMENT_NODE:
                    sb.append(getText(child));
                    sb.append(" ");
                    break;
                case Node.TEXT_NODE:
                    sb.append(((Text) child).getData());
                    break;
            }
        }
        return sb.toString();
    }
}
To get an InputStream from a URL:
URL url = new URL(htmlURLlocation);
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();
To get an InputStream from a file:

InputStream stream = new FileInputStream(new File(htmlFile));
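JTidy's job in the handler above is only to turn messy HTML into a standard org.w3c.dom tree; for well-formed XHTML the JDK's built-in XML parser produces the same kind of tree, so the handler's recursive text walk can be tried without any external jars. A small self-contained sketch (the markup string and class name are illustrative):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class ExtractBodyText {

    // Same recursive walk as JTidyHTMLHandler.getText: concatenate all
    // text nodes, inserting a space between element subtrees.
    static String getText(Node node) {
        NodeList children = node.getChildNodes();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                sb.append(getText(child));
                sb.append(" ");
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(((Text) child).getData());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Well-formed XHTML stands in for JTidy's cleaned-up output
        String xhtml = "<html><head><title>t</title></head>"
                     + "<body><p>Hello</p><p>world</p></body></html>";
        DocumentBuilder db =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
        Node body = doc.getElementsByTagName("body").item(0);
        System.out.println(getText(body).trim());
    }
}
```

The string this prints is what would land in the "contents" field, which is why markup-only pages yield an empty body and are skipped by the handler.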