
How to index crawled "html" from Apache Nutch to Solr?

Original post: 2020-11-19 13:57:16 · Tags: html, indexing, solr, nutch

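I want to index the source code of web pages crawled by Apache Nutch (v1.17) into an index in Solr (8.6.3), but I don't know how to do it. So far I only get a processed version of each page, indexed into the Solr `content` field (see below).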

{
  "tstamp":"2020-11-19T08:41:15.908Z",
  "digest":"fdc7532e799d4a3a434be4be67c36bb3b",
  "boost":1.0,
  .
  .
  .
  "content":"Algorithm Engineering Group ....",
  "_version_":16837969286885539843
}

I have already looked at index-writers.xml, but I still don't know how to do it. Maybe someone here knows how.

1 Answer

The Nutch index tool provides a command-line option to index the raw content of a web page:

$> bin/nutch index
...
-addBinaryContent  index raw/binary content in field `binaryContent`
-base64            use Base64 encoding for binary content
...

Note: watch out for PDFs and other binary formats that the crawler may also fetch!
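Once the index has been built with `-addBinaryContent -base64`, the page source should arrive in Solr as a Base64 string in the `binaryContent` field. The following is a minimal sketch of how a client might turn that value back into HTML. The sample document, the helper name `decode_binary_content`, and the UTF-8 assumption are illustrative and not part of Nutch or Solr.

```python
import base64


def decode_binary_content(doc, field="binaryContent", encoding="utf-8"):
    """Return the raw page source stored Base64-encoded in a Solr document.

    `doc` is a plain dict as parsed from a Solr JSON response. The field name
    comes from the `-addBinaryContent` help text; the encoding is an assumption.
    """
    value = doc[field]
    # A multi-valued Solr field comes back as a list; take the first entry.
    if isinstance(value, list):
        value = value[0]
    raw = base64.b64decode(value)
    # errors="replace" keeps the sketch from failing on pages that are not UTF-8.
    return raw.decode(encoding, errors="replace")


# Hypothetical document, shaped like the JSON in the question plus the extra field.
sample_doc = {
    "content": "Algorithm Engineering Group ....",
    "binaryContent": base64.b64encode(
        b"<html><body>Algorithm Engineering Group</body></html>"
    ).decode("ascii"),
}

html = decode_binary_content(sample_doc)
print(html)
```

Because of the note above, a real client would probably check the document's content type before decoding, since PDFs and other binary files end up in the same field.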

