Indexing content of a web URL into Elasticsearch/Kibana

I have scraped 500+ links/sublinks of a website using Beautiful Soup + Python. Now I am looking to index all the content/text of these URLs in Elasticsearch. Is there any tool that can help me index directly into the Elasticsearch/Kibana stack?

Please help me with pointers. I tried searching on Google and found Logstash, but it seems it only works for a single URL.

For reference on Logstash, please see: https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html

Otherwise, as an example of putting your crawler output into a file, with one line per URL, you could use the Logstash config below. In this example, Logstash will read each line as a message and send it to the Elasticsearch servers on host1 and host2.

input {
    file {
        path => "/an/absolute/path" # the path has to be absolute
        start_position => "beginning"
    }
}

output {
    elasticsearch {
        hosts => ["host1:port1", "host2:port2"] # host is usually a DNS name (localhost as the most basic one); the default port is 9200
        index => "my_crawler_urls"
        workers => 4 # tune depending on your available resources/expected performance
    }
}
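Assuming you save this config as crawler-urls.conf (a hypothetical filename), you can start the pipeline with bin/logstash -f crawler-urls.conf from your Logstash installation directory. Once documents start arriving, the my_crawler_urls index can be explored in Kibana by creating an index pattern for it.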

Now of course, you might want to do some filtering or post-treatment of your crawler's output, and for that Logstash gives you the possibility with codecs and/or filters.
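As a minimal sketch of such a filter, assuming (hypothetically) that your crawler writes each page as a JSON line with url and text fields, you could parse each line with the json filter and tidy the result with mutate:

filter {
    json {
        source => "message" # parse the raw line written by the crawler as JSON
    }
    mutate {
        strip => ["text"]           # trim surrounding whitespace from the page text
        remove_field => ["message"] # drop the raw line once it has been parsed
    }
}

This block would sit between the input and output sections of the config above, so every event is parsed and cleaned before it is sent to Elasticsearch.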
