Indexing content of web URLs into Elasticsearch/Kibana
I have scraped 500+ links/sublinks of a website using Beautiful Soup + Python. Now I would like to index all the content/text of those URLs into Elasticsearch. Is there any tool that can help me index it directly into the Elasticsearch/Kibana stack?
Please help me with some pointers. I tried searching on Google and found Logstash, but it seems to work only for a single URL.
For reference on Logstash, please see: https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html
Otherwise, if you put your crawler output into a file with one line per URL, you could use the Logstash config below. In this example, Logstash reads each line as one message and sends it to the Elasticsearch servers on host1 and host2.
input {
  file {
    path => "/an/absolute/path" # the path has to be absolute
    start_position => "beginning"
  }
}
output {
  elasticsearch {
    hosts => ["host1:port1", "host2:port2"] # most of the time the host is a DNS name (localhost at its simplest); the default port is 9200
    index => "my_crawler_urls"
    workers => 4 # tune depending on your available resources/expected performance
  }
}
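The crawler side of this pipeline (one line per URL) could be sketched as below. This is a minimal, hypothetical illustration: it uses Python's stdlib html.parser rather than Beautiful Soup to stay self-contained, and the URLs, output path, and helper names are made up for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text chunks of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def page_to_line(url, html):
    """Flatten one crawled page into a single line: URL, a tab, then its text."""
    parser = TextExtractor()
    parser.feed(html)
    return url + "\t" + " ".join(parser.chunks)

# Hypothetical crawl results; in practice this would come from your
# Beautiful Soup crawler.
pages = {
    "http://example.com/a": "<html><body><h1>Title A</h1><p>Body A</p></body></html>",
    "http://example.com/b": "<html><body><p>Body B</p></body></html>",
}

# One line per URL, ready for Logstash's file input to pick up.
with open("/tmp/crawl_output.txt", "w") as out:
    for url, html in pages.items():
        out.write(page_to_line(url, html) + "\n")
```

Pointing the `path` in the Logstash config above at this output file would then ship one event per crawled URL.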
Now, of course, you might want to filter or post-process your crawler's output, and for that Logstash gives you codecs and/or filters.
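For instance, assuming each line has the hypothetical shape "url&lt;TAB&gt;page text" as in the sketch above, a grok filter could split it into separate fields before indexing:

filter {
  grok {
    match => { "message" => "%{NOTSPACE:url}\t%{GREEDYDATA:page_text}" }
  }
}

The resulting `url` and `page_text` fields would then be searchable individually in Kibana.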