ElasticSearch + Kibana to display business data

I have visitor data captured over the past several years: more than 14 million records. On top of that, I have form data from the past several years. The two share a common ID.

Right now I'm attempting to learn ElasticSearch + Kibana using the visitor data. The data is fairly simple but not very well formatted: it is PHP's $_REQUEST and $_SERVER data. Here's an example from a Googlebot visit:

{u'Entrance Time': 1407551587.7385,
 u'domain': u'############',
 u'pages': {u'6818555600ccd9880bf7acef228c5d47': {u'REQUEST': [],
   u'SERVER': {u'DOCUMENT_ROOT': u'/var/www/####/',
    u'Entrance Time': 1407551587.7385,
    u'GATEWAY_INTERFACE': u'CGI/1.1',
    u'HTTP_ACCEPT': u'*/*',
    u'HTTP_ACCEPT_ENCODING': u'gzip,deflate',
    u'HTTP_CONNECTION': u'Keep-alive',
    u'HTTP_FROM': u'googlebot(at)googlebot.com',
    u'HTTP_HOST': u'############',
    u'HTTP_IF_MODIFIED_SINCE': u'Fri, 13 Jun 2014 20:26:33 GMT',
    u'HTTP_USER_AGENT': u'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    u'PATH': u'/usr/local/bin:/usr/bin:/bin',
    u'PHP_SELF': u'/index.php',
    u'QUERY_STRING': u'',
    u'REDIRECT_SCRIPT_URI': u'http://############/',
    u'REDIRECT_SCRIPT_URL': u'############',
    u'REDIRECT_STATUS': u'200',
    u'REDIRECT_URL': u'############',
    u'REMOTE_ADDR': u'############',
    u'REMOTE_PORT': u'46271',
    u'REQUEST_METHOD': u'GET',
    u'REQUEST_TIME': u'1407551587',
    u'REQUEST_URI': u'############',
    u'SCRIPT_FILENAME': u'/var/www/PIAN/index.php',
    u'SCRIPT_NAME': u'/index.php',
    u'SCRIPT_URI': u'http://############/',
    u'SCRIPT_URL': u'/############/',
    u'SERVER_ADDR': u'############',
    u'SERVER_ADMIN': u'admin@############',
    u'SERVER_NAME': u'############',
    u'SERVER_PORT': u'80',
    u'SERVER_PROTOCOL': u'HTTP/1.1',
    u'SERVER_SIGNATURE': u'<address>Apache/2.2.22 (Ubuntu) Server at ############ Port 80</address>\n',
    u'SERVER_SOFTWARE': u'Apache/2.2.22 (Ubuntu)',
    u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'},
   u'SESSION': {u'Entrance Time': 1407551587.7385,
    u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'}}},
 u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'}

I use the Python package elasticsearch-py as my interface. I create my index like this:

es.indices.create(
    index=Visit_to_ElasticSearch.INDEX,
    body={
        'settings': {
            'number_of_shards': 5,
            'number_of_replicas': 1,
        }
    },
    # ignore already existing index
    ignore=400
)

And this is my mapping:

# Create mappings of a visit
time_date_mapping = { 'type': 'date_time' }
str_not_analyzed = { 'type': 'string'} # This used to include 'index': 'not_analyzed'

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string', 'index': 'not_analyzed' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping,
        'Request Time': time_date_mapping,
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
        'Pages': { 'type': 'string', 'index': 'not_analyzed' },
    },
}

The actual mapping that ES reports:

'visits': {
  'mappings': {
    'visit': {
      'properties': {
        'Agent': {'type': 'string'},
        'Entrance Time': {'format': 'dateOptionalTime', 'type': 'date'},
        'Pages': {'type': 'string'},
        'Raw': {
          'properties': {
            'Entrance Time': {'type': 'double'},
            'domain': {'type': 'string'},
            'uniqID': {'type': 'string'}
          }
        },
        'Referrer': {'type': 'string'},
        'Request Time': {'format': 'dateOptionalTime', 'type': 'date'},
        'Srvr IP': {'type': 'string'},
        'Visitor IP': {'type': 'string'},
        'domain': {'type': 'string'},
        'uniqID': {'type': 'string'}
      }
    }
  }
}

When I dump my trial data into ES and view it in Kibana 4, there are problems. On the Discover tab, it shows me a "Quick Count" of the top 5 Agents with a truncated version of the full string. However, when I create a visualization (Visualize -> Pie Chart -> From a new search -> Split Slices) using Terms as the aggregation and Agent as the field, I get the top 5 as a list of single words: mozilla, 5.0, compatible, http, 2.0.

Kibana warns me that the Agent field is analyzed, despite my telling ES in the mapping not to analyze that field.

I'm brand new to this. Am I wrong to assume that if Agent were not analyzed, the counts would be done on the full Agent string? Replacing spaces with underscores did not fix this. So how do I fix it? Is there a way to put the Agent string into ES such that it is only considered as a whole?

Thank you

Full mapping code can be found at this question.

------- Mapping after cURL --------

I used curl --request PUT 'http://127.0.0.1:9200/visits/_mapping/visit?ignore_conflicts=true' --data '{"visit" : { "properties" : { "Agent" : { "type" : "string", "index" : "not_analyzed" } } } }' to alter the mapping, and this is the new mapping:

{
  "visits" : {
    "mappings" : {
      "visit" : {
        "properties" : {
          "Agent" : {
            "type" : "string",
            "norms" : {
              "enabled" : false
            }
          },
          "Entrance Time" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "Pages" : {
            "type" : "string"
          },
          "Raw" : {
            "properties" : {
              "Entrance Time" : {
                "type" : "double"
              },
              "domain" : {
                "type" : "string"
              },
              "uniqID" : {
                "type" : "string"
              }
            }
          },
          "Referrer" : {
            "type" : "string"
          },
          "Request Time" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "Srvr IP" : {
            "type" : "string"
          },
          "Visitor IP" : {
            "type" : "string"
          },
          "domain" : {
            "type" : "string"
          },
          "uniqID" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

This is the same issue as this other issue, and the reason it doesn't work is that the mapping visit_mapping was never installed via put_mapping. Hence, ES created its own mapping based on what was sent in the visit document.
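To see why an auto-created (analyzed) string field produces single words in a Terms aggregation, here is a rough pure-Python illustration. The function rough_standard_analyzer is a hypothetical stand-in that only approximates what the default analyzer does (lowercase, split into word-like tokens); it is not the real ES tokenizer. The second agent string is made up for the example.

```python
import re
from collections import Counter


def rough_standard_analyzer(text):
    """Hypothetical approximation of the default analyzer:
    lowercase the text and split it into word-like tokens."""
    return re.findall(r"\w+(?:\.\w+)*", text.lower())


# First string comes from the sample visit; the ExampleBot string is invented.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
agents = [
    GOOGLEBOT,
    GOOGLEBOT,
    "Mozilla/5.0 (compatible; ExampleBot/2.0; +http://example.com/bot)",
]

# Analyzed field: every token becomes its own term, so a Terms aggregation
# counts documents per token (each token counted once per document).
analyzed_counts = Counter(
    token for agent in agents for token in set(rough_standard_analyzer(agent))
)

# not_analyzed field: the whole string is a single term.
not_analyzed_counts = Counter(agents)

print(rough_standard_analyzer(GOOGLEBOT))
print(analyzed_counts["mozilla"], analyzed_counts["5.0"])
print(not_analyzed_counts.most_common(1))
```

With the analyzed variant, buckets such as mozilla and 5.0 dominate because nearly every agent contains them. With the whole-string variant, each bucket is one complete user-agent value, which is what the question expects.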

To solve this, simply call put_mapping with your mapping before indexing your first visit document.
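A minimal sketch of that ordering, assuming the ES 1.x-era API used in the question (string fields with 'index': 'not_analyzed' and a document type). The index name visits and type visit are taken from the mapping output in the question. The helper create_index_with_mapping and the trimmed mapping are illustrative, not the asker's full code. The client is passed in, so the function itself does not need a running cluster.

```python
INDEX = 'visits'    # index name as reported in the question's mapping output
DOC_TYPE = 'visit'  # document type as reported in the question's mapping output

# Trimmed, illustrative mapping: only a few of the question's fields.
visit_mapping = {
    'properties': {
        'uniqID': {'type': 'string', 'index': 'not_analyzed'},
        'domain': {'type': 'string', 'index': 'not_analyzed'},
        'Agent': {'type': 'string', 'index': 'not_analyzed'},
    },
}


def create_index_with_mapping(es, index=INDEX, doc_type=DOC_TYPE,
                              mapping=visit_mapping):
    """Create the index, then install the mapping BEFORE any document is
    indexed, so ES does not derive its own mapping from the first document."""
    # ignore=400 tolerates an index that already exists, as in the question
    es.indices.create(index=index, ignore=400)
    es.indices.put_mapping(index=index, doc_type=doc_type,
                           body={doc_type: mapping})


# Usage with a real client (assumes elasticsearch-py and a reachable node):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch()
#   create_index_with_mapping(es)
#   es.index(index=INDEX, doc_type=DOC_TYPE, body=visit_document)
```

If the index already holds documents with an auto-created Agent field, it is likely simpler to delete and recreate the index with this mapping and then re-index the data. Newer Elasticsearch versions replace this with the keyword type and drop document types, so the body shape would differ there.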
