How to format the TSV file in Druid

I am trying to load a TSV file into Druid using this ingestion spec:

Most recent version of the spec:

{
"type" : "index",
"spec" : {
    "ioConfig" : {
        "type" : "index",
        "inputSpec" : {
            "type": "local",
            "baseDir": "quickstart",
            "filter": "test_data.json"
        }
    },
    "dataSchema" : {
        "dataSource" : "local",
        "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "hour",
            "queryGranularity" : "none",
            "intervals" : ["2016-07-18/2016-07-22"]
        },
        "parser" : {
            "type" : "string",
            "parseSpec" : {
                "format" : "json",
                "dimensionsSpec" : {
                    "dimensions" : ["name", "email", "age"]
                },
                "timestampSpec" : {
                    "format" : "yyyy-MM-dd HH:mm:ss",
                     "column" : "date"
                }
            }
        },
        "metricsSpec" : [
            {
                "name" : "count",
                "type" : "count"
            },
            {
              "type" : "doubleSum",
              "name" : "age",
              "fieldName" : "age"
            }
        ]
    }
}
}

If my schema looks like this:

Schema: name    email    age

And the actual dataset looks like this:

name    email    age
Bob    Jones    23
Billy    Jones    45

Is this how the columns should be formatted in the above dataset for a TSV? That is, should name, email and age (the column names) come first, followed by the actual data? I am confused about how Druid will know how to map the columns to the actual dataset in TSV format.

TSV stands for tab-separated values, so it looks the same as CSV except that you use tabs instead of commas, e.g.:

Name<TAB>Age<TAB>Address
Paul<TAB>23<TAB>1115 W Franklin
Bessy the Cow<TAB>5<TAB>Big Farm Way
Zeke<TAB>45<TAB>W Main St

You use the first line as a header to define your column names, so you can use "name", "age" or "email" as dimensions in your spec file.

As for GMT and UTC, they are basically the same:

There is no time difference between Greenwich Mean Time and Coordinated Universal Time.

The first is a time zone; the other is a time standard.

By the way, don't forget to include a column with a time value in your TSV file!

So, for example, if you have a TSV file that looks like this:

"name"  "position"  "office"    "age"   "start_date"    "salary"
"Airi Satou"    "Accountant"    "Tokyo" "33"    "2016-07-16T19:20:30+01:00" "162700"
"Angelica Ramos"    "Chief Executive Officer (CEO)" "London"    "47"    "2016-07-16T19:20:30+01:00" "1200000"

your spec file should look like this:

{
    "spec" : {
        "ioConfig" : {
            "inputSpec" : {
                "type": "local",
                "baseDir": "path_to_folder",
                "filter": "name_of_the_file(s)"
            }
        },
        "dataSchema" : {
            "dataSource" : "local",
            "granularitySpec" : {
                "type" : "uniform",
                "segmentGranularity" : "hour",
                "queryGranularity" : "none",
                "intervals" : ["2016-07-01/2016-07-28"]
            },
            "parser" : {
                "type" : "string",
                "parseSpec" : {
                    "format" : "tsv",
                    "dimensionsSpec" : {
                        "dimensions" : [
                            "position",
                            "age",
                            "office"
                        ]
                    },
                    "timestampSpec" : {
                        "format" : "auto",
                         "column" : "start_date"
                    }
                }
            },
            "metricsSpec" : [
                {
                    "name" : "count",
                    "type" : "count"
                },
                {
                    "name" : "sum_salary",
                    "type" : "longSum",
                    "fieldName" : "salary"
                }
            ]
        }
    }
}
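
One caveat about the header line: as far as I can tell from the Druid data formats documentation, the `tsv` parseSpec takes a `columns` list that names the fields in file order, and newer releases document a `hasHeaderRow` flag for reading the names from the first line instead. Which of the two applies depends on your Druid version, so treat the following as a sketch under that assumption rather than a verified spec. It is the same parseSpec as above with `columns` spelled out for the sample file:

```json
{
    "parseSpec" : {
        "format" : "tsv",
        "columns" : ["name", "position", "office", "age", "start_date", "salary"],
        "dimensionsSpec" : {
            "dimensions" : ["position", "age", "office"]
        },
        "timestampSpec" : {
            "format" : "auto",
            "column" : "start_date"
        }
    }
}
```

With `columns` listed this way, each tab-separated field is mapped to a name by its position in the row, which is what answers the mapping question. The names used in `dimensions`, `timestampSpec` and `metricsSpec` have to match entries in that list.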
