
What's the best way to store event data in Redshift?

I'm new to Redshift and am looking for the best way to store event data. The data consists of an identifier, a timestamp, and JSON metadata about the current state.

I'm considering three approaches:

  1. Create a table for each event type, with a column for each piece of data.
  2. Create a single table for events and store the metadata as a JSON field.
  3. Create a single table with a column for every possible piece of data I might want to store.

The advantage of #1 is that I can filter on all data fields, and the solution is more flexible. The disadvantage is that every time I want to add a new event type, I have to create a new table.

The advantage of #2 is that I can put all types of events into a single table. The disadvantage is that to filter on anything in the metadata, I would need to run a JSON function on every row.

The advantage of #3 is that I can access all the fields directly without running a function, and I don't have to create a new table for each event type. The disadvantage is that whoever uses the data needs to remember which columns to ignore.

Is one of these approaches better than the others, or am I missing something entirely?

This is a classic dilemma. After thinking about it for a while, at my company we ended up keeping the common properties of events in separate columns and the unique properties in a JSON field. Examples of common properties:

  • event type, timestamp (every event has them)
  • URL (missing for backend and mobile app events, but present for all frontend events and worth having in a separate column)
  • client properties: device, browser, OS (missing for backend events, but present for mobile app and frontend events)

Examples of unique properties (no other event has them):

  • test name and variant in an A/B test event
  • product name or ID in a purchase event

Where to draw the line between a common and a unique property is your own judgement call, based on how many events share the property and how often it will be used in analytics queries to filter or group data. If a property is just "nice to have" and is not involved in regular analysis use cases (yes, we all love to store anything trackable just in case), the burden of maintaining a separate column is overkill.
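The split described above could be applied at load time roughly as follows. This is only a minimal sketch in Python, not our actual pipeline: the column names in `COMMON_COLUMNS`, the `event_json` field name, and the `split_event` helper are all hypothetical and just illustrate the idea.

```python
import json

# Hypothetical set of "common" properties promoted to real columns.
# Which properties belong here is a judgement call.
COMMON_COLUMNS = ("event_type", "timestamp", "url", "device", "browser", "os")


def split_event(event):
    """Split a raw event dict into a flat row: common columns plus one JSON string."""
    # Common properties become columns; missing ones are stored as NULL (None).
    row = {col: event.get(col) for col in COMMON_COLUMNS}
    # Everything else is event-specific and goes into the JSON field.
    unique = {k: v for k, v in event.items() if k not in COMMON_COLUMNS}
    row["event_json"] = json.dumps(unique, separators=(",", ":"))
    return row


backend_purchase = {
    "event_type": "Purchase",
    "timestamp": "2024-01-01 12:00:00",
    "product_id": 42,
}
row = split_event(backend_purchase)
print(row["url"])         # None: backend events have no URL
print(row["event_json"])  # {"product_id":42}
```

Moving a property from "unique" to "common" later then only means adding its name to the tuple and adding the column to the table.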

Also, if you have a unique property that you use extensively in queries, there is a hacky way to optimize. You can place this property at the beginning of your JSON column (yes, JSON keys are unordered in Python, but in Redshift the column is just a string, so the key order can be fixed if you want) and use LIKE with a wildcard only at the end of the pattern:

select * 
from event_table
where event_type='Start experiment'
and event_json like '{"test_name":"my_awesome_test"%'  -- instead of below
-- and json_extract_path_text(event_json,'test_name')='my_awesome_test'

LIKE used this way is much faster than a JSON lookup (2-3x faster), because for each row it does not need to decode the JSON, find the key, and check the value. It only checks whether the string starts with a given substring, which is a much cheaper operation.
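For the prefix match to work, whatever writes the JSON has to put the hot key first and serialize without extra whitespace. A minimal sketch, assuming the events are serialized in Python 3.7+ (where dicts keep insertion order); `dumps_key_first` and `like_prefix` are hypothetical helper names, not part of any library:

```python
import json


def dumps_key_first(props, first_key):
    """Serialize props as compact JSON with first_key as the leading key."""
    # Python 3.7+ dicts preserve insertion order, so build the dict in the order we want.
    ordered = {first_key: props[first_key]}
    ordered.update((k, v) for k, v in props.items() if k != first_key)
    # Compact separators so the stored string has no spaces after ':' or ','.
    return json.dumps(ordered, separators=(",", ":"))


def like_prefix(first_key, value):
    """Build a LIKE pattern matching JSON strings that start with first_key: value."""
    # Serialize a one-key object, drop the closing brace, then append the wildcard.
    # Caveat: '_' and '%' inside the key or value act as LIKE wildcards; they are not escaped here.
    return json.dumps({first_key: value}, separators=(",", ":"))[:-1] + "%"


stored = dumps_key_first({"variant": "B", "test_name": "my_awesome_test"}, "test_name")
pattern = like_prefix("test_name", "my_awesome_test")
print(stored)   # {"test_name":"my_awesome_test","variant":"B"}
print(pattern)  # {"test_name":"my_awesome_test"%
```

Because `_` matches any single character in LIKE, the pattern is slightly looser than an exact comparison; if that matters, the JSON lookup can be kept as a second condition on the already reduced set of rows.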


 