
HBase schema design

Original post: 2013-04-14 23:29:04 | Tags: hadoop, nosql, query-optimization, hbase

I have to design an HBase table to store user information for a social network: age, sex, education, hobbies, books read, countries traveled to, and so on. Note that we may add more attributes in the future; we don't know all of them yet.

For example: name: Olha, age: 25, sex: female, education: bachelor's in information technology, education: master's in computer science, hobby: basketball, hobby: ping pong, book: Gone with the Wind, book: The Da Vinci Code, language: English, language: French, country: Germany.

The main idea is to be able to run queries like: return all people who are female, 22 years old, speak English and French, have read Gone with the Wind, like ping pong and basketball, and are German.

So you can add any criteria to the search query.

What HBase table schema (row key, column families, ...) would you suggest to optimize this kind of search query, taking into consideration that we will add more information in the future? And what is the best way to write such a query (scan, get, MapReduce)?

Thank you.

2 answers

I agree with Ian Varley that Solr/Lucene, with its faceted queries and joins, lets you pivot the data the way you want to see it. However, I also think your question might be a "counting" question or a "membership" question.

It sounds like you are after a list of people who match N attributes, and the problem is that each attribute could have millions of user IDs?

HBase is a good fit when all you are trying to do is compute intersection/union sizes. Your key/value pairs can be put into HBase, and you can "encode" the user IDs into either a Bloom filter (for membership tests) or a HyperLogLog (for approximate counts), trading some accuracy for speed and memory. You would likely build these by running MapReduce-style jobs hourly or nightly over click-streams or log aggregation of some type.
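To make the "encode the IDs into a Bloom filter" idea concrete, here is a minimal sketch in plain Python. It is not HBase client code: the `BloomFilter` class, the `filters` dict, and the `record` / `probably_matches` helpers are all hypothetical names, and the filter size and hash count are arbitrary. The assumption is that each attribute value (for example "language:french") would get one filter, whose bytes could be stored as a single cell value.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits=8192, num_hashes=4):
        # num_bits is assumed to be a multiple of 8 so it maps onto whole bytes.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def to_bytes(self):
        # The serialized form that could be written as one cell value.
        return bytes(self.bits)


# Hypothetical store: attribute value -> filter over the user IDs that have it.
filters = {}


def record(user_id, attributes):
    for attr in attributes:
        filters.setdefault(attr, BloomFilter()).add(user_id)


def probably_matches(user_id, attributes):
    # True means "probably has all attributes"; False is always definite.
    return all(attr in filters and user_id in filters[attr]
               for attr in attributes)


record("u1", ["sex:female", "language:english", "language:french"])
record("u2", ["sex:female", "language:english"])
print(probably_matches("u1", ["sex:female", "language:french"]))
```

The fixed-size byte array is the point: each attribute costs the same amount of storage whether it has a hundred users or millions, at the price of a small false-positive rate.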

Others have done this in the advertising and online space for exactly the type of queries you are running ("find people who like Red Bull and Pop-Tarts and live in Florida").

References

Contextual Advertising using Apache Hive and Amazon EMR: http://aws.amazon.com/articles/2855

Scaling Distributed Counters: http://whynosql.com/scaling-distributed-counters/

Google: Sharding counters: https://developers.google.com/appengine/articles/sharding_counters

Distributed Counter Performance in HBase - Part 1: http://palominodb.com/blog/2012/08/24/distributed-counter-performance-hbase-part-1

Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day: http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html

Realtime Analytics with Hadoop and HBase: http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase

Log Event Processing with HBase: http://tellapart.com/log-event-processing-with-hbase

Clickstream Analytics at BazaarVoice: http://www.slideshare.net/bazaarvoice_engineering/austin-scales-clickstream-analytics

Realtime Analytics with HBase: http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-long-version

This isn't a great use of HBase, in the sense that this is exactly the kind of thing that search indexes (like Lucene) are good for.

One normal schema to store users and their information might look a lot like a relational database, in that you'd have one row per user and store all the attributes as columns and values (age=22, language=french, etc.). This works well for the extensibility you mention (you don't need to change any schema in order to store new attributes). With this schema, you could look up any one user (and all of their attributes) by the unique user ID. That'd be blazingly fast to do, no matter how many users you have.
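As a rough illustration of the one-row-per-user layout, here is an in-memory Python model rather than real HBase client code. The `users` dict, the `put` / `get` helpers, the single `info` column family, and the convention of putting multi-valued attributes into the column qualifier are all assumptions made for the sketch.

```python
# In-memory stand-in for an HBase table: row key -> {"family:qualifier": value}.
users = {}


def put(row_key, column, value):
    """Write one cell; columns are created on first use, no schema change needed."""
    users.setdefault(row_key, {})[column] = value


def get(row_key):
    """Fetch every cell of one row by its key."""
    return users.get(row_key, {})


put("user1", "info:name", "Olha")
put("user1", "info:age", "25")
put("user1", "info:sex", "female")
# Multi-valued attributes: the value goes into the qualifier, the cell stays empty.
put("user1", "info:language:english", "")
put("user1", "info:language:french", "")
# An attribute nobody planned for is just another column.
put("user1", "info:country", "Germany")

print(get("user1"))
```

Looking up `user1` touches exactly one row, which is why this access path stays fast regardless of table size.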

However, with that schema, if you want to search in the way you describe ("return all users whose age is 22"), every single query is going to end up being a scan of the entire table, because HBase only allows you to access things via their primary key (the row key); it does not have secondary indexing of any kind. That will be extremely inefficient (picture having to scan a million rows every time you want to do any single query).
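The cost of an attribute search on a user-keyed table can be sketched with the same kind of in-memory model. The `users` data and the `scan_where` helper below are hypothetical; the function only mimics what a filtered full-table scan has to do, and it counts the rows it examines.

```python
# Toy user-keyed table: row key -> {column: value}.
users = {
    "user1": {"age": "22", "language": "french"},
    "user2": {"age": "25", "language": "english"},
    "user3": {"age": "22", "language": "english"},
}


def scan_where(table, column, value):
    """Return (matching row keys, number of rows examined)."""
    examined = 0
    matches = []
    for row_key in sorted(table):  # a scan walks rows in row-key order
        examined += 1
        if table[row_key].get(column) == value:
            matches.append(row_key)
    return matches, examined


matches, examined = scan_where(users, "age", "22")
print(matches, examined)
```

However selective the condition is, `examined` always equals the number of rows in the table, because nothing in the row key helps narrow the search.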

How to fix this? You could "reverse" the ordering of the data: put the values in the row key and then point to all the users with that value. For example, the row key could be "age:22", and the columns of that row could be all the user IDs that are age 22. This is problematic for a lot of reasons, not least of which is that it will be extremely expensive and tricky to make updates. But it would perform well for those specific queries.
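A minimal sketch of that reversed layout, again as an in-memory Python model and not HBase code. The `index` dict and the `index_user`, `update_attribute`, and `find` helpers are hypothetical names; each dict key plays the role of a row key such as "age:22", and each set plays the role of that row's column qualifiers.

```python
# Row key "attribute:value" -> set of user IDs (the row's column qualifiers).
index = {}


def index_user(user_id, attributes):
    for attr, value in attributes:
        index.setdefault(f"{attr}:{value}", set()).add(user_id)


def update_attribute(user_id, attr, old, new):
    # Every change is two writes to two different rows, with no atomicity
    # across them in this model; that is part of why updates get tricky.
    index.get(f"{attr}:{old}", set()).discard(user_id)
    index.setdefault(f"{attr}:{new}", set()).add(user_id)


def find(criteria):
    """Fetch one row per criterion and intersect the user-ID sets."""
    rows = [index.get(f"{attr}:{value}", set()) for attr, value in criteria]
    return set.intersection(*rows) if rows else set()


index_user("u1", [("age", "22"), ("language", "french"), ("language", "english")])
index_user("u2", [("age", "22"), ("language", "english")])

print(find([("age", "22"), ("language", "french")]))
```

Each criterion becomes one key lookup, so a multi-criteria query costs a few row reads plus a set intersection, though a row for a common value can grow to millions of columns.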

The trick? That's exactly what a search index (like Lucene) does, and it does it much better than you could by rolling your own with HBase. That sounds like the tool you want to be using here.

If you must use HBase (as you say, since it's a research project), it might be worth looking into using HBase and Lucene together; Google that for pointers.