What would be the best approach to process 10B rows of data on a daily basis to create variables (calculated columns)?
Imagine you have historical data, and every day a couple of million rows get added to it. The whole dataset needs to be processed daily to update the variables. How would you approach this problem using a Big Data platform?

Happy to provide more details if needed.
Try very hard not to reprocess the whole 10B rows. I don't know exactly what you are looking for in a dataset that large, but there is very likely a statistical model in which you can keep summary information and just reprocess the daily increment against it.
cricket_007 is right though: HDFS and Spark are likely your first tools of choice.
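To illustrate the summary-information idea, here is a minimal plain-Python sketch (not Spark, and not specific to your variables): if a computed column can be expressed from mergeable summaries such as count, sum, and sum of squares, you only ever fold in each day's new rows instead of rescanning all 10B. The key/value shape and names like `update_summaries` are hypothetical, assumed for illustration.

```python
from collections import defaultdict

def update_summaries(summaries, daily_rows):
    """Fold one day's (key, value) rows into running per-key summaries.

    The historical 10B rows never need to be re-read: only the
    summaries and the daily increment are touched.
    """
    for key, value in daily_rows:
        s = summaries[key]
        s["n"] += 1
        s["sum"] += value
        s["sum_sq"] += value * value
    return summaries

def mean_and_variance(s):
    """Derive the calculated columns from the summaries alone."""
    mean = s["sum"] / s["n"]
    var = s["sum_sq"] / s["n"] - mean * mean
    return mean, var

# Running state, persisted between daily runs in a real system.
summaries = defaultdict(lambda: {"n": 0.0, "sum": 0.0, "sum_sq": 0.0})

update_summaries(summaries, [("user_1", 2.0), ("user_1", 4.0)])  # day 1
update_summaries(summaries, [("user_1", 6.0)])                   # day 2 increment
mean, var = mean_and_variance(summaries["user_1"])
# mean = 4.0
```

The same pattern maps directly onto Spark: store the per-key summaries as a table on HDFS, and each day run a job that aggregates only the new rows and merges them into that table. Whether this works depends entirely on whether your variables decompose into such mergeable aggregates; a variable like a median does not, without approximation.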