
What would be the best approach to process 10 B rows of data on a daily basis to create variables (calculated columns)?

Imagine you have a historical dataset, and every day a couple of million rows get added to it. The whole dataset needs to be processed daily to update the variables. How would you approach this problem using a big data platform?

Happy to provide more details if needed.

Try very hard not to reprocess the whole 10B rows. I don't know exactly what you are looking for in a dataset that large, but there is very likely a statistical model for which you can keep summary information and reprocess only the daily increment against it.
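To make the "keep summary information" idea concrete, here is a minimal sketch in plain Python. It assumes the variables are decomposable aggregates (count, mean, variance per key), which may not hold for every calculated column. The names `Summary`, `summarize`, and `apply_daily_increment` are hypothetical, made up for this illustration. The stored summaries are merged with a summary of the new day's rows only, so the historical rows are never rescanned.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Summary:
    """Mergeable running summary for one key (hypothetical helper)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, x: float) -> None:
        # Welford's online update for a single new value.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Summary") -> "Summary":
        # Combine two summaries without revisiting the underlying rows.
        if other.count == 0:
            return Summary(self.count, self.mean, self.m2)
        if self.count == 0:
            return Summary(other.count, other.mean, other.m2)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.count if self.count else 0.0


def summarize(rows: Iterable[Tuple[str, float]]) -> Dict[str, Summary]:
    """Build per-key summaries from (key, value) rows."""
    out: Dict[str, Summary] = {}
    for key, value in rows:
        out.setdefault(key, Summary()).add(value)
    return out


def apply_daily_increment(
    stored: Dict[str, Summary], new_rows: Iterable[Tuple[str, float]]
) -> Dict[str, Summary]:
    """Merge today's rows into the stored summaries; history is not rescanned."""
    updated = dict(stored)
    for key, daily in summarize(new_rows).items():
        updated[key] = updated.get(key, Summary()).merge(daily)
    return updated


# Usage: summarize the history once, then fold in each day's new rows.
history = [("a", 1.0), ("a", 2.0), ("b", 10.0)]
today = [("a", 3.0), ("b", 20.0), ("c", 5.0)]
stored = summarize(history)
stored = apply_daily_increment(stored, today)
```

The same merge step maps naturally onto a distributed aggregation, where each partition produces a partial summary and the partials are combined. Variables that cannot be expressed as mergeable summaries (exact medians, for example) would need a different approach or an approximation.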

cricket_007 is right, though: HDFS and Spark are likely your first tools of choice.


