What would be the best approach to process 10B rows of data on a daily basis to create variables (calculated columns)?
Imagine you have historical data, and every day a couple of million rows get added to it. The whole dataset needs to be processed daily to update the variables. How would you approach this problem using a Big Data platform?

Happy to provide more details if needed.
Try very hard not to reprocess the whole 10B rows. I don't know exactly what you are looking for in a dataset that large, but there is very likely a statistical model in which you can keep summary information and just reprocess the daily increment against it.
cricket_007 is right though: HDFS and Spark are likely your first tools of choice.
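To illustrate the summary-information idea, here is a minimal plain-Python sketch (not Spark, and not specific to your variables): if a computed column can be expressed from mergeable summaries such as count, sum, and sum of squares, you only ever fold in each day's new rows instead of rescanning all 10B. The key/value shape and names like `update_summaries` are hypothetical, assumed for illustration.

```python
from collections import defaultdict

def update_summaries(summaries, daily_rows):
    """Fold one day's (key, value) rows into running per-key summaries.

    The historical 10B rows never need to be re-read: only the
    summaries and the daily increment are touched.
    """
    for key, value in daily_rows:
        s = summaries[key]
        s["n"] += 1
        s["sum"] += value
        s["sum_sq"] += value * value
    return summaries

def mean_and_variance(s):
    """Derive the calculated columns from the summaries alone."""
    mean = s["sum"] / s["n"]
    var = s["sum_sq"] / s["n"] - mean * mean
    return mean, var

# Running state, persisted between daily runs in a real system.
summaries = defaultdict(lambda: {"n": 0.0, "sum": 0.0, "sum_sq": 0.0})

update_summaries(summaries, [("user_1", 2.0), ("user_1", 4.0)])  # day 1
update_summaries(summaries, [("user_1", 6.0)])                   # day 2 increment
mean, var = mean_and_variance(summaries["user_1"])
# mean = 4.0
```

The same pattern maps directly onto Spark: store the per-key summaries as a table on HDFS, and each day run a job that aggregates only the new rows and merges them into that table. Whether this works depends entirely on whether your variables decompose into such mergeable aggregates; a variable like a median does not, without approximation.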