Migrating A Java Application to Hadoop: Architecture/Design Roadblocks?

Alright, so here's the situation: I am responsible for architecting the migration of a Java-based ETL (EAI, rather) software. I'll have to migrate this to Hadoop (the Apache version). Now, technically this is more like a reboot than a migration, because I've got no database to migrate. This is about leveraging Hadoop so that the Transformation phase (the 'T' of ETL) is parallelized. This would make my ETL software:

  1. Faster - with the transformation parallelized.
  2. Scalable - handling more data / big data is just a matter of adding more nodes.
  3. Reliable - Hadoop's redundancy and reliability will add to my product's feature set.

I've tested this configuration out: I changed my transformation algorithms into a MapReduce model (a rough sketch of such a job is shown at the end of this question), tested it on a high-end Hadoop cluster and benchmarked the performance. Now, I'm trying to understand and document all the things that could stand in the way of this application redesign / re-architecture / migration. Here are a few I could think of:

  1. The other two phases, Extraction and Load: my ETL tool can handle a variety of data sources. So, do I redesign my data adapters to read data from these data sources, load it into HDFS, then transform it and load it into the target data source? Could this step become a huge bottleneck for the entire architecture?
  2. Feedback: say my transformation fails on a record - how do I let the end user know that the ETL hit an error on that particular record? In short, how do I keep track of what is actually going on at the application level with all the maps/reduces/merges/sorts happening? The default Hadoop web interface is not for the end user - it's for admins. So should I build a new web app that scrapes the Hadoop web interface? (I know this is not recommended.) One possible approach using counters is sketched right after this list.
  3. Security: how do I handle authorization at the Hadoop level? Who can run jobs, and who is not allowed to run them - how do I support ACLs?
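
To make the feedback question concrete, here is a minimal sketch of one possible approach using Hadoop's built-in Counters, which the driver can read back after the job finishes and report at the application level. The counter names and the transform() helper are hypothetical, not part of the actual product:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper that counts good and bad records via Hadoop Counters. After the job
    // completes, the driver can call job.getCounters() and show the totals to the
    // end user (e.g. "3 of 1,000,000 records failed transformation").
    public class FeedbackMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        // Counter names are arbitrary; they just have to be stable so the driver
        // (or a small status web app) knows what to look up.
        enum EtlCounters { RECORDS_OK, RECORDS_FAILED }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                String transformed = transform(value.toString()); // placeholder transform
                context.write(NullWritable.get(), new Text(transformed));
                context.getCounter(EtlCounters.RECORDS_OK).increment(1);
            } catch (RuntimeException e) {
                // Count the failure instead of failing the whole task; the bad record
                // could additionally be written to a side output (see the answer below).
                context.getCounter(EtlCounters.RECORDS_FAILED).increment(1);
            }
        }

        private String transform(String record) {
            return record.trim(); // stands in for the real transformation logic
        }
    }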

I look forward to hearing from you with possible answers to the above questions, as well as any other questions/facts I'd need to consider, based on your experience with Hadoop and problem analysis. As always, I appreciate your help and thank you all in advance.
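
For reference, here is a stripped-down sketch of what the parallelized transformation could look like as a map-only MapReduce job. This is not the actual implementation; the upper-casing "transform" and all class names are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TransformJob {

        // Each mapper transforms its share of the records independently, which is
        // what parallelizes the 'T' of ETL across the cluster.
        public static class TransformMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String transformed = value.toString().toUpperCase(); // placeholder transform
                context.write(NullWritable.get(), new Text(transformed));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "etl-transform");
            job.setJarByClass(TransformJob.class);
            job.setMapperClass(TransformMapper.class);
            job.setNumReduceTasks(0); // map-only: per-record transforms need no shuffle
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input staged in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // transformed output in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }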

  1. I do not expect loading into HDFS to be a bottleneck, since the load is distributed among the datanodes - so the network interface will be the only bottleneck. Loading data back into the database might be a bottleneck, but I think it is no worse than it is now. I would design the jobs so that both their input and their output sit in HDFS, and then run some kind of bulk load of the results into the database.
  2. Feedback is a problematic point, since MapReduce really has only one result - the transformed data. All other tricks, such as writing failed records into HDFS files, lack the "functional" reliability of MapReduce because they are side effects. One way to mitigate this is to design your software to be ready for duplicated failed records (a sketch of such a side output follows this list). There is also Sqoop, a tool specifically for migrating data between SQL databases and Hadoop: http://www.cloudera.com/downloads/sqoop/ At the same time I would consider using Hive: if your SQL transformations are not that complicated, it might be practical to create CSV files and do the initial pre-aggregation with Hive, thereby reducing data volumes before going to the (perhaps single-node) database.
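
To illustrate the side-effect approach mentioned in point 2, here is a rough sketch that uses MultipleOutputs to route failed records into a separate named output in HDFS. The "failed" output name and the transform() helper are made up, and the driver would have to register the named output with MultipleOutputs.addNamedOutput(). Task retries and speculative execution can re-run a mapper, so these side files may contain duplicates - exactly why the design should tolerate duplicated failed records:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SideOutputMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                context.write(NullWritable.get(), new Text(transform(value.toString())));
            } catch (RuntimeException e) {
                // Route the bad record to the "failed" named output; its files land in
                // the job output directory in HDFS, prefixed with the named-output name.
                out.write("failed", NullWritable.get(), value);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            out.close();
        }

        private String transform(String record) {
            return record.trim(); // stands in for the real transformation logic
        }
    }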
