
About Hadoop filesystem transferFromLocalFile

I am writing code to transfer files to Hadoop HDFS in parallel, so I have many threads calling filesystem.copyFromLocalFile.

I think the cost of opening a filesystem is not small, so I open just one filesystem in my project. I thought there might be a problem with so many threads calling it at the same time, but so far it works fine.

Could anyone please give me some information about this copy method? Thank you very much, and have a great weekend.

I see the following design points to consider:
a) Where will the bottleneck of the process be? I think that with 2-3 parallel copy operations the local disk or 1 Gb Ethernet will become the bottleneck. You can do this as a multithreaded application, or you can run a few processes. In either case I do not think you need a high level of parallelism.
b) Error handling. The failure of one thread should not stop the whole process, and at the same time no file should be lost. What I usually do in such cases is accept that, in the worst case, a file may be copied twice. If that is OK, the system can work in a simple "copy then delete" scenario.
c) If you copy from one of the cluster nodes, HDFS will become unbalanced, since one replica will be stored on the host you copy from. You will need to rebalance constantly.
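Points a) and b) can be sketched roughly as follows. This is only an illustration, not tested against a cluster: `ParallelUploader`, `Copier` and `uploadAll` are hypothetical names, and `Copier` stands in for the real copy call so the sketch runs without Hadoop on the classpath. In a real program the copier would presumably wrap something like `fs.copyFromLocalFile(new org.apache.hadoop.fs.Path(src), new org.apache.hadoop.fs.Path(dst))` on the single shared `FileSystem`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelUploader {
    /** Hypothetical stand-in for the real copy call (e.g. FileSystem.copyFromLocalFile). */
    interface Copier {
        void copy(Path source) throws IOException;
    }

    /**
     * Copies every file using at most `threads` concurrent copies (2-3 is
     * probably enough, per point a). A local file is deleted only after its
     * copy succeeded ("copy then delete"). Returns the files that failed;
     * they stay on local disk, so a retry may copy a file twice but never loses it.
     */
    static List<Path> uploadAll(List<Path> files, Copier copier, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Boolean>> results = new ArrayList<>();
        for (Path file : files) {
            // Each file is its own task, so one failure does not stop the others.
            results.add(pool.submit(() -> {
                copier.copy(file);   // may throw; then the delete below is skipped
                Files.delete(file);  // reached only after a successful copy
                return true;
            }));
        }
        pool.shutdown(); // accept no new tasks; already submitted ones still run

        List<Path> failed = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            try {
                results.get(i).get(); // wait for this file's task to finish
            } catch (ExecutionException e) {
                failed.add(files.get(i)); // copy or delete threw: keep for retry
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                failed.add(files.get(i)); // outcome unknown: treat as failed
            }
        }
        return failed;
    }
}
```

The returned list can simply be passed to `uploadAll` again later, which matches the "a file may be copied twice" trade-off described in b).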

Can you tell me what more information you want about copyFromLocalFile()?

I'm not sure, but I guess that in your case the threads share the same resource among themselves. Since you have only one instance of FileSystem, the threads will probably share this object on a time-sharing basis.
