
Pentaho Hadoop File Input

I'm trying to retrieve data from a standalone Hadoop (version 2.7.2, with properties configured by default) HDFS using Pentaho Kettle (version 6.0.1.0-386). Pentaho and Hadoop are not on the same machine, but each can reach the other.

I created a new "Hadoop File Input" step with the following properties:

Environment | File/Folder | Wildcard | Required | Include subfolders
            | url-to-file |          | N        | N

url-to-file is built like: ${PROTOCOL}://${USER}:${PASSWORD}@${IP}:${PORT}${PATH_TO_FILE}

e.g.: hdfs://hadoop:@the_ip:50010/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt

Password is empty

I checked that this file exists in HDFS and can be downloaded correctly via the web manager and the hadoop command line.
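As a rough sketch of that command-line check (assuming the hdfs client is on the PATH of the Hadoop machine; the local target directory /tmp is just an example):

# List the file to confirm it exists in HDFS
hdfs dfs -ls /user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt

# Copy it to the local filesystem to confirm it can actually be read
hdfs dfs -get /user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt /tmp/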

Scenario A) When I'm using ${PROTOCOL} = hdfs and ${PORT} = 50010, I get errors in both the Pentaho and Hadoop consoles:

Pentaho:

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016/04/05 15:23:46 - FileInputList - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : org.apache.commons.vfs2.FileSystemEx
ception: Could not list the contents of folder "hdfs://hadoop@172.21.0.35:50010/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmot
e/libelium_waspmote_AC_2_libelium_waspmote.txt".
2016/04/05 15:23:46 - FileInputList -   at org.apache.commons.vfs2.provider.AbstractFileObject.getChildren(AbstractFileObject.java:1193)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.core.fileinput.FileInputList.createFileList(FileInputList.java:243)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.core.fileinput.FileInputList.createFileList(FileInputList.java:142)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.steps.textfileinput.TextFileInputMeta.getTextFileList(TextFileInputMeta.java:1580)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.steps.textfileinput.TextFileInput.init(TextFileInput.java:1513)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.step.StepInitThread.run(StepInitThread.java:69)
2016/04/05 15:23:46 - FileInputList -   at java.lang.Thread.run(Thread.java:745)
2016/04/05 15:23:46 - FileInputList - Caused by: java.io.EOFException: End of File Exception between local host is: "EI001115/192.168.231.248"; destin
ation host is: "172.21.0.35":50010; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2016/04/05 15:23:46 - FileInputList -   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
2016/04/05 15:23:46 - FileInputList -   at com.sun.proxy.$Proxy70.getListing(Unknown Source)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTrans
latorPB.java:554)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2016/04/05 15:23:46 - FileInputList -   at java.lang.reflect.Method.invoke(Method.java:606)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
2016/04/05 15:23:46 - FileInputList -   at com.sun.proxy.$Proxy71.getListing(Unknown Source)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:693)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl$9.call(HadoopFileSystemImpl.java:126)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl$9.call(HadoopFileSystemImpl.java:124)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl.callAndWrapExceptions(HadoopFileSystemImpl
.java:200)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl.listStatus(HadoopFileSystemImpl.java:124)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.big.data.impl.vfs.hdfs.HDFSFileObject.doListChildren(HDFSFileObject.java:115)
2016/04/05 15:23:46 - FileInputList -   at org.apache.commons.vfs2.provider.AbstractFileObject.getChildren(AbstractFileObject.java:1184)
2016/04/05 15:23:46 - FileInputList -   ... 6 more
2016/04/05 15:23:46 - FileInputList - Caused by: java.io.EOFException
2016/04/05 15:23:46 - FileInputList -   at java.io.DataInputStream.readInt(DataInputStream.java:392)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
2016/04/05 15:23:48 - cfgbuilder - Warning: The configuration parameter [org] is not supported by the default configuration builder for scheme: sftp

Hadoop:

2016-04-05 14:22:56,045 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: fiware-hadoop:50010:DataXceiver error processing unknown operation  src: /192.168.231.248:62961 dst: /172.21.0.35:50010
java.io.IOException: Version Mismatch (Expected: 28, Received: 26738 )
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:60)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
        at java.lang.Thread.run(Thread.java:745)

Other scenarios) Using different port numbers (50070, 9000...), I just get an error from Pentaho; the standalone Hadoop does not seem to receive any request at all.

Reading some Pentaho documentation, it seems that the Big Data plugin is built for Hadoop v2.2.x, while I'm trying to connect to 2.7.2. Could that be the source of the problem? Is there any plugin that works with higher versions? Or is my URL to the HDFS file simply wrong?

Thank you everyone for your time; any hint will be more than welcome.

I will answer the question myself because I solved the issue and it is too large for a simple comment.

The issue was solved by making some changes in the Hadoop configuration.

  1. I changed the configuration in core-site.xml

from:

<property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop:9000</value>
</property>

to:

<property>
    <name>fs.default.name</name>
    <value>hdfs://server_ip_address:8020</value>
</property>

Since I was having problems with port 9000, I finally changed to port 8020 (related issue).

  2. Open port 8020 (just in case you have a firewall rule blocking it).
  3. The Pentaho Kettle transformation URL keeps the same form: ${PROTOCOL}://${USER}:${PASSWORD}@${HOST}:${PORT}${FILE_PATH}, but ${PORT} is now 8020 (see the sketch after this list).
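As a quick sanity check once HDFS has picked up the new configuration, something like the following can confirm the NameNode really answers on 8020 (a sketch, assuming the hdfs client is available; server_ip_address is a placeholder, as in the configuration above):

# Print the NameNode address in effect; fs.defaultFS is the current name of the fs.default.name property
hdfs getconf -confKey fs.defaultFS

# List the target file through the full URI, the same way the Kettle step will address it
hdfs dfs -ls hdfs://server_ip_address:8020/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt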

This way I was able to preview data from HDFS via the Pentaho transformation.

Thank you all for your time.
