
How do I configure Apache Flume 1.4.0 to fetch data from Twitter and put it into HDFS (Apache Hadoop 2.5)?

I am using Ubuntu 14.04. My configuration file is as follows:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = Q5JF4gVmrahNk93C913GjgJgB
TwitterAgent.sources.Twitter.consumerSecret = GFM6F0QuqEHn1eKpL1k4CHwdecEp626xLepajp9CAbtRBxEVCC
TwitterAgent.sources.Twitter.accessToken = 152956374-hTFXO9g1RBSn1yikmi2mQClilZe2PqnyqphFQh9t
TwitterAgent.sources.Twitter.accessTokenSecret = SODGEbkQvHYzZMtPsWoI2k9ZKiAd7q21ebtG3SNMu3Y0a
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
#Number of events written to file before it is flushed to HDFS (default: 100)
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
#File size to trigger roll, in bytes (0: never roll based on file size)
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
#Number of events written to file before it is rolled (0 = never roll based on number of events)
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
#The maximum number of events stored in the channel
TwitterAgent.channels.MemChannel.capacity = 10000
#The maximum number of events the channel will take from a source or give to a sink per transaction
TwitterAgent.channels.MemChannel.transactionCapacity = 100
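
A quick sanity check before starting the agent, assuming the NameNode address from hdfs.path above, is to create and list the target directory:

hdfs dfs -mkdir -p /user/flume/tweets
hdfs dfs -ls /user/flume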

I am using the following command on my terminal:

hadoopuser@Hotshot:/usr/lib/flume-ng/apache-flume-1.4.0-bin/bin$ ./flume-ng agent --conf ./conf/ -f /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

I am getting the following error:

14/10/10 17:24:12 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started
14/10/10 17:24:12 INFO twitter4j.TwitterStreamImpl: Establishing connection.
14/10/10 17:24:22 INFO twitter4j.TwitterStreamImpl: Connection established.
14/10/10 17:24:22 INFO twitter4j.TwitterStreamImpl: Receiving status stream.
14/10/10 17:24:22 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
14/10/10 17:24:22 INFO hdfs.BucketWriter: Creating hdfs://localhost:9000/user/flume/tweets//FlumeData.1412942062375.tmp
14/10/10 17:24:22 ERROR hdfs.HDFSEventSink: process failed
java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$RecoverLeaseRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2570)
    at java.lang.Class.privateGetPublicMethods(Class.java:2690)
    at java.lang.Class.privateGetPublicMethods(Class.java:2700)
    at java.lang.Class.getMethods(Class.java:1467)
    at sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
    at sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
    at java.lang.reflect.Proxy$ProxyClassFactory.apply(Proxy.java:672)
    at java.lang.reflect.Proxy$ProxyClassFactory.apply(Proxy.java:592)
    at java.lang.reflect.WeakCache$Factory.get(WeakCache.java:244)
    at java.lang.reflect.WeakCache.get(WeakCache.java:141)
    at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:455)
    at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:738)
    at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
    at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:366)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:262)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:602)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:547)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:226)
    at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:220)
    at org.apache.flume.sink.hdfs.BucketWriter$8$1.run(BucketWriter.java:536)
    at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:160)
    at org.apache.flume.sink.hdfs.BucketWriter.access$1000(BucketWriter.java:56)
    at org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:533)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Is there a compatibility problem between these versions of Apache Flume and Apache Hadoop? I didn't find any good source to help me install Apache Flume 1.5.1. If there are no compatibility problems, what should I do to fetch tweets into my HDFS?

Hadoop 2.5 is built against protobuf 2.5.0, as its parent POM shows:

hadoop-project/pom.xml:    <protobuf.version>2.5.0</protobuf.version>
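
As a quick check (the Flume install path here is taken from the command above), you can list the jars Flume ships to confirm the conflict:

ls /usr/lib/flume-ng/apache-flume-1.4.0-bin/lib | grep -iE 'protobuf|guava'
# in Flume 1.4 this should show protobuf-java-2.4.1.jar (and a guava jar)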

Code generated with protobuf 2.5 is binary-incompatible with older protobuf libraries, and unfortunately the current stable Flume release, 1.4, bundles protobuf 2.4.1. When the HDFS sink loads Hadoop's client classes (generated against protobuf 2.5.0) with Flume's older protobuf on the classpath, the JVM rejects them with the VerifyError shown above. You can fix this by moving both the protobuf and guava jars out of Flume's lib directory so that Hadoop's own versions are picked up instead.
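
A minimal sketch of that fix, assuming the install path from the command above and the usual jar names (adjust them to whatever the listing shows):

cd /usr/lib/flume-ng/apache-flume-1.4.0-bin/lib
mkdir -p ../lib.backup
# Park the conflicting jars outside lib/ so Flume falls back to the
# protobuf 2.5.0 and guava jars on the Hadoop classpath; keep them
# around in case you need to roll back.
mv protobuf-java-2.4.1.jar guava-*.jar ../lib.backup/

After restarting the agent, the HDFS sink should load Hadoop's protobuf classes and the VerifyError should go away.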
