
Flume won't load Twitter data to HDFS

I am trying to load Twitter data into Hadoop. The agent says it has processed nearly 25,000 docs, but when I check Hadoop the folder is always empty. This is the command I am using:

flume-ng agent -n TwitterAgent -f flume.conf
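For reference, the same invocation in long form (assuming the stock flume-ng launcher; `--conf` should point at the directory holding flume-env.sh and log4j.properties) makes each argument explicit:

```shell
# Long-form equivalent of: flume-ng agent -n TwitterAgent -f flume.conf
flume-ng agent \
  --conf ./conf \
  --conf-file flume.conf \
  --name TwitterAgent \
  -Dflume.root.logger=INFO,console   # mirror the log to the console while debugging
```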

Here is an excerpt of the log output:

21/07/18 19:40:03 INFO twitter.TwitterSource: Processed 25,000 docs
21/07/18 19:40:03 INFO twitter.TwitterSource: Total docs indexed: 25,000, total skipped docs: 0
21/07/18 19:40:03 INFO twitter.TwitterSource: 45 docs/second
21/07/18 19:40:03 INFO twitter.TwitterSource: Run took 545 seconds and processed:
21/07/18 19:40:03 INFO twitter.TwitterSource: 0.012 MB/sec sent to index
21/07/18 19:40:03 INFO twitter.TwitterSource: 6.708 MB text sent to index
21/07/18 19:40:03 INFO twitter.TwitterSource: There were 0 exceptions ignored:
21/07/18 19:40:05 INFO twitter.TwitterSource: Processed 25,100 docs
21/07/18 19:40:06 INFO hdfs.BucketWriter: Creating /home/hadoopusr/flumetweets/FlumeData.1626629459197.tmp
21/07/18 19:40:06 WARN hdfs.HDFSEventSink: HDFS IO error
org.apache.hadoop.fs.ParentNotDirectoryException: /home (is not a directory)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkIsDirectory(FSPermissionChecker.java:538)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:278)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:206)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:189)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:507)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1612)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1630)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:551)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:291)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2282)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2225)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)

This is my Flume.config file:

#Naming the components on the current agent.

TwitterAgent.sources = Twitter

TwitterAgent.channels = MemChannel

TwitterAgent.sinks = HDFS

#Describing/Configuring the source

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

TwitterAgent.sources.Twitter.channels=MemChannel

TwitterAgent.sources.Twitter.consumerKey = ************

TwitterAgent.sources.Twitter.consumerSecret =************

TwitterAgent.sources.Twitter.accessToken = ************

TwitterAgent.sources.Twitter.accessTokenSecret = ************

TwitterAgent.sources.Twitter.keywords =covid,covid-19,coronavirus

#Describing/Configuring the sink TwitterAgent.sinks.HDFS.type = hdfs

TwitterAgent.sinks.HDFS.hdfs.path = /home/hadoopusr/flumetweets

TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

TwitterAgent.sinks.HDFS.hdfs.batchSize = 10

TwitterAgent.sinks.HDFS.hdfs.rollSize = 0

TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

TwitterAgent.sinks.HDFS.hdfs.rollCount = 100

#Describing/Configuring the channel

TwitterAgent.channels.MemChannel.type = memory

TwitterAgent.channels.MemChannel.capacity = 1000

TwitterAgent.channels.MemChannel.transactionCapacity = 1000

#Binding the source and sink to the channel

TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sinks.HDFS.channel = MemChannel
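One thing worth double-checking in that config: hdfs.path is a bare path, so it resolves against whatever fs.defaultFS the Flume classpath sees; if core-site.xml is not on the classpath, it silently falls back to the local filesystem, which would also explain an empty HDFS folder. A fully qualified URI removes the ambiguity (the host and port below are placeholders for your NameNode address):

```properties
# Fully qualified sink path; replace localhost:9000 with your NameNode host:port
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/home/hadoopusr/flumetweets
```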

As noted in the comments, you fixed your first error; now you are getting a permission error when writing to the HDFS path as user=amel.

In your config you have

TwitterAgent.sinks.HDFS.hdfs.path = /home/hadoopusr/flumetweets

But I'm guessing /home or /home/hadoopusr in HDFS is either missing or exists as a plain file (the exception literally says /home (is not a directory)), so the sink is trying to create that directory.

However, your user is not hadoopusr (your HDFS superuser), so it does not have permission to do that.
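To confirm which of those it is, you could list the path from the HDFS side (these commands assume the hadoop client is on your PATH and pointed at the same cluster):

```shell
# Does /home exist in HDFS, and is it a directory (drwx...) or a file (-rw...)?
hdfs dfs -ls /
# If it is a directory, check ownership and permissions one level down
hdfs dfs -ls /home
```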

Your options therefore are either

  1. Run flume-ng agent as hadoopusr ( sudo su hadoopusr -c 'flume-ng agent ...' )
  2. Change the HDFS path in the config to use /home/amel, after you create that path and give yourself permissions on it:

     sudo su hadoopusr
     hadoop fs -mkdir -p /home/amel
     hadoop fs -chown -R amel /home/amel
     hadoop fs -chmod -R 760 /home/amel

Try adding this to your Flume.config file:

TwitterAgent.sinks.HDFS.type = hdfs

(In the posted config, this setting sits on the same line as the comment #Describing/Configuring the sink, so it is commented out and never takes effect.)
