
How to adjust the number of mappers and reducers with Tez to speed up a Hive process

I am running a process (word labeling of sentences) over a large dataset (about 150 GB) using Tez, but it takes far too long (a week or more).

So I tried to specify the number of mappers. Although I set mapred.map.tasks=2000, the number of mappers stays at about 150, so I can't get the parallelism I want.

I set this value in the Oozie workflow file and run the job with Tez.

How can I specify the number of mappers?

Ultimately I just want to speed up the process; it does not have to use Tez.

In addition, I count the labeled sentences in the reducers, and that step also takes a long time.

I would also like to know how to adjust the memory size used by each mapper and reducer process.

In order to manually set the number of mappers in a Hive query when Tez is the execution engine, the configuration tez.grouping.split-count can be used...

... setting tez.grouping.split-count=4 will create 4 mappers.

https://community.pivotal.io/s/article/How-to-manually-set-the-number-of-mappers-in-a-TEZ-Hive-job
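For example, a minimal sketch of what this might look like in a Hive session (the table name labeled_sentences is hypothetical, and tez.grouping.split-count is a hint, so Tez may still adjust the actual mapper count based on split grouping):

    -- run on the Tez engine and ask Tez for roughly 2000 grouped splits (mappers)
    SET hive.execution.engine=tez;
    SET tez.grouping.split-count=2000;

    -- the setting applies to queries run afterwards in the same session, e.g.:
    SELECT label, COUNT(*) AS cnt
    FROM labeled_sentences
    GROUP BY label;

If the job is launched through Oozie, the same properties can be passed as configuration for that Hive action instead of SET statements.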


However, overall, you should optimize the storage format and the Hive partitions before you even begin tuning the Tez settings. Do not try to process data STORED AS TEXTFILE in Hive. Convert it to ORC or Parquet first.
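A minimal sketch of such a conversion using CREATE TABLE AS SELECT (the table names sentences_text and sentences_orc are hypothetical):

    -- create an ORC copy of the raw text table and point the heavy queries at it
    CREATE TABLE sentences_orc
    STORED AS ORC
    AS SELECT * FROM sentences_text;

After the conversion, the labeling and counting queries should read from sentences_orc rather than the original text table.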

If Tez isn't working out for you, you can always try Spark. Labelling sentences is probably also covered by an existing Spark MLlib workflow you can find somewhere.
