
How to get the time cost of reading data from HDFS in Spark

Spark's Timeline contains:

  1. Scheduler Delay
  2. Task Deserialization Time
  3. Shuffle Read Time
  4. Executor Computing Time
  5. Shuffle Write Time
  6. Result Serialization Time
  7. Getting Result Time

It seems that the time cost of reading data from a source such as HDFS is included in Executor Computing Time, but I am not sure.

If it is included in Executor Computing Time, how can I measure the read time separately from the computation time?

Thanks.

It's hard to properly isolate how long a read operation takes, because Spark processes the data as it is being read.

A simple best bet is to apply a trivial action (say, count) that has very little overhead. If your file is sizable, the read will vastly dominate the trivial operation, especially one like count that can be done without shuffling data between nodes (aside from the single-value result). See the sketch below.
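For example, here is a minimal sketch in Scala of timing a count over an HDFS file. The application name and HDFS path are placeholders, not from the question; substitute your own.

```scala
import org.apache.spark.sql.SparkSession

object HdfsReadTiming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-read-timing") // hypothetical app name
      .getOrCreate()

    // Hypothetical path; substitute the file you want to measure.
    val path = "hdfs:///data/some-large-file.txt"

    // count() forces Spark to read every record but does almost no
    // per-record work, so for a sizable file the elapsed wall-clock
    // time is dominated by the HDFS read itself.
    val start = System.nanoTime()
    val lines = spark.read.textFile(path).count()
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(f"Read $lines%d lines in $elapsedMs%.1f ms")

    spark.stop()
  }
}
```

Note that a second run over the same file may be faster because the blocks can be served from the OS page cache on the datanodes, so measure on a cold cache (or fresh executors) if you want a representative HDFS read time.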
