
scala - get file size of individual json in directory

I have a schema for json data defined as

import org.apache.spark.sql.types._

val gpsSchema: StructType =
  StructType(Array(
    StructField("Name", StringType, true),
    StructField("GPS", ArrayType(
      StructType(Array(
        StructField("TimeStamp", DoubleType, true),
        StructField("Longitude", DoubleType, true),
        StructField("Latitude", DoubleType, true)
      )), true), true)))

sample json data

{"Name":"John","GPS":[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052}, 
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038}, 
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022}, 
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]}

I have 50 such json files in my input directory ("dbfs:/mnt/input_dir")

val my_dataframe = spark.read.schema(gpsSchema).json("dbfs:/mnt/input_dir")
my_dataframe.count() // returns 50

How can I get the file size for each json in my_dataframe using scala?

You can define a UDF as follows

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.functions.udf

  val fileSize = udf { loc: String =>
    val path = new Path(loc)
    // Build the FileSystem from a fresh Configuration inside the UDF so the
    // closure stays serializable and works when rows are processed on executors
    // (capturing the SparkSession's SparkContext here would fail on a cluster)
    val fs = path.getFileSystem(new Configuration())
    fs.getFileStatus(path).getLen
  }

and call it on your data as follows

   import org.apache.spark.sql.functions._
   import spark.implicits._

   val cols = Seq($"*", input_file_name().as("file_path"), fileSize(input_file_name()).as("file_size"))
   val df = spark.read.format("json").load("data/path").select(cols: _*)

this will give you all the rows of the json along with each row's file_path and its size.
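Since every row coming from the same file carries the same path and size, one way to get exactly one row per file (which is what the question asks for) is to select the distinct path/size pairs; a small usage sketch, assuming the file_path alias used above:

   df.select($"file_path", $"file_size").distinct().show(50, truncate = false)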

Please note that with this approach we create a new FileSystem object for each row, which is highly inefficient, so on large volumes of data you will see some performance impact.
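If that cost matters, a driver-side sketch is to list the files once, build a small (file_path, file_size) DataFrame, and join it back on input_file_name(). This assumes all input files sit directly under one directory such as dbfs:/mnt/input_dir, and that the URIs returned by input_file_name() match the ones the FileSystem listing produces (you may need to normalize schemes otherwise):

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.functions.input_file_name
  import spark.implicits._

  val inputDir = "dbfs:/mnt/input_dir"
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

  // One listing call on the driver instead of one FileSystem per row
  val sizes = fs.listStatus(new Path(inputDir))
    .map(status => (status.getPath.toString, status.getLen))
    .toSeq
    .toDF("file_path", "file_size")

  val withSizes = spark.read.schema(gpsSchema).json(inputDir)
    .withColumn("file_path", input_file_name())
    .join(sizes, Seq("file_path"))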
