
scala - get file size of individual json in directory

I have a schema for json data defined as

import org.apache.spark.sql.types._

val gpsSchema: StructType =
  StructType(Array(
    StructField("Name", StringType, true),
    StructField("GPS", ArrayType(
      StructType(Array(
        StructField("TimeStamp", DoubleType, true),
        StructField("Longitude", DoubleType, true),
        StructField("Latitude", DoubleType, true)
      )), true), true)))

sample json data

{"Name":"John","GPS":[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052}, 
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038}, 
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022}, 
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]}

I have 50 such json files in my input directory ("dbfs:/mnt/input_dir")

val my_dataframe = spark.read.schema(gpsSchema).json("dbfs:/mnt/input_dir")
my_dataframe.count() // returns 50

How can I get the file size for each json in my_dataframe using scala?

You can define a UDF as follows

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.functions.udf

  val fileSize = udf { loc: String =>
    val path = new Path(loc)
    // Build the FileSystem from a fresh Configuration inside the UDF so the
    // closure stays serializable and works when rows are processed on executors
    // (capturing the SparkSession's SparkContext here would fail on a cluster)
    val fs = path.getFileSystem(new Configuration())
    fs.getFileStatus(path).getLen
  }

and call it on your data as follows

   import org.apache.spark.sql.functions._
   import spark.implicits._

   val cols = Seq($"*", input_file_name().as("file_path"), fileSize(input_file_name()).as("file_size"))
   val df = spark.read.format("json").load("data/path").select(cols: _*)

this will give you all the rows of the json along with each row's file_path and its size.
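Since every row coming from the same file carries the same path and size, one way to get exactly one row per file (which is what the question asks for) is to select the distinct path/size pairs; a small usage sketch, assuming the file_path alias used above:

   df.select($"file_path", $"file_size").distinct().show(50, truncate = false)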

Please note that with this approach we create a new FileSystem object for each row, which is highly inefficient, so on large volumes of data you will see some performance impact.
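If that cost matters, a driver-side sketch is to list the files once, build a small (file_path, file_size) DataFrame, and join it back on input_file_name(). This assumes all input files sit directly under one directory such as dbfs:/mnt/input_dir, and that the URIs returned by input_file_name() match the ones the FileSystem listing produces (you may need to normalize schemes otherwise):

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.functions.input_file_name
  import spark.implicits._

  val inputDir = "dbfs:/mnt/input_dir"
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

  // One listing call on the driver instead of one FileSystem per row
  val sizes = fs.listStatus(new Path(inputDir))
    .map(status => (status.getPath.toString, status.getLen))
    .toSeq
    .toDF("file_path", "file_size")

  val withSizes = spark.read.schema(gpsSchema).json(inputDir)
    .withColumn("file_path", input_file_name())
    .join(sizes, Seq("file_path"))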
