scala - get file size of individual json in directory
I have a schema for JSON data defined as:
import org.apache.spark.sql.types._

val gpsSchema: StructType =
  StructType(Array(
    StructField("Name", StringType, true),
    StructField("GPS", ArrayType(
      StructType(Array(
        StructField("TimeStamp", DoubleType, true),
        StructField("Longitude", DoubleType, true),
        StructField("Latitude", DoubleType, true)
      )), true), true)))
Sample JSON data:
{"Name":"John","GPS":[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]}
I have 50 such JSON files in my input directory ("dbfs:/mnt/input_dir"):
val my_dataframe = spark.read.schema(gpsSchema).json("dbfs:/mnt/input_dir")
my_dataframe.count() = 50
How can I get the file size of each JSON file in my_dataframe using Scala?
You can define a UDF as follows:
import org.apache.hadoop.fs.{FileSystem, Path}

val fileSize = udf { loc: String =>
  // Opens a FileSystem handle and stats the file; this runs once per row.
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val path = new Path(loc)
  fs.getFileStatus(path).getLen
}
and call it on your data as follows:
import org.apache.spark.sql.functions._
import spark.implicits._

val cols = Seq($"*", input_file_name().as("file_path"), fileSize(input_file_name()).as("file_size"))
val df = spark.read.schema(gpsSchema).json("dbfs:/mnt/input_dir").select(cols: _*)
This will give you all the rows of the JSON files along with the file_path and file_size columns.
Please note that with this approach we create a new FileSystem object for every row, which is highly inefficient, so on large volumes of data you will see a performance impact.
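To avoid the per-row FileSystem overhead, one alternative (a sketch, not part of the original answer) is to list the directory once on the driver, build a path-to-size map, broadcast it, and have the UDF do a cheap map lookup. This assumes the gpsSchema and directory from the question; also note that input_file_name() returns a fully-qualified URI, which should match the path strings returned by listStatus, but this is worth verifying on your storage layer.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions._

// List files once on the driver and build a path -> size map.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val sizes: Map[String, Long] =
  fs.listStatus(new Path("dbfs:/mnt/input_dir"))
    .map(status => (status.getPath.toString, status.getLen))
    .toMap

// Broadcast the map so executors look sizes up locally
// instead of creating a FileSystem per row.
val sizeLookup = spark.sparkContext.broadcast(sizes)
val fileSize = udf { loc: String => sizeLookup.value.getOrElse(loc, -1L) }

val df = spark.read.schema(gpsSchema).json("dbfs:/mnt/input_dir")
  .withColumn("file_path", input_file_name())
  .withColumn("file_size", fileSize(input_file_name()))
```

With 50 files the driver-side listing is trivial, and the UDF no longer touches the filesystem at all.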