
Is Apache Parquet a data storage engine?

From the link sql-data-sources-parquet I see that the code snippet below stores data in Parquet format, but my understanding from the wiki is that Parquet is just a format, not a storage engine. So Parquet stores data in a specific format on some storage engine like HDFS/S3/Cassandra etc., doesn't it? My question is: where will the code snippet below store the data, since I do not see any mention of a storage engine like HDFS/S3/Cassandra.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// "spark" is an existing SparkSession
Dataset<Row> peopleDF = spark.read().json("examples/src/main/resources/people.json");

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write().parquet("people.parquet");

// Read in the Parquet file created above.
// Parquet files are self-describing so the schema is preserved
// The result of loading a parquet file is also a DataFrame
Dataset<Row> parquetFileDF = spark.read().parquet("people.parquet");
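
For reference, I would expect to be able to point the same write at an explicit storage URI, something like the sketch below (the host name and bucket are made up; the s3a:// path additionally needs the hadoop-aws connector on the classpath):

// Hypothetical targets, shown only to make the storage system explicit
peopleDF.write().parquet("file:///tmp/people.parquet");                // local filesystem
peopleDF.write().parquet("hdfs://namenode:8020/data/people.parquet");  // HDFS
peopleDF.write().parquet("s3a://my-bucket/data/people.parquet");       // S3 via the S3A connector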

The storage system is deduced from the URL scheme, for example s3://examples/src/main/resources/people.json or hdfs://examples/src/main/resources/people.json. The mapping from scheme to org.apache.hadoop.fs.FileSystem implementations is maintained in the Hadoop configuration. For example

<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>

would map s3://... to S3AFileSystem, and there are defaults for some common file systems in case they are not explicitly configured. If the path has no scheme at all (as with people.parquet above), it is resolved against the default filesystem set by fs.defaultFS, which is the local filesystem out of the box and typically HDFS on a cluster.
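
To check which FileSystem implementation a given path resolves to, you can look it up through the Hadoop configuration that Spark carries. A minimal sketch, assuming the spark session from the question (the bucket name is a placeholder, exception handling is omitted since getFileSystem throws IOException, and the s3a lookup assumes hadoop-aws is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hadoop configuration used by this Spark session
Configuration conf = spark.sparkContext().hadoopConfiguration();

// A path with no scheme resolves against fs.defaultFS
FileSystem defaultFs = new Path("people.parquet").getFileSystem(conf);
System.out.println(defaultFs.getClass().getName());   // e.g. org.apache.hadoop.fs.LocalFileSystem

// A path with an explicit scheme resolves via the fs.<scheme>.impl mapping
FileSystem s3Fs = new Path("s3a://my-bucket/people.parquet").getFileSystem(conf);
System.out.println(s3Fs.getClass().getName());        // e.g. org.apache.hadoop.fs.s3a.S3AFileSystem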
