Getting HDFS Location of Hive Table in Spark
I am trying to parse out the Location from a Hive partitioned table in Spark using this query:
val dsc_table = spark.sql("DESCRIBE FORMATTED data_db.part_table")
I was not able to find any query or any other way in Spark to specifically select the Location column from this output.
You can use Spark's table-reading utility:
import org.apache.spark.sql.functions.input_file_name
spark.read.table("myDB.myTable").select(input_file_name()).take(1)
This will result in a string like:
19/06/18 09:59:55 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res1: Array[org.apache.spark.sql.Row] = Array([hdfs://nameservice1/my/path/to/table/store/part-00000-d439163c-9fc4-4768-8b0b-c963f5f7d3d2.snappy.parquet])
I used take(1) only to print one row to show the result here. You may not want to use it if you want all the locations. From this result you can parse the string accordingly in case you want only the location part.
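For instance, stripping the part-file name off the returned path is plain string work; a minimal sketch, using the example path from the output above:

```python
# Minimal sketch: derive the containing directory from the full file path
# returned by input_file_name(); the path is the example result from above.
full_path = ("hdfs://nameservice1/my/path/to/table/store/"
             "part-00000-d439163c-9fc4-4768-8b0b-c963f5f7d3d2.snappy.parquet")
table_dir = full_path.rsplit("/", 1)[0]  # drop the part-file name
print(table_dir)  # hdfs://nameservice1/my/path/to/table/store
```

For a partitioned table this gives the partition directory, not the table root, so you may need to strip further `key=value` segments depending on how deep the file sits.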
The df.inputFiles method in the DataFrame API will print the file paths. It returns a best-effort snapshot of the files that compose this DataFrame.
spark.read.table("DB.TableName").inputFiles
Array[String] = Array(hdfs://test/warehouse/tablename)
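Since inputFiles returns one entry per data file, you may want to collapse the list down to the distinct parent directories; a small sketch with hypothetical file paths:

```python
# Hypothetical inputFiles result: two part files in the same directory
files = [
    "hdfs://test/warehouse/tablename/part-00000.snappy.parquet",
    "hdfs://test/warehouse/tablename/part-00001.snappy.parquet",
]
# Deduplicate down to the containing directories
dirs = sorted({f.rsplit("/", 1)[0] for f in files})
print(dirs)  # ['hdfs://test/warehouse/tablename']
```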
You can also use the .toDF method on the desc formatted table output and then filter on the resulting dataframe.
DataFrame API:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
(or)
RDD API:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table
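Splitting on the port number is brittle (it breaks if the NameNode port differs or the digits appear elsewhere in the path). A more robust sketch parses the location as a URI instead, using the result string from above:

```python
from urllib.parse import urlparse

# The Location value from the desc formatted output above
loc = "hdfs://nn:8020/location/part_table"
print(urlparse(loc).path)  # /location/part_table
```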
I did not find a PySpark answer here, so here is one:
import pyspark.sql.functions as F

table_location = (spark.sql("describe formatted DB.TableName")
    .filter(F.col("col_name") == "Location")
    .select("data_type")
    .collect()[0][0])