
Getting HDFS Location of Hive Table in Spark

I am trying to parse out the Location from a Hive partitioned table in Spark using this query:

val dsc_table = spark.sql("DESCRIBE FORMATTED data_db.part_table")

I was not able to find any query, or any other way in Spark, to specifically select the Location column from this query's output.

You can use Spark's table-reading utility:

import org.apache.spark.sql.functions.input_file_name

spark.read.table("myDB.myTable").select(input_file_name()).take(1)

will result in a string like:

19/06/18 09:59:55 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res1: Array[org.apache.spark.sql.Row] = Array([hdfs://nameservice1/my/path/to/table/store/part-00000-d439163c-9fc4-4768-8b0b-c963f5f7d3d2.snappy.parquet])

I used take(1) only to print one row to show the result here; you may not want it if you need all the locations. From this result you can parse the string accordingly if you only want the location part.
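For example, a minimal sketch of that parsing (my addition, not from the original answer; note that for a partitioned table this yields the partition directory that contains the file, not the table root):

import org.apache.spark.sql.functions.input_file_name

// Take one file path and strip the file name to recover its parent directory.
val filePath = spark.read.table("myDB.myTable")
  .select(input_file_name())
  .head()
  .getString(0)
val dirPath = filePath.substring(0, filePath.lastIndexOf('/'))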

The df.inputFiles method in the DataFrame API returns the file paths. It is a best-effort snapshot of the files that compose this DataFrame.

spark.read.table("DB.TableName").inputFiles
Array[String] = Array(hdfs://test/warehouse/tablename)
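If the table has many files, a small sketch like the following (an assumption on my part, not part of the original answer) deduplicates the parent directories; for a partitioned table these are the partition locations:

spark.read.table("DB.TableName")
  .inputFiles                                    // Array[String] of file paths
  .map(p => p.substring(0, p.lastIndexOf('/')))  // parent directory of each file
  .distinct                                      // one entry per directory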

You can also use the .toDF method on the output of desc formatted table and then filter the resulting dataframe.

DataFrame API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
  .toDF                               // convert to a dataframe with 3 columns: col_name, data_type, comment
  .filter('col_name === "Location")   // filter on col_name
  .collect()(0)(1)
  .toString

Result:

String = hdfs://nn:8020/location/part_table

(or)

RDD API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
  .collect()
  .filter(r => r(0).equals("Location"))  // keep only the Location row
  .map(r => r(1))                        // take only the location value
  .mkString                              // convert to String
  .split("8020")(1)                      // adjust the split to your namenode port, etc.

Result:

String = /location/part_table
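Splitting on the port number is fragile; as a sketch of a more robust alternative (my suggestion, not part of the original answer), you can let java.net.URI extract the path regardless of host and port:

import java.net.URI

val location = spark.sql("desc formatted data_db.part_table")
  .collect()
  .filter(r => r(0).equals("Location"))
  .map(r => r(1).toString)
  .head
val path = new URI(location).getPath   // e.g. /location/part_table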

I didn't find an answer for PySpark, so:

from pyspark.sql import functions as F

table_location = (spark.sql("describe formatted DB.TableName")
    .filter(F.col('col_name') == 'Location')
    .select("data_type")
    .toPandas().astype(str)['data_type'].values[0])
