Spark/Scala: output each dataset as a single row of a DataFrame
I have multiple .nt (N-Triples) files in a directory. I want to read each dataset and store its respective output values in a single row of a DataFrame.

Say I have dataset1.nt, dataset2.nt, ..., datasetn.nt. When I read each dataset with the following code:
val input = "src/main/resources/dataset1.nt"
val triplesRDD = NTripleReader.load(spark, JavaURI.create(input))
// NTripleReader reads an .nt file and splits each line of the dataset into subject, predicate and object
/* my code to output the number of distinct subjects, distinct predicates and blank subjects in the dataset */
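For completeness, here is a minimal sketch of how those three metrics can be computed. It assumes `NTripleReader` is the SANSA-Stack reader (`net.sansa_stack.rdf.spark.io.NTripleReader`), which loads an N-Triples file into an `RDD[org.apache.jena.graph.Triple]`:

```scala
import java.net.{URI => JavaURI}
import org.apache.spark.sql.SparkSession
// Assumption: the SANSA-Stack N-Triples reader, returning RDD[org.apache.jena.graph.Triple]
import net.sansa_stack.rdf.spark.io.NTripleReader

val spark = SparkSession.builder.master("local[*]").appName("nt-metrics").getOrCreate()
val triplesRDD = NTripleReader.load(spark, JavaURI.create("src/main/resources/dataset1.nt"))

// Count distinct subjects and predicates across the dataset
val distinctSubjects   = triplesRDD.map(_.getSubject).distinct().count()
val distinctPredicates = triplesRDD.map(_.getPredicate).distinct().count()
// Jena's Node API flags blank nodes via isBlank
val blankSubjects      = triplesRDD.map(_.getSubject).filter(_.isBlank).distinct().count()

println(s"$distinctSubjects | $distinctPredicates | $blankSubjects")
```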
Suppose dataset1 produces the following output:

and dataset2 produces the following output:

and so on...

When I read all the files in the directory with the following code:
val input = "src/main/resources/*"
val triplesRDD = NTripleReader.load(spark, JavaURI.create(input))
it gives me the following output. (The wildcard loads every file into a single RDD, so the counts are computed over all datasets combined rather than per file.)

However, I want my output to look like this:
| Distinct Subjects | Distinct Predicates | Blank Subjects |
| ----------------- | ------------------- | -------------- |
| xxxx              | yy                  | zzz            |
| aaaaa             | b                   | cc             |
| ...               | ...                 | ...            |
Please let me know how to achieve the desired output.

Thanks in advance.
I am answering my own question here; I hope it helps someone else.
import java.io.File
import java.net.{URI => JavaURI}
import org.apache.spark.sql.{DataFrame, SparkSession}
// Assumption: NTripleReader is the SANSA-Stack N-Triples reader
import net.sansa_stack.rdf.spark.io.NTripleReader

object abc {
  var df1: DataFrame = _
  var df2: DataFrame = _
  var df3: DataFrame = _

  def main(args: Array[String]): Unit = {
    // Initialize the Spark session locally
    val spark = SparkSession.builder
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .appName("abc")
      .getOrCreate()

    // Lists all regular files in the given directory
    def getListOfFiles(dir: String): List[File] = {
      val path = new File(dir)
      if (path.exists && path.isDirectory)
        path.listFiles.filter(_.isFile).toList
      else
        List[File]()
    }

    val files = getListOfFiles("path/to/directory/")

    for (input <- files) {
      // println(input)
      val triplesRDD = NTripleReader.load(spark, JavaURI.create(input.toString))
      /* code to compute the per-dataset column values (placeholders below) */
      import spark.implicits._
      if (input == files(0)) {
        // First file: create the result DataFrame with its single row
        df3 = Seq(
          (column1_value, column2_value, column3_value, column4_value, column5_value, column6_value)
        ).toDF("column1_name", "column2_name", "column3_name", "column4_name", "column5_name", "column6_name")
      } else {
        // Every other file: build a one-row DataFrame and append it via union
        df1 = Seq(
          (column1_value, column2_value, column3_value, column4_value, column5_value, column6_value)
        ).toDF("column1_name", "column2_name", "column3_name", "column4_name", "column5_name", "column6_name")
        df2 = df3.union(df1)
        df3 = df2
      }
    }

    df3.show()

    // Export the DataFrame to CSV (note: Spark writes a directory of part
    // files at this path, not a single sample.csv file)
    df3.coalesce(1).write
      .option("header", "true")
      .csv("path/to/directory/sample.csv")

    spark.stop
  }
}
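As a follow-up, the mutable `df1`/`df2`/`df3` bookkeeping can be avoided by collecting one tuple per file and calling `toDF` once. This is a sketch, reusing `spark`, `files`, `NTripleReader` and `JavaURI` from the answer above; `computeMetrics` is a hypothetical helper standing in for the elided per-dataset computation:

```scala
import spark.implicits._

// Hypothetical helper: returns (distinct subjects, distinct predicates,
// blank subjects) for one dataset -- stands in for the elided code above.
def computeMetrics(triples: org.apache.spark.rdd.RDD[org.apache.jena.graph.Triple]): (Long, Long, Long) = (
  triples.map(_.getSubject).distinct().count(),
  triples.map(_.getPredicate).distinct().count(),
  triples.map(_.getSubject).filter(_.isBlank).distinct().count()
)

// One row per file, a single toDF call, no mutable DataFrame variables
val result = files
  .map(f => computeMetrics(NTripleReader.load(spark, JavaURI.create(f.toString))))
  .toDF("Distinct Subjects", "Distinct Predicates", "Blank Subjects")

result.show()
```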