From R to Scala: how can I read a CSV in Scala, save it to a res slot, and then combine those res slots into a sample CSV?
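I want to create a random sample from a large dataset. I have a 6 GB Kaggle dataset.
This is how I do it in R, repeating the same steps for each part of the file: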
ctr <- read.csv("train.csv", nrow = 1000000) # importing by parts
ctr_sample <- sample(nrow(ctr), nrow(ctr)*0.1)
ctr_1 <- ctr[ctr_sample,] # creating 10% sample
nrow(ctr_1)
rm(ctr)
ctr2 <- read.csv("train.csv", header = TRUE, nrow = 1000000, skip = 1000000,
                 col.names = c("id", "click", "hour", "C1", "banner_pos", "site_id", "site_domain",
                               "site_category", "app_id", "app_domain", "app_category", "device_id", "device_ip",
                               "device_model", "device_type", "device_conn_type", "C14", "C15", "C16", "C17", "C18",
                               "C19", "C20", "C21"))
ctr_sample <- sample(nrow(ctr2), nrow(ctr2)*0.1)
ctr_2 <- ctr2[ctr_sample,]
nrow(ctr_2)
rm(ctr2)
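I continue like this for ctr2, ctr3, ... up to ctr7, then combine the pieces with rbind in R into ctr_set and export that to a sample CSV.
In Scala, however, I don't know how to read only 1000 rows from the CSV, skip those rows when reading the next 1000, take a 0.1 sample of each part, store the samples in res0, res1, res2, and then combine those res values by rows and export the whole sample as a PDF.
Is anyone here comfortable with both languages? I'm a student and could use some tutoring.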
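The chunk-by-chunk workflow described above could be sketched in plain Scala, without Spark, roughly as follows. This is only a sketch under some assumptions: Scala 2.13 or later (for scala.util.Using), a CSV whose rows contain no embedded newlines, and hypothetical names (ChunkedSample, sampleCsv and its parameters) chosen for illustration. Each chunk is sampled and written straight to the output file, so no separate res0, res1, ... values or rbind step are needed.

```scala
import java.io.PrintWriter
import scala.io.Source
import scala.util.{Random, Using}

object ChunkedSample {
  /** Reads `inPath` in chunks of `chunkSize` data rows, keeps a random
    * `fraction` of each chunk, and writes the header plus all kept rows
    * to `outPath`. Returns the number of data rows written.
    */
  def sampleCsv(inPath: String, outPath: String, chunkSize: Int,
                fraction: Double, seed: Long): Int = {
    val rng = new Random(seed)
    // Using.resources closes both the input and the output when the block ends
    Using.resources(Source.fromFile(inPath), new PrintWriter(outPath)) { (src, out) =>
      val lines = src.getLines()
      if (!lines.hasNext) 0
      else {
        out.println(lines.next()) // copy the header row once
        var written = 0
        // grouped() yields one chunk at a time, so only one chunk is in memory
        for (chunk <- lines.grouped(chunkSize)) {
          val k = (chunk.size * fraction).toInt   // sample size, rounded down
          val picked = rng.shuffle(chunk).take(k) // sample without replacement
          picked.foreach(out.println)
          written += k
        }
        written
      }
    }
  }
}
```

A call such as ChunkedSample.sampleCsv("train.csv", "ctr_sample.csv", 1000000, 0.1, 42L) would mirror the R steps: one million rows per chunk and a 10% sample of each.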
To do the equivalent of rbind in Spark, you can use the RDD.union function, like this:
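// sc is the SparkContext (predefined in spark-shell)
val ctr_1 = sc.textFile("ctr_1")
val ctr_2 = sc.textFile("ctr_2")
// union appends the rows of ctr_2 to those of ctr_1, like rbind
val combine = ctr_1.union(ctr_2)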
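If Spark is available, the sampling step itself might not need manual chunking at all, since Spark already processes a text file partition by partition. The following is a minimal sketch, assuming Spark is on the classpath and the first line of the file is the header; SparkSample and sampleRows are hypothetical names. Note that RDD.sample with a fraction keeps each row independently with that probability, so the result is roughly, not exactly, 10% of the rows.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SparkSample {
  /** Drops the header line and keeps roughly `fraction` of the remaining
    * rows, sampled without replacement.
    */
  def sampleRows(sc: SparkContext, path: String,
                 fraction: Double, seed: Long): RDD[String] = {
    val lines = sc.textFile(path)
    val header = lines.first() // assumes the file is non-empty
    lines
      .filter(_ != header)     // remove the header (and any identical line)
      .sample(withReplacement = false, fraction, seed)
  }
}

// Possible usage in spark-shell (paths are placeholders):
//   val sample = SparkSample.sampleRows(sc, "train.csv", 0.1, 42L)
//   sample.saveAsTextFile("ctr_sample") // writes a directory of part files
```

Samples taken from separate files this way can still be combined with union as shown above before saving.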