SQL: concatenate many tsv files into a single table in a database, while keeping track of file source (MonetDBLite)
I am using the MonetDBLite R package to create a MonetDB database. I can create database tables just fine using the instructions from here, with the following code:
library(DBI)
library(MonetDBLite)
# Write tsv file of mtcars
write.table(mtcars, "mtcars.tsv", row.names=FALSE, sep= "\t")
# Initialize MonetDB
dbdir <- "/Users/admin/my_directory"
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)
# Write table
dbWriteTable(con, "test4", "mtcars.tsv", delim="\t")
and the following query gives
> dbGetQuery(con, "SELECT * FROM test4 LIMIT 3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
So far so good. But say I have another file, mtcars2, with different mpg values:
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg + 5
write.table(mtcars2, "mtcars2.tsv", row.names= FALSE, sep = "\t")
I can load it into another table:
dbWriteTable(con, "test5", "mtcars2.tsv", delim = "\t")
> dbGetQuery(con, "SELECT * FROM test5 LIMIT 3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 26.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 27.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Also fine. But my problem is this: later on I want to look up the mpg for all cars with 6 cyl, and know which dataset each row came from (mtcars or mtcars2). From what I understand of SQL indexing (which is not a lot, and is basically what I've read here), I should have all my data in one table to get the most efficient searches. I tried loading the first tsv file, then added another column using the SQL commands ALTER TABLE test4 ADD dataset TEXT and UPDATE test4 SET dataset = 1:
dbSendQuery(con, "ALTER TABLE test4 ADD dataset TEXT")
dbSendQuery(con, "UPDATE test4 SET dataset = 1")
> dbGetQuery(con, "SELECT * FROM test4 LIMIT 3")
mpg cyl disp hp drat wt qsec vs am gear carb dataset
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
but then when I tried to append mtcars2 to the table, it had a different number of columns (as I should have expected, duh). What's the best way to concatenate data from many tsv files with identical columns into a single table, while keeping track of the data's source?
EDIT: as you might have guessed, the real data is not mtcars. It's flat tsv files millions of lines long, meaning I want to avoid reading a whole file into memory and manipulating it with R.
Following xQbert's suggestion, I solved this using SQL commands only (necessary, and faster than bash commands, considering my data is 10s of files, each millions of lines long).
library(DBI)
library(MonetDBLite)
# Write tsv file of mtcars
write.table(mtcars, "mtcars.tsv", row.names=FALSE, sep= "\t")
# Write tsv of second mtcars
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg + 5
write.table(mtcars2, "mtcars2.tsv", row.names= FALSE, sep = "\t")
# Initialize MonetDB
dbdir <- "/Users/admin/"
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)
# Write table
dbWriteTable(con, "test4", "mtcars.tsv", delim="\t")
# Add data source information
dbSendQuery(con, "ALTER TABLE test4 ADD source TEXT")
dbSendQuery(con, "UPDATE test4 SET source = 'dataset1'")
# Write second dataset to a temporary table
dbWriteTable(con, "temptable", "mtcars2.tsv", delim="\t")
# Add data source information
dbSendQuery(con, "ALTER TABLE temptable ADD source TEXT")
dbSendQuery(con, "UPDATE temptable SET source = 'dataset2'")
# Insert temp table into main table
dbSendQuery(con, "INSERT INTO test4 SELECT * FROM temptable")
# Drop temp table
dbSendQuery(con, "DROP TABLE temptable")
# Checking the data, truncated for clarity
> dbGetQuery(con, "SELECT * FROM test4")
mpg cyl disp hp drat wt qsec vs am gear carb source
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 dataset1
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 dataset1
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 dataset1
...
33 26.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 dataset2
34 26.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 dataset2
35 27.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 dataset2
...
64 26.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 dataset2
Sorry if I didn't make it clear enough in the question that my data is much larger than mtcars. If you have medium-sized data, the data.table package is probably a better solution than a database.
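With the source column in place, the lookup described in the question (mpg for all 6-cylinder cars, together with the dataset each row came from) can be sketched as below. The helper names mpg_by_source_sql and lookup_mpg are made up for illustration; the table name test4 and the column name source are the ones used in the code above, and con is assumed to be the open MonetDBLite connection.

```r
# Build the lookup query for a given cylinder count.
# `table` and `source_col` default to the names used in the answer above.
mpg_by_source_sql <- function(cyl, table = "test4", source_col = "source") {
  stopifnot(is.numeric(cyl), length(cyl) == 1)
  sprintf("SELECT mpg, %s FROM %s WHERE cyl = %d ORDER BY %s",
          source_col, table, as.integer(cyl), source_col)
}

# Run the lookup on an open DBI connection (assumed: `con` from the code above).
# Returns a data.frame with one row per matching car.
lookup_mpg <- function(con, cyl, ...) {
  DBI::dbGetQuery(con, mpg_by_source_sql(cyl, ...))
}
```

Calling lookup_mpg(con, 6) would then return the mpg and source columns for every 6-cylinder row in test4, so the two datasets can be told apart without keeping separate tables.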
You should be able to do what you want by reading the file, creating a new variable in the data.frame, and then calling dbWriteTable(). Something like:
library(DBI)
library(MonetDBLite)
library(data.table)
# Write tsv file of mtcars
tmp <- tempfile()
write.table(mtcars, tmp, row.names=FALSE, sep= "\t")
# Initialize MonetDB
dbdir <- "~/Desktop/temp"
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)
test4df <- fread(tmp)
test4df$dataset <- 1
dbWriteTable(con, "test4", test4df)
dbReadTable(con, "test4")
test5df <- fread(tmp)
test5df$mpg <- test5df$mpg + 5
test5df$dataset <- 2
dbWriteTable(con, "test4", test5df, append = TRUE)
dbReadTable(con, "test4")
Edit (a suggestion for doing this without reading the whole file into memory)
If you want to do the work without reading the whole file in at once, you can do something like this to rewrite the file with an extra field appended. As written, this will only work on an OS with bash.
infile <- tmp
outfile <- tempfile()
# open connections
incon <- file(description = infile, open = "r")
outcon <- file(description = outfile, open = "w")
# count the number of lines (this will work only with Mac/Linux)
com <- paste("wc -l ", infile, " | awk '{ print $1 }'", sep="")
n <- as.integer(system(command=com, intern=TRUE))
# work with the first line (the header); split on tabs only
txt <- scan(file = incon, what = character(), sep = "\t", nlines=1, quiet=TRUE)
txt <- c(txt, "dataset")
cat(paste(txt, collapse = "\t"), "\n", file = outcon, sep = "")
# work with the rest of the file, one data row per iteration
for(i in seq_len(n - 1)) {
txt <- scan(file = incon, what = character(), sep = "\t", nlines=1, quiet=TRUE)
txt <- c(txt, "1")
cat(paste(txt, collapse = "\t"), "\n", file = outcon, sep = "")
}
close(incon);close(outcon)
dbWriteTable(con, "test4", outfile, delim = "\t")
# do the similar for other files
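The loop above calls scan() once per row, which can be slow for files with millions of lines, and it relies on wc to count lines. A minimal sketch of the same idea in base R only, reading a fixed number of lines at a time so memory use stays bounded. The function name append_constant_column and the chunk_lines parameter are illustrative, not part of any package, and the sketch assumes every line of the input is one complete record.

```r
# Copy a delimited text file, appending one extra column whose value is the
# same on every data row (for example a dataset label).
append_constant_column <- function(infile, outfile, col_name, value,
                                   chunk_lines = 100000L) {
  incon <- file(infile, open = "r")
  outcon <- file(outfile, open = "w")
  on.exit({ close(incon); close(outcon) })

  # Header line: append the new column name.
  header <- readLines(incon, n = 1L)
  if (length(header) == 0L) stop("input file is empty")
  writeLines(paste(header, col_name, sep = "\t"), outcon)

  # Data lines: read in chunks and append the constant value to each line.
  repeat {
    chunk <- readLines(incon, n = chunk_lines)
    if (length(chunk) == 0L) break
    writeLines(paste(chunk, value, sep = "\t"), outcon)
  }
  invisible(outfile)
}
```

The resulting file can be passed to dbWriteTable(con, "test4", outfile, delim = "\t") as in the code above. Because the lines are never parsed, quoting and field contents are passed through unchanged.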
Here is what I would do, given a set of files with the same structure, where the file name should appear in the final table, which is otherwise a combination of the data from all files:
# say we have those files
write.table(mtcars, "mtcars1.tsv", row.names=FALSE, sep= "\t")
write.table(mtcars, "mtcars2.tsv", row.names=FALSE, sep= "\t")
# write them individually, and add a column that contains the file name
dbWriteTable(con, "mtcars1", "mtcars1.tsv", delim="\t")
dbSendQuery(con, "ALTER TABLE mtcars1 ADD COLUMN file STRING DEFAULT 'mtcars1.tsv';")
dbWriteTable(con, "mtcars2", "mtcars2.tsv", delim="\t")
dbSendQuery(con, "ALTER TABLE mtcars2 ADD COLUMN file STRING DEFAULT 'mtcars2.tsv';")
# now combine into a new table
dbSendQuery(con, "CREATE TABLE mtcars_mat AS SELECT * FROM mtcars1 UNION ALL SELECT * FROM mtcars2")
# or a view if you don't need to modify the data in the mtcars table (faster)
dbSendQuery(con, "CREATE view mtcars AS SELECT * FROM mtcars1 UNION ALL SELECT * FROM mtcars2")
# and here is the same as a loop with a filename glob and some added robustness (handy if you have 1000 files)
files <- Sys.glob("/some/path/mtcars*.tsv")
tables <- dbQuoteIdentifier(con, tools::file_path_sans_ext(basename(files)))
dbBegin(con)
for (i in seq_along(files)) {
dbWriteTable(con, tables[i], files[i], delim="\t", transaction=FALSE)
dbSendQuery(con, paste0("ALTER TABLE ", tables[i], " ADD COLUMN file STRING DEFAULT ",dbQuoteString(con, files[i]),";"))
}
dbSendQuery(con, paste0("CREATE TABLE somefinalresult AS ", paste0("SELECT * FROM ",tables, collapse=" UNION ALL ")))
# remove the parts again, optional
dbSendQuery(con, paste0("DROP TABLE ", tables, ";", collapse=" "))
dbCommit(con)