R and dplyr: How can I use compute() to create a persistent table from a SQL query in a different schema than the source schema?

Question

I have a question similar to this Stack Overflow post.

How can I create a persistent table from a SQL query in a database (I use a DB2 database)? My goal is to use a table from one schema and to permanently create a more or less modified table in another schema.

What works so far is to pull the data to R and subsequently create a table in a different schema:

dplyr::tbl(con, dbplyr::in_schema("SCHEMA_A", "TABLE")) %>%
  dplyr::collect() %>%
  DBI::dbWriteTable(con, DBI::Id(schema = "SCHEMA_B", table = "NEW_TABLE"), ., overwrite = TRUE)

However, I'd like to incorporate the compute() function in a dplyr pipeline so that I do not have to pull the data into R; that is, I'd like to keep the data on the database. As a side note: I do not know how I would substitute DBI's dbWriteTable() with dplyr's copy_to(). Being able to do that would also help me.

Unfortunately, I am not able to make it work, even after reading ?compute() and its online documentation. The following code skeleton does not work and results in an error:

dplyr::tbl(con, dbplyr::in_schema("SCHEMA_A", "TABLE")) %>%
  dplyr::compute(dbplyr::in_schema("SCHEMA_B", "NEW_TABLE"), analyze = FALSE, temporary = FALSE)

Is there a solution for using compute() or some other solution applicable to a dplyr pipeline?

Answer 1

I use a custom function that takes the SQL query behind a remote table, converts it into a query that can be executed on the SQL server to save a new table, and then executes that query using the DBI package. Key details are below; full details (and other functions I find useful) are in my GitHub repository.

write_to_database <- function(input_tbl, db_connection, db, schema, tbl_name){
  # SQL query
  sql_query <- glue::glue("SELECT *\n",
                          "INTO {db}.{schema}.{tbl_name}\n",
                          "FROM (\n",
                          dbplyr::sql_render(input_tbl),
                          "\n) AS from_table")
  
  # run query
  DBI::dbExecute(db_connection, as.character(sql_query))
}

The essence of the idea is to construct an SQL query that would give you the desired outcome if you executed it directly in your database's SQL dialect. In my application this takes the form:

SELECT *
INTO db.schema.table
FROM (
  /* sub query for existing table */
) AS alias

Note that this uses SQL Server, and your particular SQL syntax might be different. INTO is the SQL Server pattern for writing a table. In the example linked to in the question, the syntax is TO TABLE.
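To inspect the statement before running it against a real server, the query can be rendered against a simulated backend. The following is a minimal sketch, assuming dplyr and dbplyr are installed. lazy_frame() and simulate_mssql() are dbplyr helpers that need no database connection; the column names and the target_db, target_schema and new_table identifiers are hypothetical placeholders, not names from the question.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Simulated remote table on a SQL Server backend (no connection required).
# The columns `id` and `variable` are made up for illustration.
source_tbl <- lazy_frame(id = 1L, variable = "ABC", con = simulate_mssql())

# Any dplyr pipeline; it is translated to SQL, not executed.
filtered <- source_tbl %>%
  filter(variable == "ABC")

# SQL behind the lazy table, as a plain character string.
inner_sql <- as.character(sql_render(filtered))

# Wrap it in the SELECT ... INTO pattern described above.
# target_db.target_schema.new_table is a hypothetical target.
full_sql <- paste0(
  "SELECT *\n",
  "INTO target_db.target_schema.new_table\n",
  "FROM (\n",
  inner_sql,
  "\n) AS from_table"
)

cat(full_sql, "\n")
```

Printing the statement this way makes it easy to paste it into a SQL client and confirm that it is valid for your database before handing it to DBI::dbExecute().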

Answer 2

Thanks to @Simon.SA, I could solve my problem. As he showed in his reply, one can define a custom function and incorporate it in a dplyr pipeline. My adapted code looks like this:

# Custom function

write_to_database <- function(input_tbl, db_connection, schema, tbl_name){
  
  # SQL query

  sql_query <- glue::glue("CREATE TABLE {schema}.{tbl_name} AS (\n",
                      "SELECT * FROM (\n",
                      dbplyr::sql_render(input_tbl),
                      "\n)) WITH DATA;")

  # Drop table if it exists
  # (uses the db_connection argument rather than a global `con` object)

  DBI::dbExecute(db_connection, glue::glue("BEGIN\n",
                                    "IF EXISTS\n",
                                      "(SELECT TABNAME FROM SYSCAT.TABLES WHERE TABSCHEMA = '{schema}' AND TABNAME = '{tbl_name}') THEN\n",
                                        "PREPARE stmt FROM 'DROP TABLE {schema}.{tbl_name}';\n",
                                        "EXECUTE stmt;\n", 
                                    "END IF;\n",
                                 "END"))
  
  # Run query

  DBI::dbExecute(db_connection, as.character(sql_query))
}

# Dplyr pipeline

dplyr::tbl(con, dbplyr::in_schema("SCHEMA_A", "SOURCE_TABLE_NAME")) %>%
  dplyr::filter(VARIABLE == "ABC") %>% 
  dplyr::show_query() %>%  # prints the SQL and passes the table on unchanged
  write_to_database(., con, "SCHEMA_B", "NEW_TABLE_NAME")

It turns out that DB2 does not appear to support DROP TABLE IF EXISTS, so some additional programming is necessary. I used this Stack Overflow post to get it done. Furthermore, in my case I do not need to specify the database explicitly, so the parameter db of the custom function is left out.
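If the DBI driver in use supports schema-qualified identifiers, a possible alternative to the compound SQL block is to do the existence check from R with DBI::dbExistsTable() and DBI::dbRemoveTable(). This is only a sketch: the helper name drop_table_if_exists and its argument table_id are hypothetical, and whether a given DB2 driver resolves DBI::Id() with a schema correctly is an assumption that has not been tested here.

```r
# Hypothetical helper: drop a table only if it already exists.
# `table_id` is expected to be a DBI::Id() object, for example
# DBI::Id(schema = "SCHEMA_B", table = "NEW_TABLE_NAME").
drop_table_if_exists <- function(db_connection, table_id) {
  if (DBI::dbExistsTable(db_connection, table_id)) {
    DBI::dbRemoveTable(db_connection, table_id)
    return(invisible(TRUE))   # a table was dropped
  }
  invisible(FALSE)            # nothing to drop
}

# Possible usage inside write_to_database(), assuming an open connection `con`:
# drop_table_if_exists(con, DBI::Id(schema = "SCHEMA_B", table = "NEW_TABLE_NAME"))
```

Doing the check in R keeps the SQL portable across backends, at the cost of one extra round trip to the database.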
