How to read a large table (>100 columns (variables) and 100,000 observations) from SQL Server into R using the odbc package
I'm getting an error when reading a large table into R from SQL Server.
Here is my connection code:
library(odbc)
library(DBI)
con <- dbConnect(odbc::odbc(),
.connection_string = 'driver={SQL Server};server=DW01;database=SFAF_DW;trusted_connection=true')
Here is the schema of my table, which has 149 variables:
data1 <- dbGetQuery(con, "SELECT * FROM [eCW].[Visits]")
I got an error from this code, probably because the table is so large.
I would like to reduce the number of observations by filtering on the "VisitDateTime" variable.
data2 <- dbGetQuery(con, "SELECT cast(VisitDateTime as DATETIME) as VisitDateTime FROM [eCW].[Visits] WHERE VisitDateTime>='2019-07-01 00:00:00' AND VisitDateTime<='2020-06-30 12:00:00'")
This code selects only the "VisitDateTime" variable, but I would like to get all 149 variables from the table.
I'm hoping for some efficient code. I greatly appreciate your help with this. Thank you.
According to your schema, you have many variable-length columns of type varchar with a length of 255 characters. As multiple answers on the similar error post suggest, you cannot rely on the arbitrary column order of SELECT * but must explicitly reference each column and place the variable-length columns toward the end of the SELECT clause. In fact, in application code that runs SQL, generally avoid SELECT * FROM. See Why is SELECT * considered harmful?
Fortunately, from your schema output using INFORMATION_SCHEMA.COLUMNS, you can dynamically build such a long list of named columns for SELECT. First, adjust and run your schema query so that it returns an R data frame, with a calculated column that orders the types from smallest to largest along with their precision/lengths.
schema_sql <- "SELECT sub.TABLE_NAME, sub.COLUMN_NAME, sub.DATA_TYPE, sub.SELECT_TYPE_ORDER
, sub.CHARACTER_MAXIMUM_LENGTH, sub.CHARACTER_OCTET_LENGTH
, sub.NUMERIC_PRECISION, sub.NUMERIC_PRECISION_RADIX, sub.NUMERIC_SCALE
FROM
(SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
, NUMERIC_PRECISION, NUMERIC_PRECISION_RADIX, NUMERIC_SCALE
, CASE DATA_TYPE
WHEN 'tinyint' THEN 1
WHEN 'smallint' THEN 2
WHEN 'int' THEN 3
WHEN 'bigint' THEN 4
WHEN 'date' THEN 5
WHEN 'datetime' THEN 6
WHEN 'datetime2' THEN 7
WHEN 'decimal' THEN 8
WHEN 'varchar' THEN 9
WHEN 'nvarchar' THEN 10
END AS SELECT_TYPE_ORDER
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
) sub
ORDER BY sub.SELECT_TYPE_ORDER
, sub.NUMERIC_PRECISION
, sub.NUMERIC_PRECISION_RADIX
, sub.NUMERIC_SCALE
, sub.CHARACTER_MAXIMUM_LENGTH
, sub.CHARACTER_OCTET_LENGTH"
visits_schema_df <- dbGetQuery(con, schema_sql)
# BUILD COLUMN LIST FOR SELECT CLAUSE
select_columns <- paste0("[", paste(visits_schema_df$COLUMN_NAME, collapse="], ["), "]")
# RUN QUERY WITH EXPLICIT COLUMNS
data <- dbGetQuery(con, paste("SELECT", select_columns, "FROM [eCW].[Visits]"))
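The question also asked to restrict rows by VisitDateTime while keeping every column. A minimal sketch of combining the explicit column list with that filter follows. The helper name build_visits_query is hypothetical, the three column names are only stand-ins for visits_schema_df$COLUMN_NAME, and binding the dates through `?` placeholders assumes the driver in use supports parameterized queries.

```r
# Hypothetical helper: builds a SELECT with explicit, bracket-quoted columns
# and a date-range filter on VisitDateTime using `?` placeholders.
build_visits_query <- function(columns) {
  select_columns <- paste0("[", columns, "]", collapse = ", ")
  paste("SELECT", select_columns,
        "FROM [eCW].[Visits]",
        "WHERE VisitDateTime >= ? AND VisitDateTime <= ?")
}

# Stand-in column names; the real list would come from visits_schema_df$COLUMN_NAME
sql <- build_visits_query(c("DW_Id", "VisitDateTime", "LastHIVTestResult"))
print(sql)

# With a live connection, the placeholders would be bound via `params`:
# data2 <- dbGetQuery(con, sql,
#                     params = list("2019-07-01 00:00:00", "2020-06-30 12:00:00"))
```

Filtering on the server this way means only the rows in the date range travel over the connection, which is usually cheaper than reading the whole table and subsetting in R.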
The above may need adjustment if the same error arises. Be proactive and test on your end by isolating the problem columns, column types, etc. A few suggestions include filtering on DATA_TYPE or COLUMN_NAME, or rearranging the ORDER BY columns in the schema query.
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
AND DATA_TYPE IN ('tinyint', 'smallint', 'int') -- TEST WITH ONLY INTEGER TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
AND NOT DATA_TYPE IN ('varchar', 'nvarchar') -- TEST WITHOUT VARIABLE STRING TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
AND NOT DATA_TYPE IN ('decimal', 'datetime2') -- TEST WITHOUT HIGH PRECISION TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
AND NOT COLUMN_NAME IN ('LastHIVTestResult') -- TEST WITHOUT LARGE VARCHARs
...
ORDER BY sub.SELECT_TYPE_ORDER -- ADJUST ORDERING
, sub.NUMERIC_SCALE
, sub.NUMERIC_PRECISION
, sub.NUMERIC_PRECISION_RADIX
, sub.CHARACTER_OCTET_LENGTH
, sub.CHARACTER_MAXIMUM_LENGTH
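The isolation tests above can also be automated from R. The following is only a sketch: find_failing_columns and fake_run_query are hypothetical names, and the fake function stands in for a real call such as function(sql) dbGetQuery(con, sql). Probing one column at a time only catches columns that fail on their own, so an error that depends on column order would still need the grouped tests shown above.

```r
# Hypothetical helper: queries each column on its own and returns those that error.
# `run_query` is any function that takes a SQL string and returns a data frame,
# e.g. function(sql) dbGetQuery(con, sql).
find_failing_columns <- function(columns, run_query, table = "[eCW].[Visits]") {
  failed <- vapply(columns, function(col) {
    # TOP 10 keeps each probe cheap; drop it if the failure depends on specific rows
    sql <- paste0("SELECT TOP 10 [", col, "] FROM ", table)
    result <- tryCatch(run_query(sql), error = function(e) NULL)
    is.null(result)
  }, logical(1))
  columns[failed]
}

# Stand-in for a real connection: pretends that one column raises an error
fake_run_query <- function(sql) {
  if (grepl("LastHIVTestResult", sql, fixed = TRUE)) stop("simulated driver error")
  data.frame(x = 1)
}

bad <- find_failing_columns(c("DW_Id", "VisitDateTime", "LastHIVTestResult"),
                            fake_run_query)
print(bad)
```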
Still another solution is to run separate queries by column type (adjusting the schema query) and stitch the resulting R data frames together with a chain merge on the primary key (assumed to be DW_Id):
final_data <- Reduce(function(x, y) merge(x, y, by="DW_Id"),
list(data_int_columns, # SEPARATE QUERY RESULT WITH DW_Id AND INTs IN SELECT
data_num_columns, # SEPARATE QUERY RESULT WITH DW_Id AND DECIMALs IN SELECT
data_dt_columns, # SEPARATE QUERY RESULT WITH DW_Id AND DATE/TIMEs IN SELECT
data_char_columns) # SEPARATE QUERY RESULT WITH DW_Id AND VARCHARs IN SELECT
)
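One way to produce the separate per-type queries for this merge is to split the schema data frame by DATA_TYPE and prepend the key to each group. This is a sketch under assumptions: build_type_queries is a hypothetical helper, schema_df is a made-up stand-in for visits_schema_df, DW_Id is assumed to be the primary key as above, and the toy data frames only imitate query results.

```r
# Made-up schema rows standing in for visits_schema_df (illustration only)
schema_df <- data.frame(
  COLUMN_NAME = c("DW_Id", "VisitDateTime", "LastHIVTestResult"),
  DATA_TYPE   = c("int", "datetime", "varchar"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one SELECT per data type, always leading with the key
build_type_queries <- function(schema_df, key = "DW_Id", table = "[eCW].[Visits]") {
  non_key <- schema_df[schema_df$COLUMN_NAME != key, ]
  by_type <- split(non_key$COLUMN_NAME, non_key$DATA_TYPE)
  lapply(by_type, function(cols) {
    paste0("SELECT [", paste(c(key, cols), collapse = "], ["), "] FROM ", table)
  })
}

queries <- build_type_queries(schema_df)
print(queries)

# With a live connection: pieces <- lapply(queries, function(q) dbGetQuery(con, q))
# Toy stand-ins for the query results, merged the same way as above
pieces <- list(
  data.frame(DW_Id = 1:2, VisitDateTime = c("2019-07-01", "2019-07-02")),
  data.frame(DW_Id = 1:2, LastHIVTestResult = c("x", "y"))
)
final_data <- Reduce(function(x, y) merge(x, y, by = "DW_Id"), pieces)
print(final_data)
```

Because every piece carries DW_Id, the merge realigns rows by key rather than by position, so the separate queries do not need to return rows in the same order.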