
How to read a large table (>100 columns (variables) and 100,000 observations) from SQL Server into R using the odbc package

I'm getting an error when reading a large table into R from SQL Server.

Here is my connection code:

library(odbc)
library(DBI)
con <- dbConnect(odbc::odbc(), 
     .connection_string = 'driver={SQL Server};server=DW01;database=SFAF_DW;trusted_connection=true')

Here is the schema of my table, which has 149 variables:

(screenshot of the table schema)

data1 <- dbGetQuery(con, "SELECT * FROM [eCW].[Visits]")

I got an error from this code, probably because of the table's size.

I would like to reduce the number of observations by filtering on the "VisitDateTime" variable.

data2 <- dbGetQuery(con, "SELECT cast(VisitDateTime as DATETIME) as VisitDateTime FROM [eCW].[Visits] WHERE VisitDateTime>='2019-07-01 00:00:00' AND VisitDateTime<='2020-06-30 12:00:00'")

This code selects only the "VisitDateTime" variable, but I would like to get all 149 variables from the table.

Hoping to get some efficient code. Greatly appreciate your help on this. Thank you.

According to your schema, you have many variable-length types (varchar) of up to 255 characters. As multiple answers on the similar error post suggest, you cannot rely on the arbitrary order of SELECT * but must explicitly reference each column and place variable-length types toward the end of the SELECT clause. In fact, in application code running SQL, you should generally avoid SELECT * FROM. See Why is SELECT * considered harmful?

Fortunately, using INFORMATION_SCHEMA.COLUMNS (as in your schema output) you can dynamically build such a larger named column list for SELECT. First, adjust and run your schema query as an R data frame, with a calculated column that orders the columns from smallest to largest types and by their precision/lengths.

schema_sql <- "SELECT sub.TABLE_NAME, sub.COLUMN_NAME, sub.DATA_TYPE, sub.SELECT_TYPE_ORDER
                    , sub.CHARACTER_MAXIMUM_LENGTH, sub.CHARACTER_OCTET_LENGTH
                    , sub.NUMERIC_PRECISION, sub.NUMERIC_PRECISION_RADIX, sub.NUMERIC_SCALE
               FROM 
                  (SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE 
                        , CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
                        , NUMERIC_PRECISION, NUMERIC_PRECISION_RADIX, NUMERIC_SCALE
                        , CASE DATA_TYPE
                                WHEN 'tinyint'   THEN 1
                                WHEN 'smallint'  THEN 2
                                WHEN 'int'       THEN 3
                                WHEN 'bigint'    THEN 4
                                WHEN 'date'      THEN 5
                                WHEN 'datetime'  THEN 6
                                WHEN 'datetime2' THEN 7
                                WHEN 'decimal'   THEN 8
                                WHEN 'varchar'   THEN 9
                                WHEN 'nvarchar'  THEN 10
                          END AS SELECT_TYPE_ORDER
                   FROM INFORMATION_SCHEMA.COLUMNS
                   WHERE TABLE_SCHEMA = 'eCW'
                     AND TABLE_NAME = 'Visits'
                  ) sub
               ORDER BY sub.SELECT_TYPE_ORDER
                      , sub.NUMERIC_PRECISION
                      , sub.NUMERIC_PRECISION_RADIX
                      , sub.NUMERIC_SCALE
                      , sub.CHARACTER_MAXIMUM_LENGTH
                      , sub.CHARACTER_OCTET_LENGTH"

visits_schema_df <- dbGetQuery(con, schema_sql)

# BUILD COLUMN LIST FOR SELECT CLAUSE
select_columns <- paste0("[", paste(visits_schema_df$COLUMN_NAME, collapse="], ["), "]")

# RUN QUERY WITH EXPLICIT COLUMNS
data <- dbGetQuery(con, paste("SELECT", select_columns, "FROM [eCW].[Visits]"))
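If the explicit column list works, the date filter from the question can be reused with it. A minimal sketch, assuming the select_columns string built above and the same connection con (the parameter binding via DBI's params argument avoids pasting literal dates into the SQL string):

```r
library(DBI)

# Hypothetical sketch: explicit column list plus the VisitDateTime filter
qry <- paste("SELECT", select_columns,
             "FROM [eCW].[Visits]",
             "WHERE VisitDateTime >= ? AND VisitDateTime <= ?")

# Bind the date bounds as query parameters rather than string literals
data_filtered <- dbGetQuery(con, qry,
                            params = list("2019-07-01 00:00:00",
                                          "2020-06-30 12:00:00"))
```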

The above may need adjustment if the same error arises. Be proactive and test on your end by isolating the problem columns, column types, etc. A few suggestions include filtering on DATA_TYPE or COLUMN_NAME, or moving around the ORDER BY columns in the schema query.

...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND DATA_TYPE IN ('tinyint', 'smallint', 'int')  -- TEST WITH ONLY INTEGER TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('varchar', 'nvarchar')     -- TEST WITHOUT VARIABLE STRING TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('decimal', 'datetime2')    -- TEST WITHOUT HIGH PRECISION TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT COLUMN_NAME IN ('LastHIVTestResult')     -- TEST WITHOUT LARGE VARCHARs
...
ORDER BY sub.SELECT_TYPE_ORDER                         -- ADJUST ORDERING
       , sub.NUMERIC_SCALE                             
       , sub.NUMERIC_PRECISION
       , sub.NUMERIC_PRECISION_RADIX
       , sub.CHARACTER_OCTET_LENGTH
       , sub.CHARACTER_MAXIMUM_LENGTH

Still another solution is to stitch the R data frame together by column types (adjusting the schema query for each group) using a chained merge on the primary key (assumed to be DW_Id):

final_data <- Reduce(function(x, y) merge(x, y, by="DW_Id"),
                     list(data_int_columns,        # SEPARATE QUERY RESULT WITH DW_Id AND INTs IN SELECT
                          data_num_columns,        # SEPARATE QUERY RESULT WITH DW_Id AND DECIMALs IN SELECT 
                          data_dt_columns,         # SEPARATE QUERY RESULT WITH DW_Id AND DATE/TIMEs IN SELECT
                          data_char_columns)       # SEPARATE QUERY RESULT WITH DW_Id AND VARCHARs IN SELECT
              )
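The data_*_columns inputs above are placeholders; each would come from its own query whose SELECT list holds DW_Id plus one type group. A sketch under that assumption (the fetch_group helper and the type groupings are illustrative, derived from the visits_schema_df data frame queried earlier):

```r
library(DBI)

# Illustrative helper: pull DW_Id plus one group of columns per query
fetch_group <- function(con, cols) {
  col_list <- paste0("[", paste(c("DW_Id", cols), collapse = "], ["), "]")
  dbGetQuery(con, paste("SELECT", col_list, "FROM [eCW].[Visits]"))
}

# Split column names by type using the schema data frame
int_cols  <- with(visits_schema_df,
                  COLUMN_NAME[DATA_TYPE %in% c("tinyint", "smallint", "int", "bigint")])
char_cols <- with(visits_schema_df,
                  COLUMN_NAME[DATA_TYPE %in% c("varchar", "nvarchar")])

# One query per type group, each carrying DW_Id for the later merge
data_int_columns  <- fetch_group(con, setdiff(int_cols, "DW_Id"))
data_char_columns <- fetch_group(con, char_cols)
```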
