
How to use fread in R for BigQuery's exported *.csv format?

I exported a very large dataset from Google BigQuery:

  1. I saved my query result to a (new) BigQuery table
  2. then exported that table as split *.csv files (gzip-compressed) to a bucket in GCS
  3. finally downloaded these files locally using gsutil -m cp -R gs://bucketname .
  4. ... now I want to read those *.csv files in R(Studio)!
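Since step 2 produced gzip-compressed files, one quick sanity check on the downloads is to inspect the first two bytes: every gzip stream starts with the magic bytes 1f 8b. A minimal shell sketch (the sample file name is made up for illustration):

```shell
# Create a gzip-compressed file that carries a .csv name, as the BigQuery
# export does, then look at its first two bytes.
printf 'id,name\n1,foo\n' | gzip -c > sample.csv
head -c 2 sample.csv | od -An -tx1   # gzip data shows 1f 8b here
```

A plain-text CSV would instead begin with printable characters (the header row).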

It works when I use read.csv:

tmp_file <- read.csv(path_to_csv_file)

Unfortunately, that's very slow, as we all know, so I want(ed) to use fread():

tmp_file <- fread(path_to_csv_file, verbose = TRUE)

But then it fails! Error output message:

omp_get_num_procs()==12
R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50)
R_DATATABLE_NUM_THREADS==""
omp_get_thread_limit()==2147483647
omp_get_max_threads()==12
OMP_THREAD_LIMIT==""
OMP_NUM_THREADS==""
data.table is using 6 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.
RestoreAfterFork==true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 6 threads (omp_get_max_threads()=12, nth=6)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file /000000000007.csv
  File opened, size = 377.0MB (395347735 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
  File ends abruptly with 'O'. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  No sep and quote rule found a block of 2x2 or greater. Single column input.
  Detected 1 columns on line 1. This line is either column names or first data row. Line starts as: <<>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 1
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (395347735 bytes from row 1 to eof) / (2 * 3 jump0size) == 65891289
  Type codes (jump 000)    : 2  Quote rule 0
  A line with too-many fields (1/1) was found on line 4 of sample jump 2. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 2 of sample jump 4. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 2 of sample jump 7. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 10. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 12. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 14. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 2 of sample jump 16. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 18. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 2 of sample jump 20. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 23. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 25. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 3 of sample jump 28. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 4 of sample jump 30. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 33. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 41. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 3 of sample jump 48. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 4 of sample jump 57. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 58. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 59. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 65. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 2 of sample jump 69. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 5 of sample jump 70. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 2 of sample jump 72. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 74. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 2 of sample jump 75. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 79. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 80. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 83. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 85. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 86. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 3 of sample jump 89. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 94. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 96. Most likely this jump landed awkwardly so type bumps here will be skipped.
  A line with too-many fields (1/1) was found on line 1 of sample jump 98. Most likely this jump landed awkwardly so type bumps here will be skipped.
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (bool8) in the rest of the 6626 sample rows
  =====
  Sampled 6626 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 395347732
  Line length: mean=1.30 sd=17.01 min=0 max=639
  Estimated number of rows: 395347732 / 1.30 = 304460027
  Initial alloc = 334906029 rows (304460027 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 2
[10] Allocate memory for the datatable
  Allocating 1 column slots (1 - 0 dropped) with 334906029 rows
[11] Read the data
  jumps=[0..378), chunk_size=1045893, total_size=395347732
Error in fread(all_csvs[i], integer64 = "character", verbose = TRUE) : 
  Internal error: invalid head position. jump=1, headPos=0000000188EA0003, thisJumpStart=0000000188F9F5EA, sof=0000000188EA0000

When I open a *.csv, it shows hexadecimal encoding (if that helps). (How) can I use fread for this task, or is there any (fast) alternative solution, compared to read.csv, for importing those *.csv files?
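One plausible explanation for both the "hexadecimal" content and the fread crash is that the exported files are still gzip-compressed (step 2 above) while carrying a plain .csv extension, so fread memory-maps raw gzip bytes. A hedged workaround sketch, decompressing by content rather than by file extension (the shard names are hypothetical):

```shell
# zcat identifies gzip by its magic bytes, so the misleading .csv
# extension does not matter. Decompress each exported shard into a
# real plain-text CSV that fread can then parse normally.
for f in 00000000000*.csv; do
  zcat "$f" > "${f%.csv}.plain.csv"
done
```

Alternatively, as far as I know, recent data.table versions can read compressed files directly when the R.utils package is installed, but fread keys off the .gz/.bz2 extension, so the files would first need to be renamed to *.csv.gz.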

Best regards, david

The newly launched vroom package addresses this problem much better. vroom doesn't read the entire file at once; it uses the ALTREP framework to lazily load the data. It also uses multiple threads for indexing, for materializing non-character columns, and when writing, to further improve performance.

Read the Vroom Benchmark for the comparison. It can read files at a speed of 900 MB/sec.

vroom uses the same interface as readr to specify column types.

vroom::vroom("mtcars.tsv",
  col_types = list(cyl = "i", gear = "f", hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
#> # A tibble: 32 x 10
#>   model           mpg   cyl    hp    wt  qsec vs    am    gear   carb
#>   <chr>         <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl> <fct> <int>
#> 1 Mazda RX4      21       6   110  2.62  16.5 FALSE TRUE  4         4
#> 2 Mazda RX4 Wag  21       6   110  2.88  17.0 FALSE TRUE  4         4
#> 3 Datsun 710     22.8     4    93  2.32  18.6 TRUE  TRUE  4         1
#> # … with 29 more rows
