

What exactly is a connection in R?

I've read through and successfully used ?connections in R, but I really don't understand what they are.

I get that I can download a file, read and write a compressed file, and so on. That is, I understand the result of using a connection (open, do stuff, close), but I really don't understand what connections actually do, why you have to open and close them, and so on.

I'm hoping this will also help me understand how to use them more effectively (principally, to understand the mechanics of what is happening so I can debug effectively when something is not working).

Connections were introduced in R 1.2.0 and described by Brian Ripley in the first issue of R NEWS (now called The R Journal) of January 2001 (pages 16-17) as an abstracted interface to IO streams such as a file, URL, socket, or pipe. In 2013, Simon Urbanek added a Connections.h C API which enables R packages to implement custom connection types, such as the curl package.
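Because the interface is abstracted, every connection type is driven by the same generic functions. A minimal sketch using textConnection (an in-memory connection, so it needs no file or network; the example strings are made up for illustration):

```r
# A textConnection treats a character vector as an IO stream,
# so the same readLines()/close() interface applies
con <- textConnection(c("line one", "line two", "line three"))
first <- readLines(con, n = 1)   # read incrementally, like any connection
rest  <- readLines(con)          # read the remainder
close(con)
```
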

One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the readBin, writeBin, readLines and writeLines functions. This allows for asynchronous data processing, for example when dealing with large data or network connections:

# Read the first 30 lines, 10 lines at a time
con <- url("http://jeroen.github.io/data/diamonds.json") 
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)

The same applies to writing, e.g. to a file:

tmp <- file(tempfile())
open(tmp, "w")
writeLines("A line", tmp)
writeLines("Another line", tmp)
close(tmp)
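One caveat with the example above: tempfile() only generates a path, and the example discards it, so the written file cannot be located afterwards. A variation that keeps the path so the result can be read back:

```r
# Keep the temporary path so we can verify the written lines
path <- tempfile()
tmp <- file(path)
open(tmp, "w")
writeLines("A line", tmp)
writeLines("Another line", tmp)
close(tmp)
readLines(path)   # c("A line", "Another line")
```
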

Open the connection as rb or wb to read/write binary data (called raw vectors in R):

# Read the first 3000 bytes, 1000 bytes at a time
con <- url("http://jeroen.github.io/data/diamonds.json") 
open(con, "rb")
data1 <- readBin(con, raw(), n = 1000)
data2 <- readBin(con, raw(), n = 1000)
data3 <- readBin(con, raw(), n = 1000)
close(con)
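The raw bytes can be converted back into text with rawToChar(), assuming they actually encode text. A self-contained variation that reads from a temporary file rather than a URL:

```r
# Write some text, then read the first bytes back in binary mode
path <- tempfile()
writeLines("hello connection", path)
con <- file(path, "rb")
bytes <- readBin(con, raw(), n = 5)
close(con)
rawToChar(bytes)   # "hello"
```
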

The pipe() connection is used to run a system command and pipe text to stdin or from stdout, as you would do with the | operator in a shell. E.g. (let's stick with the curl examples), you can run the curl command line program and pipe the output to R:

con <- pipe("curl -H 'Accept: application/json' https://jeroen.github.io/data/diamonds.json")
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)

Some aspects of connections are a bit confusing: to incrementally read/write data you need to explicitly open() and close() the connection. However, readLines and writeLines automatically open and close (but do not destroy!) an unopened connection. As a result, the example below will read the first 10 lines over and over again, which is not very useful:

con <- url("http://jeroen.github.io/data/diamonds.json") 
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
identical(data1, data2)
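The same gotcha can be reproduced offline with a file connection (a sketch; the temporary file and line counts are made up for illustration):

```r
# An unopened connection is auto-opened and auto-closed by readLines,
# so each call starts reading from the beginning again
path <- tempfile()
writeLines(as.character(1:30), path)
con <- file(path)                # note: never explicitly opened
a <- readLines(con, n = 10)
b <- readLines(con, n = 10)
identical(a, b)                  # TRUE: both reads returned lines 1-10
close(con)
```
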

Another gotcha is that the C API can both close and destroy a connection, but R only exposes a function called close(), which actually means destroy. After calling close() on a connection, it is destroyed and completely useless.
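A quick way to see this, using a throwaway file connection (a sketch; the write is wrapped in try() so the resulting error can be inspected):

```r
con <- file(tempfile(), "w")
close(con)                            # destroys the connection object
res <- try(writeLines("x", con), silent = TRUE)
inherits(res, "try-error")            # TRUE: the connection is unusable
```
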

To stream-process data from a connection you want to use a pattern like this:

stream <- function(){
  con <- url("http://jeroen.github.io/data/diamonds.json")
  open(con, "r")
  on.exit(close(con))
  while(length(txt <- readLines(con, n = 10))){
    some_callback(txt)
  } 
}
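The same pattern works for any connection type. An offline variation that streams a local file, where the callback simply counts lines (stream_file and the counting callback are illustrative, not part of any package):

```r
stream_file <- function(path, callback) {
  con <- file(path)
  open(con, "r")
  on.exit(close(con))            # guarantees cleanup, even on error
  while (length(txt <- readLines(con, n = 10))) {
    callback(txt)                # invoked once per chunk of up to 10 lines
  }
}

# Count total lines, processed in 10-line chunks
path <- tempfile()
writeLines(as.character(1:25), path)
total <- 0
stream_file(path, function(txt) total <<- total + length(txt))
total   # 25
```
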

The jsonlite package relies heavily on connections to import/export ndjson data:

library(jsonlite)
library(curl)
diamonds <- stream_in(curl("https://jeroen.github.io/data/diamonds.json"))

The streaming (by default 1000 lines at a time) makes it fast and memory efficient:

library(nycflights13)
stream_out(flights, file(tmp <- tempfile()))
flights2 <- stream_in(file(tmp))
all.equal(flights2, as.data.frame(flights))

Finally, one nice feature of connections is that the garbage collector will automatically close them if you forget to do so, with an annoying warning:

con <- file(system.file("DESCRIPTION"), open = "r")
rm(con)
gc()
