Equivalent of readLines when using Azure Data Lake and R Server together

Question

Using R Server, I simply want to read raw text (like base readLines) from an Azure Data Lake. I can connect and get at the data like so:

library(RevoScaleR)

rxSetComputeContext("local")

oAuth <- rxOAuthParameters(params)
hdFS <- RxHdfsFileSystem(params)

file1 <- RxTextData("/path/to/file.txt", fileSystem = hdFS)

Executing that line does not actually fetch any data; RxTextData behaves more like a symbolic link. When you run something like:

rxSummary(~. , data=file1)

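the data is then retrieved from the data lake. However, it is always read in and treated as a delimited file. I want to either:

Download the file and store it locally using R code (preferably not), or
Use some sort of readLines equivalent to fetch the data but read it in "raw", so that I can do my own data quality checks.

Does this functionality exist yet? If so, how is it done?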

EDIT: I have also tried:

returnDataFrame = FALSE

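inside RxTextData. This returns a list. But as I've already said, the data is not read from the data lake immediately; it is only fetched once I run something like rxSummary, which then tries to read it as a regular delimited file.

Context: I have a "bad" CSV file that contains line breaks inside double quotes. This causes RxTextData to break. However, my script detects these cases and fixes them accordingly. I therefore don't want RevoScaleR to read in the data and try to interpret the delimiters.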

Answer 1

I found a way to do this by calling the Azure Data Lake Store REST API (adapted from a demo in Hadley Wickham's httr package on GitHub):

library(httpuv)
library(httr)
library(data.table)  # provides fread(), used in the last step

# 1. Insert the app name ----
app_name <- 'Any name'

# 2. Insert the client Id ----
client_id <- 'clientId'

# 3. API resource URI ----
resource_uri <- 'https://management.core.windows.net/'

# 4. Obtain OAuth2 endpoint settings for azure. ----
azure_endpoint <- oauth_endpoint(
    authorize = "https://login.windows.net/<tenantId>/oauth2/authorize",
    access = "https://login.windows.net/<tenantId>/oauth2/token"
    )

# 5. Create the app instance ----
myapp <- oauth_app(
  appname = app_name,
  key = client_id,
  secret = NULL
  )

# 6. Get the token ----
mytoken <- oauth2.0_token(
    azure_endpoint, 
    myapp,
    user_params = list(resource = resource_uri),
    use_oob = FALSE,
    as_header = TRUE,
    cache = FALSE
    )

# 7. Get the file. --------------------------------------------------------
test <- content(GET(
      url = "https://accountName.azuredatalakestore.net/webhdfs/v1/<PATH>?op=OPEN",
      add_headers(
        Authorization = paste("Bearer", mytoken$credentials$access_token),
        `Content-Type` = "application/json"
        )
  )) ## Returns as a binary body.

df <- fread(readBin(test, "character")) ## use readBin to convert to text.
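
If the goal is a readLines-style character vector rather than a parsed table, the raw body returned by content() can be split into lines directly instead of being passed to fread. The following is a minimal sketch in base R, not part of the original answer: the helper name raw_to_lines is hypothetical, and it assumes the body is UTF-8 text with no embedded NUL bytes.

```r
# Hypothetical helper: turn a raw vector (e.g. the binary body returned by
# httr::content()) into a character vector of lines, much like readLines().
raw_to_lines <- function(body, encoding = "UTF-8") {
  stopifnot(is.raw(body))
  if (length(body) == 0L) return(character(0))
  txt <- rawToChar(body)       # errors if the body contains NUL bytes
  Encoding(txt) <- encoding    # assumption: the file is UTF-8 text
  # Split on LF or CRLF only; quotes and delimiters are left untouched.
  strsplit(txt, "\r?\n")[[1]]
}

# Usage with an in-memory stand-in for the downloaded body:
sample_body <- charToRaw("id,comment\r\n1,\"line one\nline two\"\r\n2,ok\n")
lines <- raw_to_lines(sample_body)
print(lines)
```

With the request above, this would be called as raw_to_lines(test). Note that a line break inside a quoted field still splits the record into two elements here, which is what allows a custom quality check to see it.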

Answer 2

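You can do this with ScaleR functions like so. Set the delimiter to a character that does not occur in the data, and ignore the column names. This creates a data frame containing a single character column, which you can manipulate as needed.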

# assuming that the byte 0xFF (255) won't occur in the data
# ("\x255" would parse as "\x25" followed by "5", i.e. "%5")
src <- RxTextData("file", fileSystem="hdfs", delimiter="\xff", firstRowIsColNames=FALSE)

dat <- rxDataStep(src)

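That said, given that Azure Data Lake is really meant for storing large datasets, and this data seems small enough to fit in memory, I wonder why you couldn't just copy it to your local disk.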
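
Once the file is available as one string per physical line (for example the single column of dat, i.e. dat[[1]]), records broken by line breaks inside double quotes can be stitched back together before parsing. This is a minimal sketch in base R and is not taken from either answer: merge_quoted_lines is a hypothetical helper, and it assumes the only defect is line breaks inside double-quoted fields.

```r
# Hypothetical helper: merge physical lines into logical CSV records.
# A record is complete once it holds an even number of double quotes
# (escaped quotes "" come in pairs, so they do not change the parity).
merge_quoted_lines <- function(lines, sep = " ") {
  if (length(lines) == 0L) return(character(0))
  n_quotes <- lengths(regmatches(lines, gregexpr('"', lines, fixed = TRUE)))
  # A line with an odd quote count toggles the "inside a quoted field" state.
  odd <- n_quotes %% 2L == 1L
  # State after each line: TRUE means a quoted field is still open.
  open_after <- cumsum(odd) %% 2L == 1L
  # A record starts on line 1 and after every line that leaves no field open.
  starts <- c(TRUE, !open_after[-length(lines)])
  record_id <- cumsum(starts)
  unname(vapply(split(lines, record_id), paste, character(1), collapse = sep))
}

# Usage: the second record is split across two physical lines.
broken <- c("id,comment", "1,\"line one", "line two\"", "2,ok")
records <- merge_quoted_lines(broken)
print(records)
parsed <- read.csv(text = records, stringsAsFactors = FALSE)
```

Joining with a space is an arbitrary choice; passing sep = "\n" instead would preserve the original line break inside the field.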
