
Read tab delimited text file in Spark R

I have a tab-delimited file saved as a .txt, with quotation marks (" ") around the string variables. The file can be found here.

I am trying to read it into Spark-R (version 3.1.2), but cannot successfully bring it into the environment. I've tried variations of the read.df code, like this:

df <- read.df(path = "FILE.txt", header="True", inferSchema="True", delimiter = "\\t", encoding="ISO-8859-15")

df <- read.df(path = "FILE.txt", source = "txt", header="True", inferSchema="True", delimiter = "\\t", encoding="ISO-8859-15")

I have had success bringing in CSVs with read.csv, but many of the files I have are over 10 GB, and it is not practical to convert them to CSV before bringing them into Spark-R.

EDIT: When I run read.df I get a laundry list of errors, starting with this:

[Screenshot of the error output]

I am able to bring in the csv files used in a previous project with both read.df and read.csv, so I don't think it's a Java issue.

If you don't need to specifically use Spark R, then base R's read.table should work just fine for the .txt you provided. Note that the file is tab-delimited, so this should be specified.

Something like this should work:

dat <- read.table("FILE.TXT",  
                  sep="\t",
                  header=TRUE)
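
If you do need SparkR for the larger files, a minimal sketch worth trying (untested against your file, and assuming Spark's built-in csv data source): when source is omitted, read.df falls back to the default source (Parquet), and "txt" is not a registered source name, so point it at "csv" and pass the tab as the delimiter. Note that in R, "\t" is already a literal tab character, so the extra backslash in "\\t" is unnecessary.

library(SparkR)
sparkR.session()

# Read the tab-delimited .txt through Spark's csv data source,
# which handles arbitrary delimiters; "\t" is a literal tab in R.
df <- read.df(path = "FILE.txt",
              source = "csv",
              header = "true",
              inferSchema = "true",
              delimiter = "\t",
              encoding = "ISO-8859-15")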
