Parsing last comment line before data using readLines in R
I have a long data file:
# Comment line 1
# Comment line 2
# ... many more lines
# values intensities
5.556667e+00 4.008450e+02
5.581000e+00 4.008770e+02
... many more values
# End comments
I would like to create a function that, applied to this file, returns:
[1] "values" "intensities"
What would you suggest?
You can read the data with readLines and then grep for the comment character. In the function below, the comment character defaults to "#", as in the question.
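fun <- function(file, char = "#"){
  x <- readLines(con = file)
  # line numbers of the lines that contain the comment character
  i <- grep(char, x, fixed = TRUE)
  # comment line just before a gap in those numbers, i.e. just before the data
  y <- x[i[which(diff(i) != 1)]]
  unlist(strsplit(y, " "))[-1]
}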
fun("filename.txt")
#[1] "values" "intensities"
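The gap-detection idea used by the function above can be traced step by step on an in-memory vector. This is only an illustrative sketch: the vector `x` is a hypothetical, shortened copy of the sample file, and the names `idx`, `gap`, `header` and `col_names` are not from the original answer. Indexing through `idx` is a choice made here so that the lookup also works when the comments do not start on line 1.

```r
# Hypothetical in-memory copy of the sample file, one element per line.
x <- c("# Comment line 1",
       "# Comment line 2",
       "# values intensities",
       "5.556667e+00 4.008450e+02",
       "5.581000e+00 4.008770e+02",
       "# End comments")

# Line numbers of the lines that contain the comment character.
idx <- grep("#", x, fixed = TRUE)

# A jump larger than 1 between consecutive comment lines marks the data block.
gap <- which(diff(idx) != 1)

# The comment line just before that jump holds the column names.
header <- x[idx[gap]]
col_names <- unlist(strsplit(header, " "))[-1]
print(col_names)
```

Note that this approach only finds a gap when at least one comment line follows the data, as "# End comments" does in the sample file.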
If the data file is too long to fit in memory and awk is available, the following solution reads the header line without running into memory problems.
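read_awk <- function(file, char = "#"){
  cmd <- "awk"
  # awk pattern: first line that does not start with the comment character
  pattern <- paste0("/^[^", char, "]/")
  # print the number of the preceding line (the last leading comment) and stop
  awkcmd <- paste0("'", pattern, " {print NR - 1; exit 0}'")
  args <- c(awkcmd, shQuote(file))
  out <- system2(command = cmd, args = args, stdout = TRUE)
  as.integer(out)
}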
fun_awk <- function(file, char = "#"){
  n <- read_awk(file, char = char)
  # skip to line n (the last leading comment line) and read only that line
  x <- scan(file = file, what = character(), sep = "\n", skip = n - 1, nlines = 1)
  unlist(strsplit(x, " "))[-1]
}
fun_awk("filename.txt")
#Read 1 item
#[1] "values" "intensities"
Here "filename.txt" is the following file:
# Comment line 1
# Comment line 2
# ... many more lines
# values intensities
5.556667e+00 4.008450e+02
5.581000e+00 4.008770e+02
# End comments
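If awk is not available, a similar low-memory effect can be approximated in base R by reading from a connection one line at a time and stopping at the first data line. The following is a minimal sketch under that assumption, not part of the original answer; `header_names` and its `path` argument are hypothetical names, and the temporary file merely reproduces the sample above.

```r
# Hypothetical helper: return the column names from the last comment line
# that precedes the first data line, reading one line at a time.
header_names <- function(path, char = "#") {
  con <- file(path, open = "r")
  on.exit(close(con))
  last_comment <- NULL
  repeat {
    line <- readLines(con, n = 1)
    # Stop at end of file or at the first line that is not a comment.
    if (length(line) == 0 || !startsWith(line, char)) break
    last_comment <- line
  }
  if (is.null(last_comment)) return(character(0))
  # Drop the comment character, then split on runs of whitespace.
  body <- trimws(substring(last_comment, nchar(char) + 1))
  strsplit(body, "\\s+")[[1]]
}

# Usage on a temporary copy of the sample file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("# Comment line 1", "# Comment line 2", "# values intensities",
             "5.556667e+00 4.008450e+02", "5.581000e+00 4.008770e+02",
             "# End comments"), tmp)
result <- header_names(tmp)
print(result)
```

Unlike the gap-based approach, this sketch does not need a trailing comment line, because it stops as soon as the leading comment block ends.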
Depending on how many spaces separate the columns, you may want to use a regular expression here:
library(tibble)
library(tidyr)  # separate() and the %>% pipe
data <- as_tibble(read.delim('test.txt', header = FALSE, stringsAsFactors = FALSE))
data <- data[!startsWith(data$V1, '#'), ] %>%
  separate(V1, into = c('values', 'intensities'), sep = '\\s+')
data
# A tibble: 2 x 2
values intensities
<chr> <chr>
1 5.556667e+00 4.008450e+02
2 5.581000e+00 4.008770e+02
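As the tibble output shows, both columns come back as character. If numeric columns are wanted, one possible base R alternative is to let read.table discard the comment lines itself. This is a sketch, not the answerer's method: the temporary file is a hypothetical stand-in for 'test.txt', and the column names are typed in by hand from the question rather than parsed from the header.

```r
# Hypothetical sample file written to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("# Comment line 1", "# values intensities",
             "5.556667e+00 4.008450e+02", "5.581000e+00 4.008770e+02",
             "# End comments"), tmp)

# read.table ignores everything from comment.char to the end of the line and
# splits on any run of whitespace by default (sep = "").
dat <- read.table(tmp, comment.char = "#",
                  col.names = c("values", "intensities"))
str(dat)
```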