How to count the number of non-empty fields in a delimited file?
You can count the number of fields per line in a comma-, tab-, or otherwise delimited text file using utils::count.fields.
Here's a reproducible example:
d <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, "b", "c", NA, NA),
  z = c(NA, "beta", "gamma", NA, "epsilon")
)
fname <- "test.csv"
write.csv(d, fname, na = "", row.names = FALSE)
count.fields(fname, sep = ",")
## [1] 3 3 3 3 3 3
I want to calculate the number of non-empty fields per line. I can do this in a clunky way by reading in everything and counting the number of values that aren't NA.
d2 <- read.csv(fname, na.strings = "")
rowSums(!is.na(d2))
## [1] 1 2 3 0 2
I'd really like a way of scanning the file (like count.fields) so I can target specific sections to read in.
Is there a better way of counting the number of non-empty fields in a delimited file?
This should be completely portable provided you have the Rcpp and BH packages installed (the code below also loads inline, which supplies rcpp()):
library(Rcpp)
library(inline)

csvblanks <- '
  string data = as<string>(filename);
  ifstream fil(data.c_str());
  if (!fil.is_open()) return(R_NilValue);

  typedef tokenizer< escaped_list_separator<char> > Tokenizer;
  vector<string> fields;
  vector<int> retval;
  string line;

  while (getline(fil, line)) {
    int numblanks = 0;
    Tokenizer tok(line);
    for (Tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
      numblanks += (beg->length() == 0) ? 1 : 0;
    }
    retval.push_back(numblanks);
  }

  return(wrap(retval));
'

count_blanks <- rcpp(
  signature(filename="character"),
  body=csvblanks,
  includes=c("#include <iostream>",
             "#include <fstream>",
             "#include <vector>",
             "#include <string>",
             "#include <algorithm>",
             "#include <iterator>",
             "#include <boost/tokenizer.hpp>",
             "using namespace Rcpp;",
             "using namespace std;",
             "using namespace boost;")
)
Once that's sourced you can call count_blanks(FULLPATH) and it will return a numeric vector of counts of blank fields per line.
I ran it against this file:
"DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"
1,2,3,4,5
1,,3,4,5
1,2,3,4,5
1,2,,4,5
1,2,3,4,5
1,2,3,,5
1,2,3,4,5
1,2,3,4,
1,2,3,4,5
1,,3,,5
1,2,3,4,5
,2,,4,
1,2,3,4,5
via:
count_blanks("/tmp/a.csv")
## [1] 0 0 1 0 1 0 1 0 1 0 2 0 3 0
CAVEATS
It does not skip the header line (the first count in the output above belongs to the header), so it could use a header logical parameter with associated C/C++ code (which will be pretty straightforward).
If you want to count whitespace-only fields (i.e. [:space:]+) as "empty", you'll need something a bit more complex than the call to length. This is one potential way to deal with it if you need to.
It uses the default configuration of escaped_list_separator, which is defined here. That can also be customized with quote & separator characters (making it possible to further mimic read.csv / read.table).
This will more closely approach count.fields / C_countfields performance and will eliminate the need to consume memory by reading in every line just to find the lines you eventually want to target. I don't think preallocating space for the returned numeric vector will add much to the speed, but you can see the discussion here, which shows how to do so if need be.