How do I count the number of non-empty fields in a delimited file?

Question

You can count the number of fields per line in a comma-, tab-, or otherwise delimited text file using utils::count.fields.

Here's a reproducible example:

d <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, "b", "c", NA, NA),
  z = c(NA, "beta", "gamma", NA, "epsilon")
)

fname <- "test.csv"
write.csv(d, fname, na = "", row.names = FALSE)
count.fields(fname, sep = ",")
## [1] 3 3 3 3 3 3

I want to calculate the number of non-empty fields per line. I can do this in a clunky way by reading in everything and counting the number of values that aren't NA.

d2 <- read.csv(fname, na.strings = "")
rowSums(!is.na(d2))
## [1] 1 2 3 0 2

I'd really like a way of scanning the file (like count.fields) so I can target specific sections to read in.

Is there a better way of counting the number of non-empty fields in a delimited file?

Answer 1

This should be completely portable provided you have the Rcpp, inline & BH packages installed:

library(Rcpp)
library(inline)

csvblanks <- '
// Open the file; return NULL to R if it cannot be opened.
string data = as<string>(filename);
ifstream fil(data.c_str());
if (!fil.is_open()) return(R_NilValue);

// Default escaped_list_separator: comma separator, double-quote quoting, backslash escape.
typedef tokenizer< escaped_list_separator<char> > Tokenizer;

vector<int> retval;
string line;

// Count the zero-length fields on each line.
while (getline(fil, line)) {
  int numblanks = 0;
  Tokenizer tok(line);
  for (Tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
    if (beg->length() == 0) ++numblanks;
  }
  retval.push_back(numblanks);
}
return(wrap(retval));
'

count_blanks <- rcpp(
  signature(filename="character"),
  body=csvblanks,
  includes=c("#include <iostream>",
             "#include <fstream>",
             "#include <vector>",
             "#include <string>",
             "#include <algorithm>",
             "#include <iterator>",
             "#include <boost/tokenizer.hpp>",
             "using namespace Rcpp;",
             "using namespace std;",
             "using namespace boost;")
)

Once that's sourced you can call count_blanks(FULLPATH) and it will return a numeric vector of counts of blank fields per line.

I ran it against this file:

"DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"
1,2,3,4,5
1,,3,4,5
1,2,3,4,5
1,2,,4,5
1,2,3,4,5
1,2,3,,5
1,2,3,4,5
1,2,3,4,
1,2,3,4,5
1,,3,,5
1,2,3,4,5
,2,,4,
1,2,3,4,5

via:

count_blanks("/tmp/a.csv")
## [1] 0 0 1 0 1 0 1 0 1 0 2 0 3 0

CAVEATS

It's fairly obvious that it's not ignoring the header, so it could use a header logical parameter with associated C/C++ code (which will be pretty straightforward).
If you're counting "spaces" (i.e. [:space:]+) as "empty" you'll need something a bit more complex than the call to length. This is one potential way to deal with it if you need to.
It's using the default configuration for the Boost function escaped_list_separator, which is defined here. That can also be customized with quote & separator characters (making it possible to further mimic read.csv / read.table).

This will more closely approach count.fields / C_countfields performance, and it removes the need to consume memory by reading in every line just to find the lines you eventually want to target. I don't think preallocating space for the returned numeric vector will add much to the speed, but you can see the discussion here, which shows how to do so if need be.
