Read a CSV file in MATLAB with rows of different sizes

I need to read a big CSV file in MATLAB which contains rows like this:

1, 0, 1, 0, 1
1, 0, 1, 0, 1, 0, 1, 0, 1
1, 0, 1
1, 0, 1
1, 0, 1, 0, 1
0, 1, 0, 1, 0, 1, 0, 1, 0

For reading big files I'm using textscan; however, I have to define the number of expected parameters in each line of the text file.

Using csvread works, but it is too slow and does not seem efficient. Is there any way to use textscan with an unknown number of inputs on each line? Or do you have any other suggestion for this situation?

Since you said "Numerical matrix padded with zeros would be good", there is a solution using textscan which can give you that. The catch, however, is that you have to know the maximum number of elements a line can have (i.e. the longest line in your file).
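If the maximum is not known in advance, one possible way to find it is a quick pre-pass over the file that counts the commas on each line. The following is only a minimal sketch, assuming plain comma-separated numbers with no quoted fields; the temporary file and the name demoFile are there purely so the snippet runs on its own, and in practice you would point fopen at your real file.

```matlab
% Recreate a few sample rows from the question in a temporary file (demo only).
demoFile = [tempname '.txt'] ;
fid = fopen(demoFile,'w') ;
fprintf(fid,'%s\n','1, 0, 1, 0, 1','1, 0, 1, 0, 1, 0, 1, 0, 1','1, 0, 1') ;
fclose(fid) ;

% Pre-pass: number of fields on a line = number of commas + 1.
maxNumberOfElementPerLine = 0 ;
fid = fopen(demoFile,'r') ;
tline = fgetl(fid) ;
while ischar(tline)            % fgetl returns -1 (not char) at end of file
    nFields = nnz(tline == ',') + 1 ;
    maxNumberOfElementPerLine = max(maxNumberOfElementPerLine, nFields) ;
    tline = fgetl(fid) ;
end
fclose(fid) ;
delete(demoFile) ;             % clean up the demo file
```

This reads the file twice overall (once to count, once with textscan), so on very large files it may be preferable to simply overestimate the maximum if a safe upper bound is known.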

Provided you know that, a combination of additional textscan parameters allows you to read incomplete lines:

If you set the parameter 'EndOfLine','\r\n', the documentation explains:

"If there are missing values and an end-of-line sequence at the end of the last line in a file, then textscan returns empty values for those fields. This ensures that individual cells in output cell array, C, are the same size."

So with the example data from your question saved as differentRows.txt, the following code:

% be sure about this, better to overestimate than underestimate
maxNumberOfElementPerLine = 10 ;

% build a reading format which can accommodate the longest line
readFormat = repmat('%f',1,maxNumberOfElementPerLine) ;

fidcsv = fopen('differentRows.txt','r') ;

M = textscan( fidcsv , readFormat , Inf ,...
    'delimiter',',',...
    'EndOfLine','\r\n',...
    'CollectOutput',true) ;

fclose(fidcsv) ;
M = cell2mat(M) ; % convert to numerical matrix

will return:

>> M
M =
     1     0     1     0     1   NaN   NaN   NaN   NaN   NaN
     1     0     1     0     1     0     1     0     1   NaN
     1     0     1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
     1     0     1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
     1     0     1     0     1   NaN   NaN   NaN   NaN   NaN
     0     1     0     1     0     1     0     1     0   NaN
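Because the maximum was deliberately overestimated, the result can end with columns that contain nothing but NaN (the last column in the output above). If those are not wanted, they could be dropped afterwards. A small sketch, using a hand-written stand-in matrix rather than the real file, and the illustrative name emptyCols:

```matlab
% Small stand-in for the matrix returned by textscan (padding shown as NaN).
M = [1 0 1 NaN NaN ;
     1 0 1 0   NaN ;
     0 1 0 NaN NaN] ;

% A column that is NaN in every row carries no data, so remove it.
emptyCols = all(isnan(M), 1) ;
M(:, emptyCols) = [] ;

disp(size(M))   % M is now 3-by-4: only the all-NaN last column was removed
```

Columns that are NaN in only some rows are kept, so the padding of the shorter rows is preserved.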

As an alternative, if it makes a significant speed difference, you could import your data as integers instead of doubles. The trouble with that is that NaN is not defined for integers, so you have 2 options:

  • 1) Leave the empty entries at the default value, 0

Just replace the line which defines the format specifier with:

% build a reading format which can accommodate the longest line
readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;

This will return:

>> M
M =
1   0   1   0   1   0   0   0   0   0
1   0   1   0   1   0   1   0   1   0
1   0   1   0   0   0   0   0   0   0
1   0   1   0   0   0   0   0   0   0
1   0   1   0   1   0   0   0   0   0
0   1   0   1   0   1   0   1   0   0

  • 2) Replace the empty entries with a placeholder (e.g. 99)

Define a value which you are sure will never appear in your original data (for quick identification of empty cells), then use the EmptyValue parameter of the textscan function:

readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;
DefaultEmptyValue = 99 ; % placeholder for "empty values"

fidcsv = fopen('differentRows.txt','r') ;
M = textscan( fidcsv , readFormat , Inf ,...
    'delimiter',',',...
    'EndOfLine','\r\n',...
    'CollectOutput',true,...
    'EmptyValue',DefaultEmptyValue) ;

fclose(fidcsv) ;
M = cell2mat(M) ; % convert to numerical matrix

will yield:

>> M
M =
1   0   1   0   1   99  99  99  99  99
1   0   1   0   1   0   1   0   1   99
1   0   1   99  99  99  99  99  99  99
1   0   1   99  99  99  99  99  99  99
1   0   1   0   1   99  99  99  99  99
0   1   0   1   0   1   0   1   0   99
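Once the padding is marked with a placeholder, the empty cells can be located with a simple comparison. A brief sketch, again on a hand-written stand-in matrix; the names isPadding and nValuesPerRow are only illustrative, and it assumes 99 never occurs in the real data:

```matlab
DefaultEmptyValue = 99 ;   % same placeholder as above

% Small stand-in for the integer matrix returned by textscan with '%d'.
M = int32([1 0 1 99 99 ;
           1 0 1 0  1  ;
           0 1 99 99 99]) ;

% Logical mask of the padded cells, and the count of real values in each row.
isPadding     = (M == DefaultEmptyValue) ;
nValuesPerRow = sum(~isPadding, 2) ;   % [3; 5; 2] for this stand-in
```

The mask can then be used to exclude the padding from later computations, for example M(~isPadding) to collect only the real entries.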
