
Reading csv files in scipy/numpy in Python

I am having trouble reading a tab-delimited csv file in python. I use the following function:

from numpy import genfromtxt, array

def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)

The problem is that genfromtxt complains about my files, e.g. with the error:

Line #27100 (got 12 columns instead of 16)

I am not sure where these errors come from. Any ideas?

Here's an example file that causes the problem:

#Gene   120-1   120-3   120-4   30-1    30-3    30-4    C-1 C-2 C-5 genesymbol  genedesc
ENSMUSG00000000001  7.32    9.5 7.76    7.24    11.35   8.83    6.67    11.35   7.12    Gnai3   guanine nucleotide binding protein alpha
ENSMUSG00000000003  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn    probasin

Is there a better way to write a generic csv2array function? Thanks.

Check out the python CSV module: http://docs.python.org/library/csv.html

import csv
reader = csv.reader(open("myfile.csv", "rb"), 
                    delimiter='\t', quoting=csv.QUOTE_NONE)

header = []
records = []
fields = 16

if thereIsAHeader: header = reader.next()

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" %
            (record, len(record), fields)
    else:
        records.append(record)

# do numpy stuff.
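
For the "# do numpy stuff" step, one minimal sketch follows; the column slicing is only an assumption based on the example file in the question (an ID column, then nine numeric columns, then two text columns):

import numpy as np

data = np.array(records)              # all fields as strings
ids = data[:, 0]                      # first column: gene IDs
values = data[:, 1:10].astype(float)  # numeric block
annotations = data[:, 10:]            # trailing text columns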

May I ask why you're not using the built-in csv reader? http://docs.python.org/library/csv.html

I've used it very effectively with numpy/scipy. I would share my code, but unfortunately it's owned by my employer; it should be very straightforward to write your own, though.

I think Nick T's approach would be the better way to go. I would make one change: I would replace the following code:

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

with

import numpy as np
records = np.asarray([row for row in reader if len(row) == fields])
print('Number of skipped records: %i' % (len(reader) - len(records)))  # note: you have to do more than len(reader) here, as an iterator does not have a length like a list or tuple

The list comprehension plus np.asarray will return a numpy array and take advantage of pre-compiled libraries, which should speed things up greatly. Also, I would recommend using print() as a function rather than the print statement, since the former is the standard for Python 3, which is most likely the future, and I would use logging over print.
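
A rough illustration of the logging suggestion (assuming reader, fields, and np from the snippets above; the logger configuration is an arbitrary choice, not part of the original answer):

import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
log = logging.getLogger(__name__)

rows = list(reader)    # materialize the iterator so it can be counted
records = np.asarray([row for row in rows if len(row) == fields])
log.info('Number of skipped records: %i', len(rows) - len(records))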

Likely it came from line #27100 in your data file... and it had 12 columns instead of 16, i.e. it had:

separator,1,2,3,4,5,6,7,8,9,10,11,12,separator

And it was expecting something like this:

separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator

I'm not sure how you want to convert your data, but if you have irregular line lengths, the easiest way would be something like this:

# f is an open file object; 'someseparator' is whatever separates your records
lines = f.read().split('someseparator')
for line in lines:
    splitline = line.split(',')
    # do something with splitline
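
Alternatively, if you want to stay with genfromtxt, a sketch under the assumption of a recent-enough numpy: its invalid_raise flag can be set to False so that lines with the wrong number of columns are skipped with a warning rather than raising an error (the filename here is just a placeholder):

import numpy as np

# skip malformed lines instead of failing on them
data = np.genfromtxt('file.txt', dtype=None, delimiter='\t',
                     names=True, invalid_raise=False)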

I have successfully used two methodologies: (1) if I simply need to read an arbitrary CSV, I use the CSV module (as pointed out by other users), and (2) if I require repeated processing of a known CSV (or any) format, I write a simple parser.

It seems that your problem fits in the second category, and a parser should be very simple:

f = open('file.txt', 'r').readlines()
for line in f:
    tokens = line.strip().split('\t')
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:10]]
    stuff = tokens[10:]
    # do something with gene, vals, and stuff

You can add a line in the reader for skipping comments (`if tokens[0] == '#': continue`) or to handle blank lines (`if tokens == ['']: continue`). You get the idea.
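
Putting those checks together with the parser above, a minimal sketch (the column slicing again assumes the layout of the example file in the question):

import numpy as np

genes, values, annotations = [], [], []
for line in open('file.txt', 'r'):
    tokens = line.strip().split('\t')
    if tokens == ['']:               # blank line
        continue
    if tokens[0].startswith('#'):    # comment / header line
        continue
    genes.append(tokens[0])
    values.append([float(k) for k in tokens[1:10]])
    annotations.append(tokens[10:])

values = np.array(values)            # numeric block as a 2-D float array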
