
Slice specific characters in CSV using Python

I have data in tab-delimited format that looks like:

0/0:23:-1.03,-7.94,-83.75:69.15    0/1:34:-1.01,-11.24,-127.51:99.00    0/0:74:-1.02,-23.28,-301.81:99.00

I am only interested in the first 3 characters of each entry (i.e. 0/0 and 0/1). I figured the best way to do this would be to use re.match together with genfromtxt in numpy. This example is as far as I have gotten:

import re
from numpy import genfromtxt

csvfile = 'home/python/batch1.hg19.table'
data = genfromtxt(csvfile, delimiter="\t", dtype=None)
for i in data[1]:
    m = re.match('[0-9]/[0-9]', i)
    if m:
        print m.group(0),
    else:
        print "NA",

This works for the first row of the data, but I am having a hard time figuring out how to extend it to every row of the input file.

Should I make it a function and apply it to each row separately, or is there a more Pythonic way to do this?

Unless you really want to use NumPy, try this:

# the with block closes the file when the loop finishes
with open('home/python/batch1.hg19.table') as f:
    for line in f:
        for cell in line.split('\t'):
            print(cell[:3])

This just iterates through each line of the file, tokenizes the line using the tab character as the delimiter, then prints the slice of the text you are looking for.
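If you would rather collect the values than print them, the same loop can be written with the standard library's csv module, which handles the tab splitting and line endings for you. This is only a sketch: the helper name first_three and the in-memory sample are made up for illustration, and in practice you would pass an open file object instead.

```python
import csv
import io


def first_three(lines):
    """Return the first three characters of every tab-separated cell."""
    reader = csv.reader(lines, delimiter="\t")
    return [[cell[:3] for cell in row] for row in reader]


# Hypothetical two-line sample standing in for the real file.
sample = io.StringIO(
    "0/0:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\n"
    "0/1:12:-1.00,-2.00,-3.00:50.00\t0/0:74:-1.02,-23.28,-301.81:99.00\n"
)

rows = first_three(sample)
print(rows)  # [['0/0', '0/1'], ['0/1', '0/0']]
```

The commas inside each cell are left alone because the delimiter is set to the tab character.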

Numpy is great when you want to load in an array of numbers. The format you have here is too complicated for numpy to recognize, so you just get an array of strings. That's not really playing to numpy's strength.

Here's a simple way to do it without numpy:

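import re

csvfile = 'home/python/batch1.hg19.table'
result = []
with open(csvfile, 'r') as f:
    for line in f:
        row = []
        for text in line.split('\t'):
            # keep the first digit/digit pair found in the cell, else "NA"
            match = re.search('([0-9]/[0-9])', text)
            if match:
                row.append(match.group(1))
            else:
                row.append("NA")
        result.append(row)
print(result)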

yields

# [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]

on this data:

0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00   0/0:74:-1.02,-23.28,-301.81:99.00
---:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00   0/0:74:-1.02,-23.28,-301.81:99.00
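To address the "should I make it a function" part of the question: one possible arrangement is a small helper for a single cell plus a list comprehension over the rows. The names genotype and parse_lines are hypothetical, and the sample strings are just the two data lines above with explicit tabs. Note that this sketch uses match (anchored at the start of the cell, as in the question) rather than search.

```python
import re

# Anchored at the start of the cell by using .match() below.
GENOTYPE = re.compile(r"[0-9]/[0-9]")


def genotype(cell, missing="NA"):
    """Return the leading digit/digit pair of a cell, or a placeholder."""
    m = GENOTYPE.match(cell)
    return m.group(0) if m else missing


def parse_lines(lines):
    """Apply genotype() to every tab-separated cell of every line."""
    return [
        [genotype(cell) for cell in line.rstrip("\n").split("\t")]
        for line in lines
    ]


# The two sample rows shown above, written with explicit tabs.
sample = [
    "0/0:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\t0/0:74:-1.02,-23.28,-301.81:99.00\n",
    "---:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\t0/0:74:-1.02,-23.28,-301.81:99.00\n",
]

result = parse_lines(sample)
print(result)  # [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]
```

Since parse_lines accepts any iterable of strings, an open file object can be passed to it directly.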

It's pretty easy to parse the whole file without regular expressions:

with open('yourfile') as f:
    for line in f.read().split('\n'):
        for token in line.split('\t'):
            # empty cells are reported as N/A
            print(token[:3] if token else 'N/A')

I haven't written Python in a while, but I would probably write it like this:

# the with block closes the file, so no explicit close() is needed
with open("home/python/batch1.hg19.table") as f:
    for line in f:
        columns = line.split("\t")
        for column in columns:
            print(column[:3])

Of course, if you need to validate the first three characters, you'll still need the regex.
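For what that validation might look like, here is a rough sketch that slices first and then checks the slice against the digit/digit pattern from the question. The helper name checked_prefix and the test cells are invented for the example, and re.fullmatch requires Python 3.4 or later.

```python
import re

VALID = re.compile(r"[0-9]/[0-9]")


def checked_prefix(cell, missing="NA"):
    """Return cell[:3] if it looks like digit/digit, else a placeholder."""
    prefix = cell[:3]
    return prefix if VALID.fullmatch(prefix) else missing


# Hypothetical cells: one valid, one masked, one truncated, one empty.
cells = ["0/1:34:-1.01", "---:23:-1.03", "0/", ""]
checked = [checked_prefix(c) for c in cells]
print(checked)  # ['0/1', 'NA', 'NA', 'NA']
```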
