Python: Ignoring specific rows in a csv file

I am trying to create a simple line graph to compare columns from two files. I have written some code and would like to know how to ignore lines in the two .csv files that I have. The code is as follows:

import numpy as np
import csv
from matplotlib import pyplot as plt

def read_cell(x, y):
        with open('Illumina_Heart_Gencode_Paired_End_Novel_Junctions.csv', 'r') as f:
                reader = csv.reader(f)
                y_count = 0
                for n in reader:
                        if y_count == y:
                                cell = n[x]
                                return cell
                        y_count += 1
print(read_cell(6, 932))

def read_cell(x, y):
        with open('Illumina_Heart_RefSeq_Paired_End_Novel_Junctions.csv', 'r') as f:
                reader = csv.reader(f)
                y_count = 0
                for n in reader:
                        if y_count == y:
                                cell = n[x]
                                return cell
                        y_count += 1
print(read_cell(6, 932))


d1 = []
for i in set1:
    try:
        d1.append(float(i[5]))
    except ValueError:
        continue

d2 = []
for i in set2:
    try:
        d2.append(float(i[5]))
    except ValueError:
        continue

min_len = len(d1)
if len(d2) < min_len:
    min_len = len(d2)
d1 = d1[0:min_len]
d2 = d2[0:min_len]

plt.plot(d1, d2, 'r*')
plt.plot(d1, d2, 'b-')
plt.xlabel('Data Set 1: PE_NJ')
plt.ylabel('Data Set 2: PE_SJ')
plt.show()

The first csv file has 932 rows and the second one has 99,154 rows. I am only interested in taking the first 932 rows from both files and then want to compare the 7th column in both files.

How do I go about doing that?

The first file looks like this:

chr1    1718493 1718764 2   2   0   12  0   24
chr1    8928117 8930883 2   2   0   56  0   24
chr1    8930943 8931949 2   2   0   48  0   25
chr1    9616316 9627341 1   1   0   12  0   24
chr1    10166642    10167279    1   1   0   31  1   24

The second file looks like this:

chr1    880181  880421  2   2   0   15  0   21
chr1    1718493 1718764 2   2   0   12  0   24
chr1    8568735 8585817 2   2   0   12  0   21
chr1    8617583 8684368 2   2   0   14  0   23
chr1    8928117 8930883 2   2   0   56  0   24

One possible approach would be to read all lines from the first (shorter) file, find its length (N), read N lines from the second file, and take the k-th column you are interested in from both files.

Something like this (adjust the delimiter for your case):

import csv

def read_tsv_file(fname):  # reads the full contents of a tab-separated file (like you have)
    with open(fname, newline='') as f:  # text mode: csv.reader expects str rows in Python 3
        return list(csv.reader(f, delimiter='\t'))

def take_nth_column(first_array, second_array, n):  # returns a tuple containing the nth columns (0-based) from both arrays, truncated to the length of the smaller array
    min_len = min(len(first_array), len(second_array))
    col1 = [row[n] for row in first_array[:min_len]]
    col2 = [row[n] for row in second_array[:min_len]]
    return (col1, col2)


first_array = read_tsv_file('your-first-file')
second_array = read_tsv_file('your-second-file')
(col1, col2) = take_nth_column(first_array, second_array, 6)  # index 6 is the 7th column
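
The approach described above reads only N lines from the second file, which matters when that file has 99,154 rows. A minimal sketch of that variant, assuming Python 3 and tab-separated input; the helper names `read_first_rows` and `nth_column_pair` are made up for illustration and are not part of the answer:

```python
import csv
from itertools import islice

def read_first_rows(fname, limit=None, delimiter='\t'):
    """Return at most `limit` rows of a delimited file (all rows if limit is None)."""
    with open(fname, newline='') as f:
        # islice stops pulling rows from the reader once `limit` rows are taken
        return list(islice(csv.reader(f, delimiter=delimiter), limit))

def nth_column_pair(short_file, long_file, n, delimiter='\t'):
    """Return column n (0-based) from both files, truncated to the shorter one."""
    first = read_first_rows(short_file, delimiter=delimiter)
    # Read no more rows from the second file than the first file contains
    second = read_first_rows(long_file, limit=len(first), delimiter=delimiter)
    # Guard against the "long" file actually being the shorter one
    first = first[:len(second)]
    return [row[n] for row in first], [row[n] for row in second]
```

Calling `nth_column_pair('your-first-file', 'your-second-file', 6)` would then give the 7th column of each file without loading the rest of the larger one.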

So, your file isn't comma separated, which actually makes this a bit easier. We go through the first file and take the 7th item in each row after splitting the row on whitespace (the tabs/spaces that separate the items in your data). Then we do the same thing for the next file, but once we get past the 932nd line we break out of the loop and finish.

I'd do it something like this:

file1_values = []
file2_values = []

with open('file1') as f1:
    for line in f1:
        seventh_column = line.split()[6]
        file1_values.append(seventh_column)

with open('file2') as f2:
    for i, line in enumerate(f2):
        if i >= 932:  # i is 0-based, so stop once 932 lines have been read
            break
        seventh_column = line.split()[6]
        file2_values.append(seventh_column)

Then you have the values you're interested in placed into two lists of (hopefully) equal length, and can go from there with whatever comparisons or graphing you'd like to do.
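
The lists built this way hold strings, so they need converting before any numeric comparison or plotting. A minimal sketch, assuming the two lists are aligned row by row; `numeric_pairs` is a hypothetical helper, and dropping a row when either value is not numeric is a design choice borrowed from the `try`/`except ValueError` pattern in the question:

```python
def numeric_pairs(values1, values2):
    """Pair up two lists of strings as floats, skipping rows that fail to parse."""
    pairs = []
    # zip stops at the shorter list, so unequal lengths are handled implicitly
    for a, b in zip(values1, values2):
        try:
            pairs.append((float(a), float(b)))
        except ValueError:
            # Skip the whole row so the two columns stay aligned
            continue
    return pairs

# Example with the first rows of the sample data shown in the question
pairs = numeric_pairs(['12', '56', '48'], ['15', '12', '12'])
differences = [a - b for a, b in pairs]
```

Converting both values inside one `try` keeps the columns aligned; converting each list separately could shift one list relative to the other if a row is skipped. For plotting, `xs, ys = zip(*pairs)` yields two sequences that can be passed to `plt.plot`, provided `pairs` is not empty.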

EDIT: added the delimiter option and a clarification about the function definition.

If you just want to keep one column and stop reading after a given number of lines, simply append the values to a list in your loop and break once that count is reached. But if your file uses anything other than a comma (,) as the delimiter, you have to specify it. And do not repeat the function definition: one def is enough. So your reader function could look like this:

import csv

def read_column(file_name, x, y):
    cells = []
    with open(file_name, 'r') as f:
        reader = csv.reader(f, delimiter="\t")
        y_count = 0
        for n in reader:
            y_count += 1
            if y_count > y:  # stop after the first y rows
                break
            cells.append(n[x])  # keep only column x (0-based)
    return cells

That way the function returns a list containing column x from the first y lines.
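
To illustrate the point about defining the function once, here is a sketch in which a single reader is reused for both files. It restates `read_column` in compact form with an optional `delimiter` parameter; `read_column_from_files` is a hypothetical helper added for illustration only:

```python
import csv

def read_column(file_name, x, y, delimiter="\t"):
    """Return column x (0-based) from the first y rows of file_name."""
    cells = []
    with open(file_name, newline="") as f:
        for row_number, row in enumerate(csv.reader(f, delimiter=delimiter)):
            if row_number >= y:  # row_number is 0-based, so this keeps exactly y rows
                break
            cells.append(row[x])
    return cells

def read_column_from_files(file_names, x, y, delimiter="\t"):
    """Hypothetical helper: apply the same reader to every file name."""
    return {name: read_column(name, x, y, delimiter) for name in file_names}

# Intended use with the two files from the question (not executed here):
# columns = read_column_from_files(
#     ['Illumina_Heart_Gencode_Paired_End_Novel_Junctions.csv',
#      'Illumina_Heart_RefSeq_Paired_End_Novel_Junctions.csv'], 6, 932)
```

Passing the file name as an argument removes the need for two near-identical `read_cell` definitions, where the second silently replaces the first.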
