
Python: General CSV file parsing and manipulation

The purpose of my Python script is to compare the data in multiple CSV files and look for discrepancies. The data are ordered, but the ordering differs between files. Each file contains about 70K lines and weighs around 15 MB. Nothing fancy or hardcore here. Here's part of the code:

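import csv

allRows = []

def getCSV(fpath):
    # Python 2 wants binary mode here; on Python 3 use open(fpath, newline="")
    with open(fpath, "rb") as f:
        csvfile = csv.reader(f)

        for row in csvfile:
            allRows.append(row)

# Transpose the rows into columns (wrap in list() on Python 3)
allCols = map(list, zip(*allRows))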
  • Am I reading my CSV files properly? I'm using csv.reader, but would I benefit from using csv.DictReader?
  • How can I create a list containing the whole rows that have a certain value in a specific column?

This should work; you don't need to build another list to access the columns.

import csv

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)

        rows = list(csvfile)

    # Keep only the rows whose column at index 20 equals 'value'
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20
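
To illustrate the point about not needing a transposed copy: a single column can be pulled out of the row list on demand. This is a minimal sketch, not the poster's code; the sample data and the helper name `column` are made up for the example.

```python
import csv
import io

# Hypothetical sample data standing in for a real file.
SAMPLE = "id,name,status\n1,alpha,ok\n2,beta,bad\n3,gamma,ok\n"

def column(rows, index):
    """Return the values found at position `index` in every row."""
    return [row[index] for row in rows]

rows = list(csv.reader(io.StringIO(SAMPLE)))
header, data = rows[0], rows[1:]

names = column(data, 1)                            # one column, built on demand
ok_rows = [row for row in data if row[2] == "ok"]  # whole rows, filtered by a column
```

Building a column only when it is needed avoids holding a second full copy of a 70K-line file in memory.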

Are you sure you want to keep all the rows around? The following builds a list containing only the matching rows. fname could also come from glob.glob(), os.listdir(), or whatever other data source you choose. Note that you mention the 20th column, but row[20] is actually the 21st column, since indexing starts at 0.

import csv

matching20 = []

for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin) # <--- if you want to skip header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row) # or do something with it here
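
The same loop can be fed from glob.glob() and written as a generator, so that only matching rows are ever held in memory. Treat this as a sketch under assumptions: the function name `matching_rows`, the `*.csv` pattern and the throwaway files are invented for illustration, and the column is passed as a 1-based number to make the off-by-one explicit.

```python
import csv
import glob
import os
import tempfile

def matching_rows(pattern, column_number, wanted, skip_header=True):
    """Yield rows from every file matching `pattern` whose
    `column_number`-th column (1-based) equals `wanted`."""
    index = column_number - 1              # the 20th column is row[19]
    for fname in sorted(glob.glob(pattern)):
        with open(fname, newline="") as fin:
            reader = csv.reader(fin)
            if skip_header:
                next(reader, None)         # tolerate empty files
            for row in reader:
                # Skip rows too short to have the requested column.
                if len(row) > index and row[index] == wanted:
                    yield row

# Demo with two small throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in (("a.csv", "k,v\n1,x\n2,y\n"), ("b.csv", "k,v\n3,x\n")):
        with open(os.path.join(tmp, name), "w", newline="") as fout:
            fout.write(body)
    found = list(matching_rows(os.path.join(tmp, "*.csv"), 2, "x"))
```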

You only want csv.DictReader if you have a header row and want to access your columns by name.
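
As a rough illustration of access by name, here is a small sketch using csv.DictReader on in-memory data. The column names (`id`, `name`, `status`) and the sample rows are assumptions made up for the example.

```python
import csv
import io

# Hypothetical data; the column names are invented for the example.
SAMPLE = "id,name,status\n1,alpha,ok\n2,beta,bad\n3,gamma,ok\n"

reader = csv.DictReader(io.StringIO(SAMPLE))   # the first row supplies the keys
bad_rows = [row for row in reader if row["status"] == "bad"]
bad_names = [row["name"] for row in bad_rows]
```

Each row is a mapping from header name to value, so the filter no longer depends on the position of the column.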

If I understand the question correctly, you want to include a row if value appears anywhere in it, but you don't know which column it is in, correct?

If your rows are lists, then this should work:

testlist = [row for row in allRows if 'value' in row]

Edit:

If, as you say, you want a list of the rows where value is in a specified column (given by an integer pos), then:

pos = 20
# Keep only the rows whose element at index pos equals 'value'
testlist = [row for row in allRows if row[pos] == 'value']

(I haven't tested this, but let me know if that works.)
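
For the stated goal of comparing files whose rows appear in different orders, one option is to count rows rather than compare them line by line. The following is only a sketch: `row_counts` and `diff_rows` are hypothetical helper names, the sample data is made up, and it assumes two rows count as the same only when every field matches exactly.

```python
import csv
import io
from collections import Counter

def row_counts(fileobj):
    """Count each distinct row; tuples are hashable, lists are not."""
    return Counter(tuple(row) for row in csv.reader(fileobj))

def diff_rows(left, right):
    """Return (rows only in left, rows only in right), ignoring order."""
    a, b = row_counts(left), row_counts(right)
    return sorted((a - b).elements()), sorted((b - a).elements())

# In-memory stand-ins for two files with the same rows in a different order,
# except for one field that differs.
first = io.StringIO("1,alpha\n2,beta\n3,gamma\n")
second = io.StringIO("3,gamma\n1,alpha\n2,BETA\n")
only_first, only_second = diff_rows(first, second)
```

Using a Counter instead of a set keeps duplicate rows visible, so a row that appears twice in one file and once in the other still shows up as a discrepancy.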

