Skip the first column when reading a CSV file in Python

I'm trying to read a CSV file and extract the required data from it. My code looks like this:

import csv
file = "sample.csv"
def get_values_flexibly(file, keyword):
    def process(func):
        return set([func(cell)] + [func(row[index]) for row in reader])

    with open(file, 'r') as f:
        reader = csv.reader(f)
        first_row = reader.next()
        if keyword in first_row:
            return str(list(set([row[first_row.index(keyword)] for row in reader])))
        for index, cell in enumerate(reader.next()):
            if cell.endswith(' ' + keyword):
                return str(list(set(process(lambda cell: cell[:-len(keyword) - 1]))))
            elif cell.split(':')[0].strip() == keyword:
                return str(list(set(process(lambda cell: cell.split(':')[1].strip()))))
print get_values_flexibly(file, 'data')

where sample.csv looks something like this:

h1,h2,h3
a data,data: abc,tr
b data,vf data, gh
k data,grt data, ph

I'd like to exclude the first column from the output. My current output is ['a','k','b'], but I'd like it to be ['abc', 'vf', 'grt'] instead. How can I achieve this using the csv reader?

EDIT: I have multiple files. Each file could have different headers, and the number of columns varies too. I'd like a script that works for all the files. Also, the header of the first column is always the same, "sample_column" for instance. I'd like to skip the data from the column with the header "sample_column".

You could use csv.DictReader:

from csv import DictReader

h = ['h1', 'h2', 'h3']
data = {name: [] for name in h}

with open("sample.csv") as csvfile:
    reader = DictReader(csvfile)  # the header row supplies the keys
    for line in reader:
        data['h1'].append(line[h[0]][2:])
        data['h2'].append(line[h[1]][2:])  # Use slicing to get the bits you want
        data['h3'].append(line[h[2]])
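
If the column to skip is known by its header, as in the question's edit, the same DictReader idea can be made independent of the other header names. The following is only a minimal sketch, assuming Python 3; the helper name `rows_without_column` and the in-memory sample (used here instead of a real file) are made up for illustration:

```python
import csv
import io

def rows_without_column(f, skip_header):
    """Yield each row as a dict, leaving out the column named skip_header."""
    reader = csv.DictReader(f)  # the first row is used as the header
    for row in reader:
        yield {name: value for name, value in row.items() if name != skip_header}

# In-memory stand-in for a real file; the header names are illustrative.
sample = io.StringIO(
    "sample_column,h2,h3\n"
    "a data,data: abc,tr\n"
    "b data,vf data, gh\n"
)
rows = list(rows_without_column(sample, "sample_column"))
print(rows[0])  # {'h2': 'data: abc', 'h3': 'tr'}
```

Because the rows are keyed by header, this should work no matter how many other columns a file has or what they are called.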

OK, so removing the data (or whatever the keyword is) could be done with a regular expression (which is not really within the scope of the question, but meh...).

About the regular expression:

Let's imagine your keyword is data, right? You can use this: (?:data)*\W*(?P<juicy_data>\w+)\W*(?:data)* If your keyword were something else, you would just change the two data strings in that regular expression to whatever value the keyword contains.

You can test regular expressions online at www.pythonregex.com or www.debuggex.com

The regular expression is basically saying: look for zero or more data strings but (if you find any) don't do anything with them. Don't add them to the list of matched groups, don't show them... nothing, just match them and discard them. After that, look for zero or more non-word characters (anything that is not a letter, a digit or an underscore), just in case there's a data: or a space after the comma, or a data-->. That \W* removes all the non-alphanumeric characters that come after data. Then you get to your juicy_data: one or more characters that can be found in "regular" words (any alphanumeric character). Finally, just in case there's a data after it, do the same as was done with the first data group: match it and discard it.
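
To see what the expression captures, here is a small sketch that runs it over a few cells from sample.csv. The helper name `clean` is made up for illustration, and the call to `re.escape` is an addition (not part of the original expression) so that a keyword containing regex metacharacters would still be matched literally:

```python
import re

keyword = "data"  # the keyword from the question
# re.escape is an addition: it makes the keyword match literally.
pattern = re.compile(
    r"(?:{kw})*\W*(?P<juicy_data>\w+)\W*(?:{kw})*".format(kw=re.escape(keyword))
)

def clean(cell):
    """Return the cell without the keyword and surrounding punctuation."""
    match = pattern.search(cell)
    # Fall back to the untouched cell when there is nothing to capture.
    return match.group("juicy_data") if match else cell

cells = ["a data", "data: abc", "vf data", " gh"]
cleaned = [clean(c) for c in cells]
print(cleaned)  # ['a', 'abc', 'vf', 'gh']
```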

Now, to remove the first column: you can use the fact that a csv.reader is itself an iterator. When you iterate over it (as the code below does), each step gives you a list containing all the columns found in one row. Getting every row as a list of its columns is very useful for your case: you just have to collect the second item of said row (row[1]), since that's the column you care about (you don't need row[0], nor row[2:]).

So here it goes:

import csv
import re

def get_values_flexibly(csv_path, keyword):
    # Captures a cell's value without the keyword or surrounding punctuation
    kwd_remover = re.compile(
        r'(?:{kw})*\W*(?P<juicy_data>\w+)\W*(?:{kw})*'.format(kw=re.escape(keyword))
    )
    result = []
    with open(csv_path, 'r') as f:
        reader = csv.reader(f)
        # next(reader) works in both Python 2.6+ and Python 3
        first_row = [kwd_remover.findall(cell)[0] for cell in next(reader)]
        print("Cleaned first_row: %s" % first_row)
        for row in reader:
            print("Before cleaning: %s" % row)
            cleaned_row = [kwd_remover.findall(cell)[0] for cell in row]
            result.append(cleaned_row[1])  # index 1 skips the first column
            print("After cleaning: %s" % cleaned_row)
    return result

print("Result: %s" % get_values_flexibly("sample.csv", 'data'))

Outputs:

Cleaned first_row: ['h1', 'h2', 'h3']
Before cleaning: ['a data', 'data: abc', 'tr']
After cleaning: ['a', 'abc', 'tr']
Before cleaning: ['b data', 'vf data', ' gh']
After cleaning: ['b', 'vf', 'gh']
Before cleaning: ['k data', 'grt data', ' ph']
After cleaning: ['k', 'grt', 'ph']
Result: ['abc', 'vf', 'grt']
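
For the multi-file case in the question's edit, where the column to ignore is identified by its header rather than its position, the approach could be adapted roughly as follows. This is a sketch under assumptions, not a definitive implementation: it assumes Python 3, the function name `get_values_skipping` is made up, and it interprets the goal as "take the first keyword-tagged cell of each row outside the skipped column". An in-memory sample stands in for a real file:

```python
import csv
import io
import re

def get_values_skipping(f, keyword, skip_header):
    """Collect one keyword-tagged value per row, ignoring the skip_header column."""
    kwd_remover = re.compile(
        r"(?:{kw})*\W*(?P<juicy_data>\w+)\W*(?:{kw})*".format(kw=re.escape(keyword))
    )
    reader = csv.reader(f)
    header = next(reader)
    # Index of the column to ignore, or None if this file does not have it.
    skip_index = header.index(skip_header) if skip_header in header else None
    result = []
    for row in reader:
        for index, cell in enumerate(row):
            if index == skip_index or keyword not in cell:
                continue
            match = kwd_remover.search(cell)
            if match:
                result.append(match.group("juicy_data"))
                break  # assumption: keep only the first matching cell per row
    return result

sample = io.StringIO(
    "sample_column,h2,h3\n"
    "a data,data: abc,tr\n"
    "b data,vf data, gh\n"
    "k data,grt data, ph\n"
)
values = get_values_skipping(sample, "data", "sample_column")
print(values)  # ['abc', 'vf', 'grt']
```

Looking the index up in the header row means the same function can be run over files whose column order and count differ, as long as the header to skip is spelled the same way.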
