Skip the first column when reading a csv file Python
I'm trying to read a csv file and extract the required data from it. My code looks like this:
import csv

file = "sample.csv"

def get_values_flexibly(file, keyword):
    def process(func):
        return set([func(cell)] + [func(row[index]) for row in reader])

    with open(file, 'r') as f:
        reader = csv.reader(f)
        first_row = reader.next()
        if keyword in first_row:
            return str(list(set([row[first_row.index(keyword)] for row in reader])))
        for index, cell in enumerate(reader.next()):
            if cell.endswith(' ' + keyword):
                return str(list(set(process(lambda cell: cell[:-len(keyword) - 1]))))
            elif cell.split(':')[0].strip() == keyword:
                return str(list(set(process(lambda cell: cell.split(':')[1].strip()))))

print get_values_flexibly(file, 'data')
where sample.csv looks something like this:
h1,h2,h3
a data,data: abc,tr
b data,vf data, gh
k data,grt data, ph
I'd like to exclude the first column from the output. My current output is ['a','k','b'] but I'd like it to be ['abc', 'vf', 'grt'] instead. How can I achieve this using csv reader?
EDIT- I have multiple files. Each file can have different headers, and the number of columns varies too. I'd like to have a script that works for all the files. Also, the header of the first column is always the same, "sample_column" for instance. I'd like to skip the data from the column with header "sample_column".
You could use the dict reader:
from csv import DictReader

data = {'h1': [], 'h2': [], 'h3': []}
h = ['h1', 'h2', 'h3']
with open("sample.csv") as csvfile:
    reader = DictReader(csvfile)
    for line in reader:
        data['h1'].append(line[h[0]][2:])
        data['h2'].append(line[h[1]][2:])  # Use indexing to get the bits you want
        data['h3'].append(line[h[2]])
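To connect the dict reader idea to the EDIT in the question, here is a minimal Python 3 sketch that reads the header names from the file instead of hard-coding them, and leaves out the column named "sample_column". The function name columns_except, the skip_header parameter, and the in-memory SAMPLE text are hypothetical, introduced only for illustration.

```python
import csv
import io

# Hypothetical stand-in for one of the files; "sample_column" is the
# header the question says the first column always has.
SAMPLE = """sample_column,h2,h3
a data,data: abc,tr
b data,vf data, gh
k data,grt data, ph
"""

def columns_except(csv_file, skip_header="sample_column"):
    """Collect every column into a dict of lists, leaving out skip_header."""
    reader = csv.DictReader(csv_file)
    # fieldnames is read from the first row, so no header list is hard-coded.
    columns = {name: [] for name in reader.fieldnames if name != skip_header}
    for line in reader:
        for name in columns:
            columns[name].append(line[name].strip())
    return columns

data = columns_except(io.StringIO(SAMPLE))
print(data["h2"])
```

Because the headers come from reader.fieldnames, the same function should work for files with different headers and column counts, as long as the column to skip carries the expected header.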
Ok, so removing the data (or whatever the keyword is) could be done with a regular expression (which is not really in the scope of the question but meh...)
About the regular expression:
Let's imagine your keyword is data, right? You can use this: (?:data)*\W*(?P<juicy_data>\w+)\W*(?:data)*
If your keyword is something else, you can just change the two data strings in that regular expression to whatever value keyword contains.
You can test regular expressions online at www.pythonregex.com or www.debuggex.com
The regular expression is basically saying: look for zero or more data strings but (if you find any) don't do anything with them. Don't add them to the list of matched groups, don't show them... nothing, just match them and discard them.
After that, look for zero or more non-word characters (anything that is not a letter, a digit or an underscore), just in case there's a data: or a space, or a data--> ... That \W* removes all the non-alphanumeric characters that come after data.
Then you get to your juicy_data. That is one or more characters that can be found in "regular" words (any alphanumeric character).
Then, just in case there's a data behind it, do the same as was done with the first data group: just match it and remove it.
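As a quick check of the pattern described above, here is a small Python 3 sketch that applies it to a few cells from sample.csv. The helper name strip_keyword and the use of re.escape are additions for illustration, not part of the original answer.

```python
import re

keyword = "data"  # assumed keyword, as in the question

# re.escape is an addition: it keeps keywords containing regex
# metacharacters (such as '.' or '+') literal.
kwd_remover = re.compile(
    r'(?:{kw})*\W*(?P<juicy_data>\w+)\W*(?:{kw})*'.format(kw=re.escape(keyword))
)

def strip_keyword(cell):
    """Return the first word-like chunk of cell with the keyword removed."""
    match = kwd_remover.search(cell)
    return match.group('juicy_data') if match else None

cells = ['a data', 'data: abc', 'vf data', ' gh']
cleaned = [strip_keyword(c) for c in cells]
print(cleaned)  # ['a', 'abc', 'vf', 'gh']
```

One caveat of this pattern: a value that itself starts with the keyword loses that prefix, so strip_keyword('database') gives 'base'. If that matters for your data, a stricter pattern would be needed.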
Now, to remove the first column: you can use the fact that a csv.reader is itself an iterator. When you iterate over it (as the code below does), it gives you a list containing all the columns found in one row. The fact that it gives you a list of all the columns in a row is very useful for your case: you just have to collect the second item of said row (row[1]), since that's the column you care about (you don't need row[0], nor row[2:]).
So here it goes:
import csv
import re


def get_values_flexibly(csv_path, keyword):
    # Strip the keyword (and any surrounding non-word characters) from a cell.
    # re.escape keeps keywords containing regex metacharacters literal.
    kwd_remover = re.compile(
        r'(?:{kw})*\W*(?P<juicy_data>\w+)\W*(?:{kw})*'.format(kw=re.escape(keyword))
    )
    result = []
    with open(csv_path, 'r') as f:
        reader = csv.reader(f)
        # next(reader) consumes the header row (works on Python 2.6+ and 3).
        first_row = [kwd_remover.findall(cell)[0] for cell in next(reader)]
        print("Cleaned first_row: %s" % first_row)
        for row in reader:
            print("Before cleaning: %s" % row)
            cleaned_row = [kwd_remover.findall(cell)[0] for cell in row]
            result.append(cleaned_row[1])  # keep only the second column
            print("After cleaning: %s" % cleaned_row)
    return result


print("Result: %s" % get_values_flexibly("sample.csv", 'data'))
Outputs:
Cleaned first_row: ['h1', 'h2', 'h3']
Before cleaning: ['a data', 'data: abc', 'tr']
After cleaning: ['a', 'abc', 'tr']
Before cleaning: ['b data', 'vf data', ' gh']
After cleaning: ['b', 'vf', 'gh']
Before cleaning: ['k data', 'grt data', ' ph']
After cleaning: ['k', 'grt', 'ph']
Result: ['abc', 'vf', 'grt']
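For the EDIT in the question (several files, always skipping the column whose header is "sample_column"), the same iterator behaviour of csv.reader can be used to drop a column by header name rather than by a fixed position. This is only a sketch in Python 3; rows_without_column, skip_header and the in-memory SAMPLE text are hypothetical names used for illustration.

```python
import csv
import io

# Hypothetical stand-in for one of the files.
SAMPLE = """sample_column,h2,h3
a data,data: abc,tr
b data,vf data, gh
k data,grt data, ph
"""

def rows_without_column(csv_file, skip_header="sample_column"):
    """Yield each data row as a list with the skip_header column removed."""
    reader = csv.reader(csv_file)
    header = next(reader)             # the reader is an iterator: this consumes the header row
    skip = header.index(skip_header)  # raises ValueError if the header is missing
    for row in reader:
        yield row[:skip] + row[skip + 1:]

rows = list(rows_without_column(io.StringIO(SAMPLE)))
print(rows)
```

Each yielded row keeps every column except the skipped one, so the keyword cleaning shown earlier could then be applied to whichever remaining cells you need.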