Python - walk through directories to process csv files and save them
I want to walk through a list of csv files inside folders, perform the same calculation on each file, and save a new file for each one.
The files have data structured like this:
"[Couplet 10 : Jul]
C'est 1.3.5 sur la plaque
Fais ton biz coupe ta plaque
C'est JU, JU , JUL qui débarque
Pour mes blancs , beurres et blacks
Passe moi un stunt pour voir si sa cabre
Embrouilles sur le sable , cocotiers sur la sappe
Je dors pas je suis tout pâle, je dis pas que je suis 2Pac
Je dis pas lui je vais le tuer si j'ai même pas 2 balles
C'est pour ceux qui XXX fais gaffe les shmits l'impact
Son anti B.D.H anti tapette",1
(...)
So far I have:
match = "^[\(\[].*?[\)\]]"
for d in directories:
    dir = os.path.join(data_dir, d)
    files_ = [os.path.join(dir, f)
              for f in os.listdir(dir)
              if f.endswith(".csv")]
    for f in files_:
        with open(f, 'rb') as f1, open('out.csv', 'wb') as out_file:
            reader = csv.reader(f1, delimiter='\t')
            for item in list(reader):
                item = re.sub(match, ' ', item, flags=re.MULTILINE)
                out_file.write(item)
but I get this traceback:
  File "process_csv.py", line 75, in load_data
    item = re.sub(match, ' ', item, flags=re.MULTILINE)
  File "/Users/username/anaconda/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
What is the best way of achieving this?
According to the re docs, re.sub expects its third argument to be a string. But list(reader) returns a list of lists of CSV fields, not a list of strings. So you need to extract the string from each inner list and pass that to re.sub:
item = re.sub(match, ' ', item[0], flags=re.MULTILINE)
or whatever index you need to use in the calculations.
To understand it better, try:
test.csv:
a
b
c
>>> f = open('test.csv')
>>> reader = csv.reader(f)
>>> list(reader)
[['a'], ['b'], ['c']]
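The same point can be checked against a row shaped like the real data. This is a minimal sketch assuming Python 3; the sample row and the names sample, rows, error and cleaned are made up for illustration. It shows that passing the whole row to re.sub raises TypeError, while passing the first field works:

```python
import csv
import io
import re

match = r"^[\(\[].*?[\)\]]"

# In-memory stand-in for a CSV file holding one two-column row.
sample = io.StringIO('"[Intro] first line",1\n')
rows = list(csv.reader(sample))  # each row is a list of field strings

row = rows[0]
error = None
try:
    re.sub(match, ' ', row)      # a list is not a string: TypeError
except TypeError as exc:
    error = exc

# Passing a single field (a string) is what re.sub expects.
cleaned = re.sub(match, ' ', row[0], flags=re.MULTILINE)
print(repr(cleaned))
```

The bracketed tag at the start of the field is replaced by a single space, so the cleaned field begins with two spaces.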
UPDATE
To make it work on the real data example:
Use " as the quote character (the default), or change the regex if the quotes matter for processing.
Pass newline='' when opening the files. In Python 2 the built-in open doesn't accept a newline argument; use the io package instead, whose io.open has broadly the same signature. Note that newline is only accepted in text mode, so the mode must be 'r' or 'w' rather than 'rb' or 'wb'.
Explanation from the csv module documentation:
If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
with open(f, 'r', newline='') as f1, open('out.csv', 'w', newline='') as out_file:
    ...
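To see what newline='' buys you, here is a small round trip with a field that contains embedded newlines, as the lyrics in the question do. It is only a sketch assuming Python 3; the file name demo.csv, the sample row, and the utf-8 encoding are assumptions made for the example:

```python
import csv
import os
import tempfile

# One row whose first field spans several lines, like the data in the question.
rows = [["[Verse 1]\nline one\nline two", "1"]]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.csv")

# Text mode plus newline='' lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as fh:
    csv.writer(fh).writerows(rows)

with open(path, "r", newline="", encoding="utf-8") as fh:
    read_back = list(csv.reader(fh))

print(read_back == rows)
```

The multi-line field comes back as a single field in a single row, because the reader treats the quoted newlines as data rather than as row separators.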
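Finally, corrected code:
import csv
import io
import os
import re
...
match = r"^[\(\[].*?[\)\]]"
for d in directories:
    dir = os.path.join(data_dir, d)
    files_ = [os.path.join(dir, f)
              for f in os.listdir(dir)
              if f.endswith(".csv")]
    for f in files_:
        # newline='' is only valid in text mode; 'rb'/'wb' would raise ValueError.
        # (Python 2's csv module instead expects plain open(f, 'rb') / open(..., 'wb').)
        # Note: every input still writes to the same out.csv, as in the question.
        with io.open(f, 'r', newline='') as f1, io.open('out.csv', 'w', newline='') as out_file:
            reader = csv.reader(f1)  # default delimiter ',' and quotechar '"'
            writer = csv.writer(out_file)
            for item in reader:
                # Strip the bracketed tag from the text column, keep the label column.
                writer.writerow([
                    re.sub(match, ' ', item[0], flags=re.MULTILINE),
                    item[1]
                ])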
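Since the question asks for a new file per input, and the loop above sends every result to the same out.csv, one possible variation is sketched below. It assumes Python 3 and utf-8 files; process_tree, BRACKET_TAG and the "_clean" suffix are hypothetical names chosen for this example, and os.walk is swapped in for the explicit directories list so nested folders are covered too:

```python
import csv
import os
import re

# Same pattern as above, precompiled: a (...) or [...] tag at the start of a line.
BRACKET_TAG = re.compile(r"^[\(\[].*?[\)\]]", flags=re.MULTILINE)


def process_tree(data_dir, suffix="_clean"):
    """Clean every .csv under data_dir, writing <name><suffix>.csv beside it."""
    written = []
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            # Skip non-csv files and outputs left over from an earlier run.
            if ext != ".csv" or stem.endswith(suffix):
                continue
            src = os.path.join(root, name)
            dst = os.path.join(root, stem + suffix + ext)
            with open(src, "r", newline="", encoding="utf-8") as f_in, \
                 open(dst, "w", newline="", encoding="utf-8") as f_out:
                writer = csv.writer(f_out)
                for row in csv.reader(f_in):
                    if not row:  # ignore blank lines
                        continue
                    # Clean the text column, pass the remaining columns through.
                    writer.writerow([BRACKET_TAG.sub(" ", row[0])] + row[1:])
            written.append(dst)
    return written
```

Calling process_tree(data_dir) returns the list of files it wrote, so the caller can check that each input produced exactly one output.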