Python - walk through directories to process csv files and save them

I want to walk through a list of csv files inside folders, perform some calculation (always the same) on each file, and save a new file for each one.

The files have data structured in this manner:

"[Couplet 10 : Jul]
C'est 1.3.5 sur la plaque
Fais ton biz coupe ta plaque
C'est JU, JU , JUL qui débarque
Pour mes blancs , beurres et blacks
Passe moi un stunt pour voir si sa cabre
Embrouilles sur le sable , cocotiers sur la sappe
Je dors pas je suis tout pâle, je dis pas que je suis 2Pac
Je dis pas lui je vais le tuer si j'ai même pas 2 balles
C'est pour ceux qui XXX fais gaffe les shmits l'impact
Son anti B.D.H anti tapette",1

(...)

So far I have:

match = "^[\(\[].*?[\)\]]"
for d in directories:
    dir = os.path.join(data_dir, d)
    files_ = [os.path.join(dir, f)
              for f in os.listdir(dir)
              if f.endswith(".csv")]
    for f in files_:
        with open(f, 'rb') as f1, open('out.csv', 'wb') as out_file:
            reader = csv.reader(f1, delimiter='\t')
            for item in list(reader):
                item = re.sub(match, ' ', item, flags=re.MULTILINE)
                out_file.write(item)

but I get this traceback:

File "process_csv.py", line 75, in load_data
    item = re.sub(match, ' ', item, flags=re.MULTILINE)      
  File "/Users/username/anaconda/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

What is the best way of achieving this?

According to the re docs, re.sub expects its third argument to be a string. But list(reader) returns a list of lists of CSV fields, not a list of strings. So you need to extract a string from each of these lists and pass it to re.sub:

item = re.sub(match, ' ', item[0], flags=re.MULTILINE)

or whatever index you need to use in the calculations.

To understand it better, try:

test.csv: 
a 
b 
c

>>> import csv
>>> f = open('test.csv')
>>> reader = csv.reader(f)
>>> list(reader)
[['a'], ['b'], ['c']]
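
The same behaviour can be checked without a file on disk. The sketch below is for Python 3 and uses an in-memory io.StringIO with made-up rows in place of test.csv (both are assumptions for illustration). It shows that each row is a list, that passing the row itself to re.sub raises TypeError, and that passing a single field works.

```python
import csv
import io
import re

match = r"^[\(\[].*?[\)\]]"

# In-memory stand-in for a CSV file (illustrative data, not from the question).
sample = io.StringIO("[intro] hello,1\n(bridge) world,0\n", newline='')
rows = list(csv.reader(sample))

# Each row is a list of fields, not a string.
first_row = rows[0]

# Passing the whole row reproduces the error from the traceback.
try:
    re.sub(match, ' ', first_row)
    raised = False
except TypeError:
    raised = True

# Passing a single field (a string) works.
cleaned = [re.sub(match, ' ', row[0], flags=re.MULTILINE) for row in rows]
```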

UPDATE

To make it work on the real data example:

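  1. Use the default delimiter (a comma) rather than '\t', so that the quoted text and the trailing label are split into separate fields using the default quote character ", or change the regex if the quotes are important for processing.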
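  2. Specify the newline argument as '' when opening the files. In Python 2 the built-in open doesn't accept a newline argument; use io.open from the io package instead, which in general has the same signature. The newline argument is only accepted in text mode, so drop the 'b' from the mode. (Python 2's csv module does not support Unicode input, so for non-ASCII data there the documented alternative is plain open with 'rb' / 'wb'.) Explanation from the csv module documentation: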

If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.

with open(f, 'r', newline='') as f1, open('out.csv', 'w', newline='') as out_file:
    ...
  3. It seems that the substitution is required for the 1st column only, so pass item[0] to re.sub.
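
The points above can be tried together on a small record. This is a minimal Python 3 sketch; the shortened, made-up record (shaped like the sample in the question: a quoted multi-line first column followed by a label) and the in-memory io.StringIO are assumptions for illustration.

```python
import csv
import io
import re

match = r"^[\(\[].*?[\)\]]"

# Shortened, made-up record: quoted multi-line text, then a label.
raw = '"[Verse 1 : X]\nfirst line\nsecond line",1\n'

# newline='' leaves the embedded newlines untouched for the csv module.
# csv.reader uses its defaults here: delimiter ',' and quotechar '"'.
reader = csv.reader(io.StringIO(raw, newline=''))
rows = list(reader)

# The three physical lines form a single row with two fields.
text, label = rows[0]

# Only the first column goes through the substitution.
cleaned = re.sub(match, ' ', text, flags=re.MULTILINE)
```

With the tab delimiter from the question, the text and the label would not be split into two separate fields, which is why the default delimiter is used here.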

Finally, the corrected code:

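import csv
import io
import os
import re

...

# Raw string, so the backslashes reach the regex engine unchanged.
match = r"^[\(\[].*?[\)\]]"
for d in directories:
    dir = os.path.join(data_dir, d)
    files_ = [os.path.join(dir, f)
              for f in os.listdir(dir)
              if f.endswith(".csv")]
    for f in files_:
        # Text mode with newline=''; binary mode rejects the newline argument.
        # NOTE: 'out.csv' is reopened, and so overwritten, for every input file.
        with io.open(f, 'r', newline='') as f1, io.open('out.csv', 'w', newline='') as out_file:
            reader = csv.reader(f1)
            writer = csv.writer(out_file)
            for item in reader:
                # Clean the first column only and keep the second as it is.
                writer.writerow([
                    re.sub(match, ' ', item[0], flags=re.MULTILINE),
                    item[1]
                ])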
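
Since the goal is a new file for each input, one possible variant writes every result under a separate output directory instead of a single out.csv. This is only a sketch for Python 3: the function names process_file and process_tree, the use of os.walk, and the mirrored directory layout are assumptions, not part of the original answer. It also assumes dst_root lies outside src_root, so that the outputs are not picked up as inputs.

```python
import csv
import os
import re

MATCH = re.compile(r"^[\(\[].*?[\)\]]", flags=re.MULTILINE)


def process_file(src_path, dst_path):
    """Copy a CSV, replacing a leading (...) or [...] tag in column 0 with a space."""
    with open(src_path, 'r', newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row:  # skip completely empty lines
                writer.writerow([MATCH.sub(' ', row[0])] + row[1:])


def process_tree(src_root, dst_root):
    """Walk src_root and write one processed CSV per input under dst_root."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Mirror the input folder layout below dst_root (hypothetical choice).
        rel = os.path.relpath(dirpath, src_root)
        out_dir = os.path.normpath(os.path.join(dst_root, rel))
        for name in filenames:
            if name.endswith('.csv'):
                os.makedirs(out_dir, exist_ok=True)
                dst_path = os.path.join(out_dir, name)
                process_file(os.path.join(dirpath, name), dst_path)
                written.append(dst_path)
    return written
```

A call such as process_tree(data_dir, 'processed') would then leave the inputs untouched and produce one cleaned file per input; 'processed' is a placeholder name.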
