[英]How to remove a specific string common in multiple lines in a CSV file using python script?
我有一个csv文件,其中包含65000行(大小约为28 MB)。 在每一行中,都以开头指定路径,例如"c:\\abc\\bcd\\def\\123\\456"
。 现在,假设路径"c:\\abc\\bcd\\"
在所有行中都是通用的,其余内容则有所不同。 我必须使用python脚本从所有行中删除公共部分(在这种情况下为"c:\\abc\\bcd\\"
)。 例如,CSV文件的内容如前所述。
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert 87 116 69
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0
在上面的示例中,我需要以下输出
FILE0.frag 0 0 0
FILE0.vert 0 0 0
FILE0.link-link-0.frag 17 25 2
FILE0.link-link-0.vert 85 111 68
FILE0.link-link-0.vert.bin 77 97 60
FILE0.link-link-0 0 0
FILE0.link 0 0 0
谁能帮我这个忙吗?
那怎么样
import csv
with open("file.csv", 'rb') as f:
sl = []
csvread = csv.reader(f, delimiter=' ')
for line in csvread:
sl.append(line.replace("C:/Abc/Def/Test/temp\.\test\GLNext\", ""))
要将清单sl
写出以filenew
使用,
with open('filenew.csv', 'wb') as f:
csvwrite = csv.writer(f, delimiter=' ')
for line in sl:
csvwrite.writerow(line)
^\S+/
您只需在每行上使用此正则表达式并替换为empty string
参见演示。
https://regex101.com/r/cK4iV0/17
import re
p = re.compile(ur'^\S+/', re.MULTILINE)
test_str = u"C:/Abc/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert 87 116 69\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert.bin 75 95 61\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0 "
subst = u" "
result = re.sub(p, subst, test_str)
您可以自动检测公用前缀,而无需对其进行硬编码。 您实际上不需要regex
。 可以使用os.path.commonprefix
代替:
import csv
import os
with open('data.csv', 'rb') as csvfile:
reader = csv.reader(csvfile)
paths = [] #stores all paths
rows = [] #stores all lines
for row in reader:
paths.append(row[0].split("/")) #split path by "/"
rows.append(row)
commonprefix = os.path.commonprefix(paths) #finds prefix common to all paths
for row in rows:
row[0] = row[0].replace('/'.join(commonprefix)+'/', "") #remove prefix
rows
现在有一个列表列表,您可以将其写入文件
with open('data2.csv', 'wb') as csvfile:
writer = csv.writer(csvfile)
for row in rows:
writer.writerow(row)
以下Python脚本将读取您的文件(假设它看起来像您的示例),并将创建一个删除公用文件夹的版本:
import os.path, csv
finput = open("d:\\input.csv","r")
csv_input = csv.reader(finput, delimiter=" ", skipinitialspace=True)
csv_output = csv.writer(open("d:\\output.csv", "wb"), delimiter=" ")
# Create a set of unique folder names
set_folders = set()
for input_row in csv_input:
set_folders.add(os.path.split(input_row[0])[0])
# Determine the common prefix
base_folder = os.path.split(os.path.commonprefix(set_folders))[0]
nprefix = len(base_folder) + 1
# Go back to the start of the input CSV
finput.seek(0)
for input_row in csv_input:
csv_output.writerow([input_row[0][nprefix:]] + input_row[1:])
使用以下内容作为输入:
C:/Abc/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3
C:/Abc/Def/Test/temp/test/GLNext2/FILE0.link-link-0.vert 87 116 69
C:/Abc/Def/Test/temp/test/GLNext5/FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/Def/Test/temp/test/GLNext7/FILE0.link-link-0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0
输出如下:
GLNext/FILE0.frag 0 0 0
GLNext/FILE0.vert 0 0 0
GLNext/FILE0.link-link-0.frag 16 24 3
GLNext2/FILE0.link-link-0.vert 87 116 69
GLNext5/FILE0.link-link-0.vert.bin 75 95 61
GLNext7/FILE0.link-link-0 0 0
GLNext/FILE0.link-link-6 0 0 0
每列之间只有一个空格,尽管可以轻松更改。
所以我尝试了这样的事情
for dirName, subdirList, fileList in os.walk(Directory):
for fname in fileList:
if fname.endswith('.csv'):
for line in fileinput.input(os.path.join(dirName, fname), inplace = 1):
location = line.find(r'GLNext')
if location > 0:
location += len('GLNext')
print line.replace(line[:location], ".")
else:
print line
您可以为此使用pandas
库。 这样做,您可以利用pandas
对大型CSV文件(甚至数百MB)的出色处理。
码:
import pandas as pd
csv_file = 'test_csv.csv'
df = pd.read_csv(csv_file, header=None)
print df
print "-------------------------------------------"
path = "C:/Abc/bcd/Def/Test/temp/test/GLNext/"
df[0] = df[0].replace({path:""}, regex=True)
print df
# df.to_csv("truncated.csv") # Export to new file.
结果:
0 1 2 3
0 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
1 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
2 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 16 24 3
3 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 87 116 69
4 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 75 95 61
5 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 0 0 NaN
6 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 0 0 0
-------------------------------------------
0 1 2 3
0 FILE0.frag 0 0 0
1 FILE0.vert 0 0 0
2 FILE0.link-link-0.frag 16 24 3
3 FILE0.link-link-0.vert 87 116 69
4 FILE0.link-link-0.vert.bin 75 95 61
5 FILE0.link-link-0 0 0 NaN
6 FILE0.link-link-6 0 0 0
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.