Removing the list of columns from csv file with index
I have a CSV file with the following contents:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,10,19,,,,,,,,,,,,,
2,11,20,,,,,,,,,,,,,
3,12,21,,,,,,,,,,,,,
4,13,22,,,,,,,,,,,,,
5,14,23,,,,,,,,,,,,,
6,15,24,,,,,,,,,,,,,
7,16,25,,,,,,,,,,,,,
8,17,26,,,,,,,,,,,,,
9,18,27,,,,,,,,,,,,,
I need to remove a set of columns by index.
I tried the following code, but it does not return the expected result. Can someone help me?
import csv

def read():
    with open("test.csv", "rb") as fp_in, open("newfile.csv", "wb") as fp_out:
        reader = csv.reader(fp_in, delimiter=",")
        writer = csv.writer(fp_out, delimiter=",")
        col_list = [0,1,2,3,4,5,6,8]
        for row in reader:
            for col_item in col_list:
                print(col_item)
                del row[int(col_item)]
            writer.writerow(row)

read()
Result returned:
1,3,5,7,9,11,13,14
10,,,,,,,
11,,,,,,,
12,,,,,,,
13,,,,,,,
14,,,,,,,
15,,,,,,,
16,,,,,,,
17,,,,,,,
18,,,,,,,
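The problem seems to be that the same row from the reader is reused across every iteration of the inner loop, whereas I need all of the columns in the list to be removed.
Could someone help me with this?
The required output should look like this: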
7,9,10,11,12,13,14,15
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
.
.
.
.
To be precise, I just want to delete the listed columns along with their values.
Edit:
A clearer example:
def read():
    with open("test.csv", "rb") as fp_in, open("newfile.csv", "wb") as fp_out:
        reader = csv.reader(fp_in, delimiter=",")
        writer = csv.writer(fp_out, delimiter=",")
        col_list = [0,2]
        for row in reader:
            for col_item in col_list:
                print(col_item)
                del row[int(col_item)]
            writer.writerow(row)

read()
The output I get:
1,2,4
v,d,q
c,s,a
s,d,d
f,x,c
Expected:
1,3,4
v,s,q
c,d,a
s,f,d
f,a,c
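The problem is that you are modifying the row on every iteration over col_list, so each deletion shifts the positions of the remaining items.
This should work; it uses a list comprehension to build a copy of the row without the items whose indexes are in col_list.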
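import csv

def read():
    # newline="" lets the csv module handle line endings itself (Python 3)
    with open("test.csv", "r", newline="") as fp_in, open("newfile.csv", "w", newline="") as fp_out:
        reader = csv.reader(fp_in, delimiter=",")
        writer = csv.writer(fp_out, delimiter=",")
        col_list = [0, 1, 2, 3, 4, 5, 6, 8]  # column indexes to drop
        for row in reader:
            # Build a new row instead of deleting in place, so indexes never shift
            output = [v for (i, v) in enumerate(row) if i not in col_list]
            writer.writerow(output)

read()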
This writes the following to newfile.csv:
7,9,10,11,12,13,14,15
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
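To see why the loop in the question removed the wrong columns, here is a small sketch of the index shift, together with one alternative fix: deleting in descending index order. The helper name drop_columns_in_place is hypothetical and is not part of the answer above.

```python
def drop_columns_in_place(row, col_list):
    """Delete the given positions from row, highest index first.

    Deleting from the right means earlier deletions never shift
    the positions that still have to be removed.
    """
    for index in sorted(set(col_list), reverse=True):
        if index < len(row):  # ignore positions beyond a short row
            del row[index]
    return row


header = [str(n) for n in range(5)]  # ['0', '1', '2', '3', '4']

# Naive approach from the question: each del shifts later items left.
naive = list(header)
for index in [0, 2]:
    del naive[index]

fixed = drop_columns_in_place(list(header), [0, 2])

print(naive)  # ['1', '2', '4']
print(fixed)  # ['1', '3', '4']
```

The list comprehension above is probably still the simpler choice, since it never mutates the row; the in-place variant is mainly useful for seeing what went wrong.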
You can do something like this.
Assuming your input file is named input.txt:
with open('input.txt', 'r') as f:
    data = [k.split(',') for k in f.read().splitlines()]
for k in data:
    print(k[7] + ',' + ','.join(k[9:]))
And if you want to save the result to a file (for example, final_file.txt), you can do the following:
# 'a' appends to the file; use 'w' instead to overwrite it on each run
with open("final_file.txt", 'a') as f:
    for k in data:
        f.write(k[7] + ',' + ','.join(k[9:]) + '\n')
Output:
7,9,10,11,12,13,14,15
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
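One caveat with splitting on commas by hand: it breaks if a field contains a comma inside quotes. The sample data in the question has no such fields, so the code above works for it. The sketch below uses a made-up line purely to illustrate how the result differs from what the csv module returns.

```python
import csv
import io

# Hypothetical line with a quoted field containing a comma.
line = 'a,"b,c",d\n'

naive_fields = line.rstrip("\n").split(",")
parsed_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)   # ['a', '"b', 'c"', 'd']
print(parsed_fields)  # ['a', 'b,c', 'd']
```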
You can try using pandas drop to remove specific columns and then write the result to a CSV file:
import pandas as pd
df = pd.read_csv('test.csv')
df = df.drop(['0','1','2','3','4','5','6','8'], axis=1)
df.to_csv('newfile.csv',index=False)
newfile.csv will be:
7,9,10,11,12,13,14,15
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
,,,,,,,
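The drop call above works because the header labels in this file happen to match the column positions. As a hedged sketch for the general case, the positions can be mapped to labels through df.columns first. The inline csv_text is a shortened stand-in for test.csv, and the variable names are illustrative only.

```python
import io

import pandas as pd

# Inline stand-in for test.csv (first rows of the sample data).
csv_text = (
    "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\n"
    "1,10,19,,,,,,,,,,,,,\n"
    "2,11,20,,,,,,,,,,,,,\n"
)

col_list = [0, 1, 2, 3, 4, 5, 6, 8]  # positions to drop

df = pd.read_csv(io.StringIO(csv_text))
# df.columns[col_list] maps positions to whatever the labels happen to be.
trimmed = df.drop(columns=df.columns[col_list])

result = trimmed.to_csv(index=False)
print(result)
```

This way the same col_list of integer positions can be reused even when the header row contains arbitrary names.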