
How to delete a column from a .csv file without reading the whole file

I generated a very big .csv file, and now it doesn't fit in RAM. So I decided to delete some unneeded columns to reduce the file size. How can I do that?

I tried data = pd.read_csv("file.csv", index_col=0, usecols=["id", "wall"]) but it still doesn't fit in RAM.

The file is about 1.5GB; RAM is 8GB.

Instead of deleting columns, you can read only specific columns from the csv file using a DictReader (if you're not using Pandas).

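import csv
from io import StringIO  # Python 3; in Python 2 this was `from StringIO import StringIO`

# Columns to keep, in the order they should appear
columns = 'AAA,DDD,FFF,GGG'.split(',')


testdata = '''\
AAA,bbb,ccc,DDD,eee,FFF,GGG,hhh
1,2,3,4,50,3,20,4
2,1,3,5,24,2,23,5
4,1,3,6,34,1,22,5
2,1,3,5,24,2,23,5
2,1,3,5,24,2,23,5
'''

# DictReader yields one dict per row, keyed by the header names
reader = csv.DictReader(StringIO(testdata))

# Generator expression: rows are processed lazily, one at a time
desired_cols = (tuple(row[col] for col in columns) for row in reader)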

Output:

>>> list(desired_cols)
[('1', '4', '3', '20'),
 ('2', '5', '2', '23'),
 ('4', '6', '1', '22'),
 ('2', '5', '2', '23'),
 ('2', '5', '2', '23')]

Source: https://stackoverflow.com/a/20065131/6633975
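
The same idea can be extended to write the selected columns straight to a new, smaller file, one row at a time, so memory use stays small regardless of file size. This is only a minimal sketch, assuming Python 3; the function name keep_columns and the file paths are hypothetical and not part of the original answer.

```python
import csv


def keep_columns(src_path, dst_path, columns):
    """Copy only `columns` from src_path to dst_path, one row at a time."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every key that is not listed in `columns`
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```

It could be called as keep_columns("file.csv", "file_small.csv", ["id", "wall"]), where file_small.csv is a placeholder output name.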

Using Pandas:

Here is an example illustrating the answer given by EdChum. There are many additional options for loading a CSV file; check the API reference.

import pandas as pd


raw_data = {'first_name': ['Steve', 'Guido', 'John'],
        'last_name': ['Jobs', 'Van Rossum', "von Neumann"]}
df = pd.DataFrame(raw_data)
# Save the data without a header row (the index is still written as column 0)
df.to_csv(path_or_buf='test.csv', header=False)
# Declare that the file has no header and load only column 1, the first name
df = pd.read_csv(filepath_or_buffer='test.csv', header=None, usecols=[1], names=['first_name'])
df

  first_name
0      Steve
1      Guido
2       John
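
If even the selected columns do not fit in memory at once, pandas can read the file in pieces using the chunksize parameter of read_csv, and each piece can be appended to a new, smaller file. The following is a rough sketch under that assumption; the function name drop_columns_in_chunks, the default chunk size, and the file paths are hypothetical.

```python
import pandas as pd


def drop_columns_in_chunks(src_path, dst_path, keep, chunksize=100_000):
    """Write only the `keep` columns of src_path to dst_path, chunk by chunk."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    chunks = pd.read_csv(src_path, usecols=keep, chunksize=chunksize)
    for i, chunk in enumerate(chunks):
        chunk.to_csv(
            dst_path,
            mode="w" if i == 0 else "a",  # overwrite on the first chunk, then append
            header=(i == 0),              # write the header only once
            index=False,
        )
```

For the file in the question this might look like drop_columns_in_chunks("file.csv", "file_small.csv", ["id", "wall"]). Only one chunk is held in memory at a time, so a smaller chunksize trades speed for a lower memory footprint.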

I am not sure whether this is possible in pandas, but you can do it on the command line. On Linux it looks like this:

cut -d, -f1,2,5- inputfile > outputfile

This drops columns 3 and 4 (fields are numbered from 1). The -d, option sets the delimiter to a comma; without it, cut splits on tabs.
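
One caveat: cut splits on every delimiter character and does not understand CSV quoting, so a quoted field that contains a comma would be split in the wrong place. A position-based equivalent that respects quoting can be sketched with Python's csv module; the function name drop_columns_by_position and the file paths are hypothetical.

```python
import csv


def drop_columns_by_position(src_path, dst_path, drop):
    """Copy src_path to dst_path, skipping the 1-based column numbers in `drop`."""
    drop = set(drop)
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Count from 1 so the numbers match cut's field numbering
            writer.writerow([v for i, v in enumerate(row, start=1) if i not in drop])
```

Calling drop_columns_by_position("inputfile", "outputfile", [3, 4]) removes the same columns as the cut command, reading and writing one row at a time.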
