Rename a column header in csv using python pandas
I have some giant CSV files (around 23 GB each), and I want to do the following with their column headers:
If there is a column named SFID, rename column "Id" to "IgnoreId" and rename column "SFID" to "Id". Otherwise, do nothing.
All the Google search results I see are about how to import the CSV into a dataframe, rename the column, and export it back to a CSV.
To me that feels like a giant waste of time and memory, because we are effectively only working with the very first row of the CSV file (the header row). I don't know whether it is necessary to load the whole CSV as a dataframe and export it to a new CSV (or export it to the same CSV, effectively overwriting it).
Because the CSVs are huge, I have to load them in small chunks and perform the operation chunk by chunk, which takes time and memory. Again, this feels like a waste, because apart from the headers we are not really doing anything with the remaining chunks.
Is there a way to load just the header of a CSV file, change it, and save it back into the same CSV file?
I am open to using something other than pandas as well. The only real constraint is that the CSV files are too big to just double-click and open.
Write the header row first and copy the data rows using shutil.copyfileobj.
shutil.copyfileobj took 38 seconds for a 0.5 GB file, whereas fileinput took 125 seconds for the same file.
Using shutil.copyfileobj
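import shutil
import pandas as pd

filename = "myfile.csv"  # placeholder path; adjust as needed

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns) + "\n"
    # modify header in csv file
    # Caution: f1 and f2 are two handles on the same file, and the renamed
    # header is 4 bytes longer than the old one, so the file is rewritten
    # in place while it is being read; try this on a copy first.
    with open(filename, "r+") as f1, open(filename, "r+") as f2:
        f1.readline()  # to move the pointer after header row
        f2.write(header_row)
        shutil.copyfileobj(f1, f2)  # copies the data rows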
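If rewriting the file in place while reading it feels too risky, the same idea (write the header first, then stream the data rows with shutil.copyfileobj) can be done through a temporary file. This is only a sketch under some assumptions: `replace_header` is a hypothetical helper name, the file is assumed to be UTF-8 with `\n` line endings, and there must be enough free disk space for a second copy of the file.

```python
import os
import shutil
import tempfile


def replace_header(path, new_header_line):
    """Rewrite the first line of `path`, streaming the other lines unchanged.

    Hypothetical helper for illustration; not part of any library.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so that os.replace
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            src.readline()                                  # skip the old header
            dst.write(new_header_line.encode("utf-8") + b"\n")
            shutil.copyfileobj(src, dst)                    # stream the data rows
        os.replace(tmp_path, path)                          # swap in the new file
    except BaseException:
        os.remove(tmp_path)                                 # clean up on failure
        raise
```

Called as `replace_header("myfile.csv", "IgnoreId,Id,Name")`, the original file is only replaced once the copy has finished, so an interrupted run leaves it untouched.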
Using fileinput
import fileinput
import pandas as pd

filename = "myfile.csv"  # placeholder path; adjust as needed

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns)
    # modify header in csv file
    f = fileinput.input(filename, inplace=True)
    for line in f:
        if fileinput.isfirstline():
            print(header_row)
        else:
            print(line, end='')
    f.close()
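Since the question is open to something other than pandas, the header can also be read with the standard csv module, which pulls only the first record from the file. A minimal sketch, assuming the default text encoding is right for the file; `renamed_header` is a hypothetical helper name:

```python
import csv


def renamed_header(path):
    """Return the renamed header fields, or None if there is no SFID column."""
    with open(path, newline="") as f:
        # next() pulls a single record; the data rows are never read.
        header = next(csv.reader(f), [])
    if "SFID" not in header:
        return None
    mapping = {"Id": "IgnoreId", "SFID": "Id"}
    # Fields are compared as whole names, so other columns are left alone.
    return [mapping.get(name, name) for name in header]
```

The result can be joined with `",".join(...)` to build the new header row for either of the two approaches above.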
For a huge file, a simple command-line solution with the stream editor sed might be faster than a Python script:
sed -e '1 {/SFID/ {s/Id/IgnoreId/; s/SFID/Id/}}' -i myfile.csv
This changes Id to IgnoreId and SFID to Id in the first line if it contains SFID. If other column headers also contain the string Id (e.g. ImportantId), you'll have to refine the regexes in the s commands accordingly.
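To illustrate what a refined pattern could look like, here is a sketch in Python rather than sed. It assumes the header fields are unquoted, comma-separated, and that the line has already been stripped of its newline; `rename_fields` and the pattern names are made up for this example. The lookarounds only accept a match that spans a whole field, so ImportantId is not touched.

```python
import re

RENAMES = {"Id": "IgnoreId", "SFID": "Id"}

# A field must be preceded by a comma or the start of the line,
# and followed by a comma or the end of the line.
FIELD = re.compile(r"(?<![^,])(Id|SFID)(?![^,])")
HAS_SFID = re.compile(r"(?<![^,])SFID(?![^,])")


def rename_fields(header_line):
    """Rename whole fields in a header line; do nothing without an SFID column."""
    if not HAS_SFID.search(header_line):
        return header_line
    # A single pass, so the new "Id" is never renamed again.
    return FIELD.sub(lambda m: RENAMES[m.group(1)], header_line)
```

For a header such as `ImportantId,Id,SFID` this gives `ImportantId,IgnoreId,Id`, whereas a plain substring substitution would hit the `Id` inside `ImportantId` first.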