How to break a large “.csv” file into small files based on multiple conditions?
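I have large .csv files (~40MB) and I would like to break them into smaller files based on a few conditions, naming each file according to its data:
Here is the tricky part:
I have something like this in VBA, but it is too slow for large files, and Excel sometimes crashes. With several files like this, it takes a lot of time to cut them down by hand and then run them through VBA.
Is it possible to split a file on this many conditions?
Thanks in advance for your help.
Example (the first row gives the column numbers):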
1 2 3 4 11 15 17
Date Time COUNTRY CITY CHECK TEST TEST2
2015-08-20 11:54 ENGLAND ABINGDON 1 1
2015-08-21 12:54 ENGLAND BATLEY 2 5
2015-08-22 13:54 ENGLAND FROME 2 6
2015-08-23 14:54 ENGLAND FROME 2 1
2015-08-24 15:54 USA CALIFORNIA 4 8
2015-08-25 16:54 USA CONNECTICUT 4 9
2015-08-26 17:54 USA DELAWARE 1 8
2015-08-27 18:54 GERMANY SAXONY 6 9
2015-08-28 19:54 GERMANY SAXONY 6 10
2015-08-27 18:54 GERMANY SAXONY 4 11
2015-08-28 19:54 GERMANY SAXONY 4 14
2015-08-29 20:54 GERMANY HESSE 8
2015-08-29 20:54 GERMANY HESSE 1 8
File1
2015-08-20 11:54 ENGLAND ABINGDON 1 1
File2
2015-08-21 12:54 ENGLAND BATLEY 2 5
File3
2015-08-22 13:54 ENGLAND FROME 2 6
File4
2015-08-23 14:54 ENGLAND FROME 2 1
File5
2015-08-24 15:54 USA CALIFORNIA 4 8
File6
2015-08-25 16:54 USA CONNECTICUT 4 9
File7
2015-08-26 17:54 USA DELAWARE 1 8
File8
2015-08-27 18:54 GERMANY SAXONY 4 9
2015-08-28 19:54 GERMANY SAXONY 4 10
File9
2015-08-27 18:54 GERMANY SAXONY 6 11
2015-08-28 19:54 GERMANY SAXONY 6 14
File10
2015-08-29 20:54 GERMANY HESSE 8
File11
2015-08-29 20:54 GERMANY HESSE 1 8
Your data is all over the place! It is not in the columns you describe, and it is not tab-separated either. That does not make life easy!
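One quick way to check whether a file really is tab-separated is to count the fields awk sees on each line. This is only a sketch on a made-up two-line sample (the temporary file and its contents are assumptions, not the asker's data). A genuinely tab-separated file with 17 columns should report a single field count of 17 or more.

```shell
#!/bin/bash
# Hypothetical two-line sample: the first line is tab-separated, the second uses spaces.
sample=$(mktemp)
printf 'a\tb\tc\nd e f\n' > "$sample"

# For each distinct number of tab-separated fields, count how many lines have it.
field_counts=$(awk -F'\t' '{n[NF]++} END{for(k in n) print k, n[k]}' "$sample" | sort -n)
echo "$field_counts"
```

Each output line is "field count, number of lines". More than one distinct field count suggests that some lines are not delimited the way you expect.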
Try this awk with your actual data to see whether it generates output file names you can use.
awk -F'\t' '{
   f=$3 "_" $4                   # filename = field3 _ field4
   if(length($11)){              # if f11 not null
      f=f "_A_" $11 "_" $17      # filename = filename _A_ field11 _ field17
   }else{                        # else
      f=f "_B_" $15 "_" $17      # filename = filename _B_ field15 _ field17
   }
   print f}' file.csv
You should get something like this
ENGLAND_ABINGDON_A_3_1
ENGLAND_ABINGDON_A_4_2
GERMANY_SAXONY_B_5_3
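Basically, it uses awk and tells it that the field separator is a tab. It then looks at each line and builds the output file name in the variable f from the fields you described.
If this splits your data the way you intend, you can write the current line to a file with the corresponding name simply by changing the last line: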
awk -F'\t' '{
   f=$3 "_" $4                   # filename = field3 _ field4
   if(length($11)){              # if f11 not null
      f=f "_A_" $11 "_" $17      # filename = filename _A_ field11 _ field17
   }else{                        # else
      f=f "_B_" $15 "_" $17      # filename = filename _B_ field15 _ field17
   }
   print > f}' file.csv
Basically, it now prints each line to the file instead of printing the file's name, because the last line changed from
print f
to
print > f
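Accounting for headers
If you want the header on each output file, we will need to work a bit harder...
First, we need to save the header from the original file, so if we assume it is record number 1, we would do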
awk -F'\t' '
NR==1 {header=$0; next}   # save first line as header, then skip it as data
{f=$3 "_" $4              # filename = field3 _ field4
...
... as before
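Now, whenever we start writing a new file, we need to output the header line first, which is "fun" because we are creating the output file names dynamically for each line! So we need to "remember" which files we have already written to, and emit the header only when we write to a new one. I don't have a suitable data set here, so this is partly guesswork!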
awk -F'\t' '
NR==1 {header=$0; next}          # save first line as header, then skip it as data
{  f=$3 "_" $4                   # filename = field3 _ field4
   if(length($11)){              # if f11 not null
      f=f "_A_" $11 "_" $17      # filename = filename _A_ field11 _ field17
   }else{                        # else
      f=f "_B_" $15 "_" $17      # filename = filename _B_ field15 _ field17
   }
   # Emit header if first write to this filename
   if(!(f in fileswritten)){
      fileswritten[f]++          # note that we have written to this file
      print header > f           # emit header
   }
   print > f}' file.csv
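One practical caveat: every distinct file name stays open until awk exits, and with a large input that produces many output files some awk implementations run out of file descriptors. A possible workaround, sketched below, is to close the previous file whenever the name changes and to append with >> so that reopening a file never truncates it. The sample rows, the row helper, the .csv suffix and the sample.tsv name are assumptions made up for illustration. Only columns 3, 4, 11, 15 and 17 matter here.

```shell
#!/bin/bash
# Hypothetical sample: 17 tab-separated columns; only columns 1-4, 11, 15 and 17 are filled.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

row() {  # args: date time country city col11 col15 col17
  printf '%s\t%s\t%s\t%s\t\t\t\t\t\t\t%s\t\t\t\t%s\t\t%s\n' "$@"
}
{
  row Date Time COUNTRY CITY CHECK TEST TEST2
  row 2015-08-20 11:54 ENGLAND ABINGDON 1 '' 1
  row 2015-08-27 18:54 GERMANY SAXONY 6 '' 9
  row 2015-08-29 20:54 GERMANY HESSE '' 8 2
  row 2015-08-30 09:00 ENGLAND ABINGDON 1 '' 1
} > sample.tsv

awk -F'\t' '
NR==1 {header=$0; next}            # keep the header, do not split it as data
{
   f=$3 "_" $4
   if(length($11)){
      f=f "_A_" $11 "_" $17
   }else{
      f=f "_B_" $15 "_" $17
   }
   f=f ".csv"
   if(f != prev){                  # switching files: release the previous one
      if(prev != "") close(prev)
      prev=f
   }
   if(!(f in seen)){               # first time: truncate the file and write the header
      seen[f]=1
      print header > f
      close(f)
   }
   print >> f                      # append, so reopening never truncates
}' sample.tsv

ls "$workdir"
```

With this sample, the two ENGLAND/ABINGDON rows end up in the same file even though other rows sit between them. If the real data is already grouped by file name, each output file is opened only a few times, so the extra close calls cost little.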
This answer is incomplete, but it roughly illustrates what you need to do:
#!/bin/bash
# Get list of countries (drop the column-number row and the header row):
countries=$(cut -f 3 -d$'\t' file1.csv | grep -v -x 3 | grep -v -x COUNTRY | sort -u)
for country in ${countries}; do
  # Get list of cities per country:
  cities=$(grep "${country}" file1.csv | cut -f 4 -d$'\t' | sort -u)
  # Get data per country:
  grep "${country}" file1.csv > "file1-${country}.csv"
  # Get data per city per country:
  for city in ${cities}; do
    echo "${country}:${city}"
    grep "${country}" file1.csv | grep "${city}" > "file1-${country}-${city}.csv"
    # Created output by 2 previous operations check if there is any data in 11th column,
    # if yes then separate this data accordingly to content and after that separate that
    # by content of 17th column -> then save outputs /OR / AND /
    # Column 11 is at position 5 in your data.
    checks=$(cut -f 5 -d$'\t' "file1-${country}-${city}.csv" | sort -u)
    for check in ${checks}; do
      echo "${country}:${city}:${check}"
      # Keep only the rows whose 5th field equals this check value
      awk -F'\t' -v c="${check}" '$5 == c' "file1-${country}-${city}.csv" > "file1-${country}-${city}-${check}.csv"
      # TODO: Further split this, I assume you get the drift by now.
    done
    # If there is no data in column 11 check column 15th and separate accordingly.
    # Next check 17 column and separate this data by 17th column -> save outputs.
    # TODO: Further split this, I assume you get the drift by now.
  done
done
I suggest writing a small program that uses the Java library Apache Commons CSV (its CSVFormat class):
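private static final String[] FILE_HEADER_MAPPING = {"Date", "Time" ,"COUNTRY", .... };
// Tab-delimited format that maps each column to the names listed above
CSVFormat csvFileFormat = CSVFormat.TDF.withHeader(FILE_HEADER_MAPPING);
CSVParser csvFileParser = new CSVParser(fileReader, csvFileFormat);
List<CSVRecord> csvRecords = csvFileParser.getRecords();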
Then, to access column 11, you have to do
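for (int i = 1; i < csvRecords.size(); i++) { // start at 1 to skip the header row
    CSVRecord record = csvRecords.get(i);
    // Access column 11 by the name given to it in FILE_HEADER_MAPPING
    // ("CHECK" in the sample data); record.get(10) reads it by zero-based index instead
    String column11 = record.get("CHECK");
}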