Split file by lines and keep the first string as a header for output files
I have a file that looks like this:
t_#_3_0 v_0_17 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2
t_#_3_1 v_0_144 v_1_17 v_2_20 u_0_1 u_0_2 u_1_2
t_#_3_2 v_0_143 v_1_233 v_2_238 u_0_1 u_0_2 u_1_2
t_#_3_3 v_0_20 v_1_253 v_2_275 u_0_1 u_0_2 u_1_2
t_#_3_4 v_0_144 v_1_209 v_2_90 u_0_1 u_0_2 u_1_2
t_#_3_5 v_0_144 v_1_209 v_2_30 u_0_1 u_0_2 u_1_2
t_#_3_6 v_0_19 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2
t_#_3_7 v_0_20 v_1_7 v_2_78 u_0_1 u_0_2 u_1_2
t_#_3_8 v_0_16 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2
t_#_3_9 v_0_15 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2
I would like to split it line by line, using the first string of each line as the name of the new file. My desired output should look like this:
file 1: t_#_3_0.txt (containing the line: t_#_3_0 v_0_17 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2)
file 2: t_#_3_1.txt
file 3: t_#_3_2.txt
I tried the split command, but it produces numeric suffixes for the output files. I would appreciate your suggestions!
Thank you!
Olha
With GNU awk:
awk '{name=$1 ".txt"; print >name; close(name)}' file
The variable name contains the content of the first column with the suffix .txt appended. print > name writes the complete current row to the file name, and close(name) closes that file again before the next line is processed.
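A minimal check of the one-liner, assuming the sample lines sit in a file named input.txt (a name chosen just for this demo). The close(name) call matters once there are many distinct first fields: without it, awk keeps every output file open at once.

```shell
# Create two sample lines, run the one-liner, and inspect one output file.
printf 't_#_3_0 v_0_17 v_1_20\nt_#_3_1 v_0_144 v_1_17\n' > input.txt
awk '{name=$1 ".txt"; print > name; close(name)}' input.txt
cat 't_#_3_0.txt'   # prints the complete first row: t_#_3_0 v_0_17 v_1_20
```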
My understanding of the requirements:
One awk solution (assuming the raw data is in the file file.all):
$ awk '{ fn=$1".txt" ; print > fn ; close (fn) } ' file.all
$ for f in t*#*txt
do
echo "+++++++++++++ $f"
cat $f
done
+++++++++++++ t_#_3_0.txt
t_#_3_0 v_0_17 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_1.txt
t_#_3_1 v_0_144 v_1_17 v_2_20 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_2.txt
t_#_3_2 v_0_143 v_1_233 v_2_238 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_3.txt
t_#_3_3 v_0_20 v_1_253 v_2_275 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_4.txt
t_#_3_4 v_0_144 v_1_209 v_2_90 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_5.txt
t_#_3_5 v_0_144 v_1_209 v_2_30 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_6.txt
t_#_3_6 v_0_19 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_7.txt
t_#_3_7 v_0_20 v_1_7 v_2_78 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_8.txt
t_#_3_8 v_0_16 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2
+++++++++++++ t_#_3_9.txt
t_#_3_9 v_0_15 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2
These results kinda surprised me. :)
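For completeness, since the question started from split(1): split cannot name its outputs after the line content directly, but a hypothetical two-step sketch (split first, rename by first field afterwards) gets the same result. The input file name and the part_ prefix here are arbitrary demo choices:

```shell
printf 't_#_3_0 v_0_17\nt_#_3_1 v_0_144\n' > file   # demo input
split -l 1 file part_            # one line per file: part_aa, part_ab, ...
for f in part_*; do
  read -r first rest < "$f"      # first whitespace-separated field of the line
  mv -- "$f" "$first.txt"        # rename to t_#_3_0.txt, t_#_3_1.txt, ...
done
```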
sed
Just to pull something weird out of the toolbox as an example of why you should evaluate your methods now and then...
$: time sed -En 's/^([^ ]+)( .*)$/printf "%s%s\n" "\1" "\2" > \1.txt/e' file
real 0m0.859s
user 0m0.183s
sys 0m0.480s
I figured that would be slow, but it's nice to have in the toolbox just in case. Don't use pliers to drive nails.
awk
$: time awk '{ fn=$1".txt" ; print > fn ; close (fn) } ' file
real 0m0.141s
user 0m0.031s
sys 0m0.077s
Predictably faster, and likely by a lot more on a larger file.
But the one that surprised me -但让我惊讶的是——
bash
$: time while read line; do echo "$line" > "${line%%[ ]*}"; done < file
real 0m0.015s
user 0m0.000s
sys 0m0.016s
"${line%%[ ]*}"
has both a space and a tab between the square brackets [ ]
as field delimiters. "${line%%[ ]*}"
在方括号[ ]
之间有一个空格和一个制表符作为字段分隔符。
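A quick illustration of that expansion (only a space is shown between the brackets below; the original also carries a tab, which plain rendering tends to eat):

```shell
line='t_#_3_0 v_0_17 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2'
# %% removes the longest suffix matching "[ ]*", i.e. everything
# from the first space onward, leaving just the first field.
echo "${line%%[ ]*}"   # prints: t_#_3_0
```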
I think the time gain here is in the file management overhead. read is notoriously slow, but the OS is good at handling its own I/O, I suppose.
Maybe someone can chime in with a more in-depth analysis?
Elaborating the test out to a sample size of 10k records closes the gap between bash and awk a lot -
$: for x in {0..9999}; do echo "t_${x}_3_0 v_0_17 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2"; done >| file
$: time while read line; do echo "$line" > "${line%% *}"; done < file
real 0m24.022s
user 0m2.360s
sys 0m11.938s
$: time awk '{ fn=$1".txt" ; print > fn ; close (fn) } ' file
real 0m27.284s
user 0m1.312s
sys 0m12.656s
$: time sed -En 's/^([^ ]+)( .*)$/printf "%s%s\n" "\1" "\2" > \1.txt/e' file
real 13m28.503s
user 1m48.374s
sys 8m22.970s
I suspect a lot of the difference on a smaller dataset is awk's compile time, maybe?
Testing on 100k to confirm -
$: wc -c file # >5.5MB
5788890 file
bash:
real 8m42.666s
user 0m28.671s
sys 2m34.781s
awk:
real 8m15.096s
user 0m15.546s
sys 2m35.421s
I'm really surprised at how little difference there is. I assume it's because most of the time goes to file I/O operations.
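The comparison above can be reproduced with a small harness; the record count and file names below are arbitrary demo choices, and the exact timings will of course differ per machine:

```shell
# Generate n sample records, then time both approaches on the same data.
n=200
for ((x = 0; x < n; x++)); do
  echo "t_${x}_3_0 v_0_17 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2"
done > file

# awk: one output file per line, named after the first field plus .txt
time awk '{ fn = $1 ".txt"; print > fn; close(fn) }' file

# pure bash: same split, file named after the first field (no suffix)
time while read -r line; do
  echo "$line" > "${line%% *}"
done < file
```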