
Split file by lines and keep the first string as a header for output files

I have a file that looks like this:

t_#_3_0 v_0_17  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
t_#_3_1 v_0_144 v_1_17  v_2_20  u_0_1   u_0_2   u_1_2
t_#_3_2 v_0_143 v_1_233 v_2_238 u_0_1   u_0_2   u_1_2
t_#_3_3 v_0_20  v_1_253 v_2_275 u_0_1   u_0_2   u_1_2
t_#_3_4 v_0_144 v_1_209 v_2_90  u_0_1   u_0_2   u_1_2
t_#_3_5 v_0_144 v_1_209 v_2_30  u_0_1   u_0_2   u_1_2
t_#_3_6 v_0_19  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
t_#_3_7 v_0_20  v_1_7   v_2_78  u_0_1   u_0_2   u_1_2
t_#_3_8 v_0_16  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
t_#_3_9 v_0_15  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2

I would like to split it line by line and keep the first string as the header for my new files. My desired output should look like this:

file 1: t_#_3_0.txt (containing the line: t_#_3_0 v_0_17 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2)
file 2: t_#_3_1.txt
file 3: t_#_3_2.txt

I tried the split command, but it uses numeric suffixes for the output file names.

I would appreciate your suggestions!

Thank you!

Olha

With GNU awk:

awk '{name=$1 ".txt"; print >name; close(name)}' file

The variable name contains the content of the first column with the suffix .txt. print >name writes the complete current row to the file name, and close(name) closes the file again so the script does not accumulate open file descriptors.
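A minimal end-to-end run of that one-liner on a two-line sample (the scratch directory and shortened sample rows here are just for illustration):

```shell
# Run the one-liner on a tiny sample and inspect one result.
cd "$(mktemp -d)"                                  # scratch directory

printf 't_#_3_0 v_0_17 v_1_20\nt_#_3_1 v_0_144 v_1_17\n' > file

awk '{name=$1 ".txt"; print >name; close(name)}' file

cat 't_#_3_0.txt'    # prints: t_#_3_0 v_0_17 v_1_20
```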

My understanding of the requirements:

  • each line in the source file is to be copied to a new file
  • each new file is named after the contents of field (column) #1

One awk solution (assuming the raw data is in the file file.all):

$ awk '{ fn=$1".txt" ; print > fn ; close(fn) }' file.all
$ for f in t*#*txt
do
echo "+++++++++++++ $f"
cat "$f"
done

+++++++++++++ t_#_3_0.txt
t_#_3_0 v_0_17  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_1.txt
t_#_3_1 v_0_144 v_1_17  v_2_20  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_2.txt
t_#_3_2 v_0_143 v_1_233 v_2_238 u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_3.txt
t_#_3_3 v_0_20  v_1_253 v_2_275 u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_4.txt
t_#_3_4 v_0_144 v_1_209 v_2_90  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_5.txt
t_#_3_5 v_0_144 v_1_209 v_2_30  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_6.txt
t_#_3_6 v_0_19  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_7.txt
t_#_3_7 v_0_20  v_1_7   v_2_78  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_8.txt
t_#_3_8 v_0_16  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_9.txt
t_#_3_9 v_0_15  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2

These results kinda surprised me. :)

sed

Just to pull something weird out of the toolbox, as an example of why you should re-evaluate your methods now and then...

$: time sed -En 's/^([^ ]+)( .*)$/printf "%s%s\n" "\1" "\2" > \1.txt/e' file
real    0m0.859s
user    0m0.183s
sys     0m0.480s

I figured that would be slow, but it's nice to have in the toolbox just in case. Don't use pliers to drive nails.
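For reference, the trick above relies on GNU sed's e flag to the s/// command, which executes the post-substitution pattern space as a shell command and replaces it with that command's output (a GNU extension, not POSIX). A minimal illustration:

```shell
# The s///e flag rewrites each line into a shell command, runs it, and
# keeps the command's output (GNU sed only).
echo 'world' | sed -E 's/^(.*)$/echo "hello \1"/e'
# prints: hello world
```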

awk

$: time awk '{ fn=$1".txt" ; print > fn ; close (fn) } ' file
real    0m0.141s
user    0m0.031s
sys     0m0.077s

Predictably faster, and likely by a lot more on a larger file.

But the one that surprised me -

Pure bash

$: time while read line; do echo "$line" > "${line%%[   ]*}"; done < file
real    0m0.015s
user    0m0.000s
sys     0m0.016s

NOTE - edited to use spaces AND/OR tabs

"${line%%[ ]*}" has both a space and a tab between the square brackets [ ] as field delimiters.
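Since the literal tab is easy to lose when copying, here is a sketch of the same expansion written with the POSIX [[:space:]] class instead (it covers both space and tab, so it behaves the same on this data):

```shell
# "%%" strips the longest matching suffix, so everything from the first
# whitespace character onward is removed, leaving only the first field.
line='t_#_3_0 v_0_17 v_1_20'
first=${line%%[[:space:]]*}
echo "$first"    # prints: t_#_3_0
```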

I think the time gain here is in the file-management overhead. read is notoriously slow, but the OS is good at handling its own I/O, I suppose.

Maybe someone can chime in with a more in-depth analysis?
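As an aside, read without -r treats backslashes specially and the default IFS trims leading whitespace, so a hardened sketch of the same loop looks like this (the .txt suffix is my addition to match the awk versions' file names, which the original loop omitted; the sample rows are hypothetical):

```shell
# Hardened variant: IFS= read -r keeps each line byte-for-byte, and the
# file name gets the same .txt suffix the awk versions produce.
cd "$(mktemp -d)"
printf 'a_1 x y\nb_2 p q\n' > file    # hypothetical sample data

while IFS= read -r line; do
    printf '%s\n' "$line" > "${line%%[[:space:]]*}.txt"
done < file

cat a_1.txt    # prints: a_1 x y
```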

Extending the test to a sample size of 10k records closes the gap between bash and awk a lot -

$: for x in {0..9999}; do echo "t_${x}_3_0 v_0_17  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2"; done >| file

$: time while read line; do echo "$line" > "${line%% *}"; done < file
real    0m24.022s
user    0m2.360s
sys     0m11.938s

$: time awk '{ fn=$1".txt" ; print > fn ; close (fn) } ' file
real    0m27.284s
user    0m1.312s
sys     0m12.656s

$: time sed -En 's/^([^ ]+)( .*)$/printf "%s%s\n" "\1" "\2" > \1.txt/e' file
real    13m28.503s
user    1m48.374s
sys     8m22.970s

I suspect a lot of the difference on the smaller dataset is awk's startup and compile time, maybe?

Testing on 100k to confirm -

$: wc -c file # >5.5MB
5788890 file

bash:

real    8m42.666s
user    0m28.671s
sys     2m34.781s

awk:

real    8m15.096s
user    0m15.546s
sys     2m35.421s

I'm really surprised at how little difference there is.
I assume it's because most of the time is spent in file I/O operations.
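If you're tuning further, one variable worth testing is the per-line close(): GNU awk manages its own pool of open file descriptors (closing and reopening in append mode when it runs out), so dropping close() trades descriptor bookkeeping for fewer open/close syscalls. A sketch, with the caveat that behavior at very large distinct-name counts depends on your ulimit and gawk version, so benchmark before trusting it:

```shell
# Same split, but let gawk manage descriptors instead of closing per line.
cd "$(mktemp -d)"
printf 'k_1 a b\nk_2 c d\n' > file

awk '{print > ($1 ".txt")}' file

cat k_1.txt    # prints: k_1 a b
```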
