
Split file by lines and keep the first string as a header for output files

I have a file that looks like this:

t_#_3_0 v_0_17  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
t_#_3_1 v_0_144 v_1_17  v_2_20  u_0_1   u_0_2   u_1_2
t_#_3_2 v_0_143 v_1_233 v_2_238 u_0_1   u_0_2   u_1_2
t_#_3_3 v_0_20  v_1_253 v_2_275 u_0_1   u_0_2   u_1_2
t_#_3_4 v_0_144 v_1_209 v_2_90  u_0_1   u_0_2   u_1_2
t_#_3_5 v_0_144 v_1_209 v_2_30  u_0_1   u_0_2   u_1_2
t_#_3_6 v_0_19  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
t_#_3_7 v_0_20  v_1_7   v_2_78  u_0_1   u_0_2   u_1_2
t_#_3_8 v_0_16  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
t_#_3_9 v_0_15  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2

I would like to split it line by line and keep the first string as the header for my new files. My desired output should look like this:

file 1: t_#_3_0.txt (containing the line: t_#_3_0 v_0_17 v_1_20 v_2_78 u_0_1 u_0_2 u_1_2)
file 2: t_#_3_1.txt
file 3: t_#_3_2.txt

I tried the split command, but it uses numeric suffixes for the output file names.

I would appreciate your suggestions!

Thank you!

Olha

With GNU awk:

awk '{name=$1 ".txt"; print >name; close(name)}' file

The variable name contains the content of the first column with the suffix .txt. print >name writes the complete current row to the file name, and close(name) closes the file again so the script does not accumulate open file descriptors.
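A minimal end-to-end run of that one-liner on a two-line sample (the scratch directory and shortened sample rows here are just for illustration):

```shell
# Run the one-liner on a tiny sample and inspect one result.
cd "$(mktemp -d)"                                  # scratch directory

printf 't_#_3_0 v_0_17 v_1_20\nt_#_3_1 v_0_144 v_1_17\n' > file

awk '{name=$1 ".txt"; print >name; close(name)}' file

cat 't_#_3_0.txt'    # prints: t_#_3_0 v_0_17 v_1_20
```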

My understanding of the requirements:

  • each line in the source file is to be copied to a new file
  • each new file is named after the contents of field (column) #1

One awk solution (assuming the raw data is in the file file.all):

$ awk '{ fn=$1".txt" ; print > fn ; close(fn) }' file.all
$ for f in t*#*txt
do
echo "+++++++++++++ $f"
cat "$f"
done

+++++++++++++ t_#_3_0.txt
t_#_3_0 v_0_17  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_1.txt
t_#_3_1 v_0_144 v_1_17  v_2_20  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_2.txt
t_#_3_2 v_0_143 v_1_233 v_2_238 u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_3.txt
t_#_3_3 v_0_20  v_1_253 v_2_275 u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_4.txt
t_#_3_4 v_0_144 v_1_209 v_2_90  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_5.txt
t_#_3_5 v_0_144 v_1_209 v_2_30  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_6.txt
t_#_3_6 v_0_19  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_7.txt
t_#_3_7 v_0_20  v_1_7   v_2_78  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_8.txt
t_#_3_8 v_0_16  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2
+++++++++++++ t_#_3_9.txt
t_#_3_9 v_0_15  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2

These results kinda surprised me. :)

sed

Just to pull something weird out of the toolbox, as an example of why you should re-evaluate your methods now and then...

$: time sed -En 's/^([^ ]+)( .*)$/printf "%s%s\n" "\1" "\2" > \1.txt/e' file
real    0m0.859s
user    0m0.183s
sys     0m0.480s

I figured that would be slow, but it's nice to have in the toolbox just in case. Don't use pliers to drive nails.
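For reference, the trick above relies on GNU sed's e flag to the s/// command, which executes the post-substitution pattern space as a shell command and replaces it with that command's output (a GNU extension, not POSIX). A minimal illustration:

```shell
# The s///e flag rewrites each line into a shell command, runs it, and
# keeps the command's output (GNU sed only).
echo 'world' | sed -E 's/^(.*)$/echo "hello \1"/e'
# prints: hello world
```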

awk

$: time awk '{ fn=$1".txt" ; print > fn ; close (fn) } ' file
real    0m0.141s
user    0m0.031s
sys     0m0.077s

Predictably faster, and likely by a lot more on a larger file.

But the one that surprised me -

Pure bash

$: time while read line; do echo "$line" > "${line%%[   ]*}"; done < file
real    0m0.015s
user    0m0.000s
sys     0m0.016s

NOTE - edited to use spaces AND/OR tabs

"${line%%[ ]*}" has both a space and a tab between the square brackets [ ] as field delimiters.
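Since the literal tab is easy to lose when copying, here is a sketch of the same expansion written with the POSIX [[:space:]] class instead (it covers both space and tab, so it behaves the same on this data):

```shell
# "%%" strips the longest matching suffix, so everything from the first
# whitespace character onward is removed, leaving only the first field.
line='t_#_3_0 v_0_17 v_1_20'
first=${line%%[[:space:]]*}
echo "$first"    # prints: t_#_3_0
```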

I think the time gain here is in the file-management overhead. read is notoriously slow, but the OS is good at handling its own I/O, I suppose.

Maybe someone can chime in with a more in-depth analysis?
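As an aside, read without -r treats backslashes specially and the default IFS trims leading whitespace, so a hardened sketch of the same loop looks like this (the .txt suffix is my addition to match the awk versions' file names, which the original loop omitted; the sample rows are hypothetical):

```shell
# Hardened variant: IFS= read -r keeps each line byte-for-byte, and the
# file name gets the same .txt suffix the awk versions produce.
cd "$(mktemp -d)"
printf 'a_1 x y\nb_2 p q\n' > file    # hypothetical sample data

while IFS= read -r line; do
    printf '%s\n' "$line" > "${line%%[[:space:]]*}.txt"
done < file

cat a_1.txt    # prints: a_1 x y
```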

Extending the test to a sample size of 10k records closes the gap between bash and awk a lot -

$: for x in {0..9999}; do echo "t_${x}_3_0 v_0_17  v_1_20  v_2_78  u_0_1   u_0_2   u_1_2"; done >| file

$: time while read line; do echo "$line" > "${line%% *}"; done < file
real    0m24.022s
user    0m2.360s
sys     0m11.938s

$: time awk '{ fn=$1".txt" ; print > fn ; close (fn) } ' file
real    0m27.284s
user    0m1.312s
sys     0m12.656s

$: time sed -En 's/^([^ ]+)( .*)$/printf "%s%s\n" "\1" "\2" > \1.txt/e' file
real    13m28.503s
user    1m48.374s
sys     8m22.970s

I suspect a lot of the difference on the smaller dataset is awk's startup and compile time, maybe?

Testing on 100k to confirm -

$: wc -c file # >5.5MB
5788890 file

bash:

real    8m42.666s
user    0m28.671s
sys     2m34.781s

awk:

real    8m15.096s
user    0m15.546s
sys     2m35.421s

I'm really surprised at how little difference there is.
I assume it's because most of the time is spent in file I/O operations.
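If you're tuning further, one variable worth testing is the per-line close(): GNU awk manages its own pool of open file descriptors (closing and reopening in append mode when it runs out), so dropping close() trades descriptor bookkeeping for fewer open/close syscalls. A sketch, with the caveat that behavior at very large distinct-name counts depends on your ulimit and gawk version, so benchmark before trusting it:

```shell
# Same split, but let gawk manage descriptors instead of closing per line.
cd "$(mktemp -d)"
printf 'k_1 a b\nk_2 c d\n' > file

awk '{print > ($1 ".txt")}' file

cat k_1.txt    # prints: k_1 a b
```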
