
Split a big txt file to do grep - unix

I work (unix, shell scripts) with txt files that contain millions of fields separated by pipes and not separated by \n or \r. Something like this:

field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|

All the text is on the same line.

The number of fields per record is fixed for every file.

(In this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype.)

When I need to find a field (e.g. field2), a command like grep doesn't work, because everything is on the same line.

I think a good solution would be a script that inserts a "\n" after every 6 fields and then runs grep. Am I right? Thank you very much!

With awk:

$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|

$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf "%s|", $(i+j); printf "\n"}}' a

field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|

Here you can easily set the number of fields per line.
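As a minimal sketch of that idea, the number of fields per line can be passed in as an awk variable instead of being hard-coded. The variable name `n` and the file names `sample.txt` and `split.txt` are assumptions for illustration only.

```shell
# Build a small one-line sample (hypothetical data: 2 records of 3 fields).
printf 'a1|a2|a3|b1|b2|b3|\n' > sample.txt

# n is the number of fields per record; change it to match the file.
# The loop stops at i < NF because the trailing | leaves an empty last field.
awk -F'|' -v n=3 '{
    for (i = 1; i < NF; i += n) {
        for (j = 0; j < n; j++)
            printf "%s|", $(i + j)
        printf "\n"
    }
}' sample.txt > split.txt

cat split.txt
```

Once the records are on separate lines, `split.txt` can be searched with grep as usual.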

Hope this helps!

You can use sed to split the line into multiple lines:

 sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt

Explanation:

  • We have to use heavy backslash-escaping of (){}, which makes the code slightly unreadable.

  • But in short:

    • the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1 will match:

      • [^|]* any character but '|', repeated zero or more times

      • | followed by a '|'

      • the above is obviously one column, and it is grouped together with enclosing parentheses ( and )

      • the entire group is repeated 6 times {6}

      • and this is again grouped together with enclosing parentheses ( and ), to form one full set

The rest of the expression is easy to read:

  • replace the above (the entire dataset of 6 fields) with \1\n, the part between / and /g

  • \1 refers to the "first" group in the sed expression (the first group that is opened, so it's the entire dataset of 6 fields)

  • \n is the newline character

  • so replace the entire dataset of 6 fields by itself followed by a newline

  • and do so repeatedly (the trailing g)
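The backslash-free form used in the explanation can be run almost as written if sed is switched to extended regular expressions. The following is only a sketch, assuming GNU sed (for `-E` and for `\n` in the replacement) and a made-up sample line. In extended syntax the pipe outside the bracket expression has to be written as `\|`, because a bare `|` means alternation there.

```shell
# Two records of 6 fields on one line (hypothetical sample data).
line='a|b|c|d|e|f|g|h|i|j|k|l|'

# -E: extended regex, so ( ) { } need no backslashes; \| is a literal pipe.
result=$(printf '%s\n' "$line" | sed -E 's/(([^|]*\|){6})/\1\n/g')

printf '%s\n' "$result"
```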

You can use sed to insert a newline after every 6th |.

In my version of tcsh I can do:

sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename

Consider this shorter example, which uses \{2\} instead of \{6\} so that the line is split after every 2nd |:

> cat bla
a1|b2|c3|d4|

> sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' bla
a1|b2|
c3|d4|

This is how the regex works:

  • [^|] is any non-| character.
  • [^|]\+ is a sequence of at least one non-| character.
  • [^|]\+| is a sequence of at least one non-| character followed by a |.
  • \([^|]\+|\) is a sequence of at least one non-| character followed by a |, grouped together.
  • \([^|]\+|\)\{6\} is 6 consecutive such groups.
  • \(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.

The replacement just takes this sequence of 6 groups and adds a newline to the end.
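One consequence of requiring at least one character per field is that an empty field shifts the split points, whereas a `*` quantifier would also accept empty fields. A small illustration, assuming GNU sed (for `\+` and for `\n` in the replacement), a made-up sample line, and `\{3\}` instead of `\{6\}` to keep it short:

```shell
# Hypothetical sample: the second field of the first record is empty.
line='a||c|d|e|f|'

# \+ needs at least one character per field; * also accepts empty fields.
plus=$(printf '%s\n' "$line" | sed 's/\(\([^|]\+|\)\{3\}\)/\1\n/g')
star=$(printf '%s\n' "$line" | sed 's/\(\([^|]*|\)\{3\}\)/\1\n/g')

printf 'with \\+:\n%s\n' "$plus"
printf 'with *:\n%s\n' "$star"
```

With the `\+` form the first match starts at `c`, so the newline lands in the wrong place; with the `*` form the line is split after every third pipe. If the data can contain empty fields, the `*` form is probably the safer choice.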

Here is how I would do it with awk:

awk -v RS="|" '{printf "%s%s", $0, (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|

Just adjust the 7 in NR%7 to the number of fields per line that suits you.
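Because `RS="|"` turns every field into its own record, the same idea can also search one field position directly, without splitting the file first. This is only a sketch: the variables `n` (fields per record), `k` (field position to search) and `pat` (pattern), as well as the sample data and file names, are assumptions for illustration.

```shell
# Hypothetical one-line sample: 3 records of 3 fields (name|surname|phone).
printf 'ann|lee|111|bob|ray|222|cat|lee|333|\n' > people.txt

# With RS="|", the record number NR is the global field index.
# (NR-1)%n+1 is the field position inside its record,
# int((NR-1)/n)+1 is the number of the record it belongs to.
awk -v RS='|' -v n=3 -v k=2 -v pat='^lee$' '
    (NR - 1) % n + 1 == k && $0 ~ pat {
        print "record " (int((NR - 1) / n) + 1) ": " $0
    }
' people.txt > matches.txt

cat matches.txt
```

Here only the surname position is tested, so a `lee` appearing in any other field would not be reported.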

What about printing the line in blocks of six?

$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i<NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z

Explanation

  • BEGIN{FS=OFS="|"} sets the input and output field separators to |.
  • {for (i=1; i<NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of 6 and prints six of them each time. The condition is i<NF rather than i<=NF because the trailing | leaves an empty last field, which would otherwise start an extra block of its own. Since print ends its output with a newline, you are done.

If you want to treat the file as being on multiple lines, make \n the field separator. For example, to get the 2nd column, just do:

tr \| \\n < input-file | sed -n 2p

To see which columns match a regex, do:

tr \| \\n < input-file | grep -n regex 
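With this approach the line number reported by `grep -n` is the global field index, so the record and column can be recovered with a little arithmetic. A sketch, assuming 6 fields per record (the variable `n`), a made-up sample in `input-file`, and a hypothetical output file `located.txt`:

```shell
# Hypothetical sample: 2 records of 6 fields on a single line.
printf 'n1|s1|m1|e1|o1|k1|n2|s2|m2|e2|o2|k2|\n' > input-file

# grep -n prints "index:text"; awk turns the index into record and column.
# The text is taken after the first colon so that colons in the data survive.
tr '|' '\n' < input-file | grep -n '^s' |
awk -v n=6 '{
    pos = index($0, ":")
    idx = substr($0, 1, pos - 1)
    printf "record %d, column %d: %s\n", int((idx - 1) / n) + 1, (idx - 1) % n + 1, substr($0, pos + 1)
}' > located.txt

cat located.txt
```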
