Split a big txt file to do grep - UNIX

Question

I work (Unix, shell scripts) with txt files that contain millions of fields separated by pipes, with no \n or \r between them. Something like this:

field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|

All the text is on the same line.

The number of fields per record is fixed for every file.

(In this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype.)

When I need to find a field (e.g. field2), commands like grep don't help, because everything is on the same line.

I think a good solution would be a script that inserts a "\n" after every 6 fields and then runs grep. Am I right? Thank you very much!

Answer 1

With awk:

$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|



$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf "%s|", $(i+j); printf "\n"}}' a

field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|

You can easily change the number of fields per line by adjusting the two 6s.
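As a rough sketch of the whole workflow (split, then grep), the loop can be wrapped in a function that takes the number of fields per record as a parameter. The function name split_records and the sample values are made up for illustration; only the field layout comes from the question.

```shell
# Hypothetical helper: split a single-line, pipe-delimited stream into
# records of n fields each (n defaults to 6, as in the question).
split_records() {
    awk -F'|' -v n="${1:-6}" '{
        # i + n - 1 <= NF skips the empty field after the trailing "|"
        for (i = 1; i + n - 1 <= NF; i += n) {
            for (j = 0; j < n; j++) printf "%s|", $(i + j)
            printf "\n"
        }
    }'
}

# Made-up sample data in the same shape as the question.
sample='ann|smith|111|a@x.org|222|ann.s|bob|jones|333|b@x.org|444|bob.j|'

# Split into records, then grep as usual.
printf '%s\n' "$sample" | split_records 6 | grep 'jones'
```

With the sample above this should print only the record that contains jones.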

Hope this helps!

Answer 2

You can use sed to split the line into multiple lines:

 sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt

Explanation:

We have to backslash-escape (, ), { and } heavily, which makes the code slightly unreadable.
But in short:
- the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1 will match:
  - [^|]* any character but '|', repeated zero or more times
  - | followed by a '|'
  - together these form one column, which is grouped with enclosing parentheses ( and )
  - the entire group is repeated 6 times: {6}
  - and this is again grouped with enclosing parentheses ( and ), to form one full record

The rest of the expression is easy to read:

- replace the above (the entire record of 6 fields) with \1\n, the part between / and /g
- \1 refers to the "first" group in the sed expression (the first group that is opened, so it is the entire record of 6 fields)
- \n is the newline character
- so each record of 6 fields is replaced by itself followed by a newline
- and this is done repeatedly (the trailing g)
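Once the records are on separate lines, grep can be anchored to a specific field. The sketch below looks for records whose second field matches a given value. It assumes GNU sed (the \n in the replacement is not portable to every sed), and the function name and sample data are invented for the example.

```shell
# Hypothetical helper: print the 6-field records whose 2nd field matches $1.
# Note that $1 is used as an extended regular expression, not a fixed string.
records_with_field2() {
    sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' |   # one record per line (GNU sed)
        grep -E "^[^|]*\|$1\|"             # field 1, a pipe, then the value
}

# Made-up sample data in the same shape as the question.
sample='ann|smith|111|a@x.org|222|ann.s|bob|jones|333|b@x.org|444|bob.j|'

printf '%s\n' "$sample" | records_with_field2 jones
```

Anchoring with ^[^|]*\| keeps the match on field2, so a jones that appears in an email or skype field would not be reported.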

Answer 3

You can use sed to insert a newline after every 6th |.

In my version of tcsh I can do:

sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename

Consider this, using a group size of 2 instead of 6 to keep the example short:

> cat bla
a1|b2|c3|d4|

> sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' bla
a1|b2|
c3|d4|

This is how the regex works:

[^|] is any non-| character.
[^|]\+ is a sequence of at least one non-| character.
[^|]\+| is a sequence of at least one non-| character followed by a |.
\([^|]\+|\) is a sequence of at least one non-| character followed by a |, grouped together.
\([^|]\+|\)\{6\} is 6 consecutive such groups.
\(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.

The replacement just takes this sequence of 6 groups and adds a newline to the end.
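One consequence of requiring at least one character per field (\+ rather than *) is worth checking against your data: an empty field, such as a missing skype name, stops the pattern from matching at that point. The sketch below compares the two quantifiers on a made-up line with an empty second field, again with a group size of 2 and assuming GNU sed.

```shell
line='a1||c3|d4|'   # made-up input; the second field is empty

# Count the non-empty output lines each pattern produces (group size 2).
with_plus=$(printf '%s\n' "$line" | sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' | grep -c .)
with_star=$(printf '%s\n' "$line" | sed 's/\(\([^|]*|\)\{2\}\)/\1\n/g' | grep -c .)

# With \+ the empty field prevents a split after "a1||", so only one
# line comes out; with * the line is split into two records.
printf 'with plus: %s line(s), with star: %s line(s)\n' "$with_plus" "$with_star"
```

If your files can contain empty fields, the * form is probably the safer choice.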

Answer 4

Here is how I would do it with awk:

awk -v RS="|" '{printf "%s%s", $0, (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|

Just adjust the 7 in NR%7 to the number of fields per record in your data (it is 7 here because the sample line has [...] as a seventh item in each group).
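A variant of the same idea for data shaped like the question's first example, where every field (including the last one of a record) is followed by a |, might look like the sketch below. Because RS makes awk read one field at a time, the whole multi-million-field line never has to be held as a single record. The function name and sample input are assumptions made for the example.

```shell
# Hypothetical helper: read one field per awk record (RS="|"), re-emit it
# with its pipe, and add a newline after every n-th field.
split_by_rs() {
    awk -v RS='|' -v n="${1:-6}" '
        $0 != "\n" {   # ignore the newline that ends the input file
            printf "%s|%s", $0, (NR % n ? "" : "\n")
        }'
}

# Made-up input: two records of two fields each.
printf '%s\n' 'a|b|c|d|' | split_by_rs 2
```

The output can then be piped into grep exactly as with the other approaches.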

Answer 5

What about printing the fields in blocks of six?

$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i+5<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z

Explanation

BEGIN{FS=OFS="|"} sets the input and output field separators to |.
{for (i=1; i+5<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of 6 and prints six of them each time. The i+5<=NF test stops the loop before the empty field that follows a trailing |, which would otherwise print an extra line of bare separators. Since print ends its output with a newline, each block lands on its own line.
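Since awk already sees each block of six fields, the search itself can also happen inside the loop, without a separate grep. This is only a sketch: find_in_field and its arguments are hypothetical names, and the sample data is invented.

```shell
# Hypothetical helper. Usage: find_in_field FIELD_NO REGEX [FIELDS_PER_RECORD]
# Prints every record whose FIELD_NO-th field matches REGEX.
find_in_field() {
    awk -F'|' -v OFS='|' -v col="$1" -v re="$2" -v n="${3:-6}" '{
        for (i = 1; i + n - 1 <= NF; i += n)
            if ($(i + col - 1) ~ re) {
                rec = $(i)                        # rebuild the record
                for (j = 1; j < n; j++) rec = rec OFS $(i + j)
                print rec
            }
    }'
}

# Made-up sample data in the same shape as the question.
sample='ann|smith|111|a@x.org|222|ann.s|bob|jones|333|b@x.org|444|bob.j|'

# Records whose surname (field 2) is exactly "jones".
printf '%s\n' "$sample" | find_in_field 2 '^jones$'
```

Matching on a single field avoids false hits from the same text appearing in another column.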

Answer 6

If you want to treat the file as multiple lines, make \n the field separator. For example, to get the 2nd column, just do:

tr \| \\n < input-file | sed -n 2p

To see which columns match a regex, do:

tr \| \\n < input-file | grep -n regex
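The line numbers that grep -n reports here count individual fields, so with a fixed record width they can be converted back into a record number and a field number with integer division and modulo. The sketch below assumes 6 fields per record by default; the function name locate and the sample data are made up.

```shell
# Hypothetical helper. Usage: locate REGEX [FIELDS_PER_RECORD]
# Turns "LINE:text" from grep -n into "record R, field F: text".
locate() {
    tr '|' '\n' | grep -n -- "$1" | awk -F: -v n="${2:-6}" '{
        k = $1                                   # 1-based field position
        printf "record %d, field %d: %s\n",
               int((k - 1) / n) + 1, (k - 1) % n + 1,
               substr($0, length($1) + 2)        # text after "LINE:"
    }'
}

# Made-up sample data in the same shape as the question.
sample='ann|smith|111|a@x.org|222|ann.s|bob|jones|333|b@x.org|444|bob.j|'

printf '%s\n' "$sample" | locate jones
```

For the sample above, jones is the 8th field overall, which maps to record 2, field 2.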
