Find and replace (organize) using the command line?

Question

I have a huge txt file that's organized like this:

Test User</b></a>&nbsp;</td>
user@hotmail.com</a></td>
04-17-2012</span></td>
02-13-2013</span></td>
Another Test</b></a>&nbsp;</td>
fake@spam4.me</a></td>
11-06-2011</span></td>
11-09-2012</span></td>
Username123</b></a>&nbsp;</td>
email@test.com</a></td>
06-07-2011</span></td>
06-03-2013</span></td>
AdminTest</b></a>&nbsp;</td>
testing@gmail.com</a></td>
05-01-2012</span></td>
06-05-2014</span></td>

Here's how I want the list to actually look:

Test User,user@hotmail.com,04-17-2012,02-13-2013
Another Test,fake@spam4.me,11-06-2011,11-09-2012
Username123,email@test.com,06-07-2011,06-03-2013
AdminTest,testing@gmail.com,05-01-2012,06-05-2014

Is there any simple way to do this through the command line, or should I be trying a different route?

Answer 1

Using awk you can do:

awk -v OFS=, '{sub(/<.*$/, "")} NR%4==1{a=$0} NR%4==2{b=$0} NR%4==3{c=$0}
    NR%4==0{print a, b, c, $0}' file
Test User,user@hotmail.com,04-17-2012,02-13-2013
Another Test,fake@spam4.me,11-06-2011,11-09-2012
Username123,email@test.com,06-07-2011,06-03-2013
AdminTest,testing@gmail.com,05-01-2012,06-05-2014

The sub call strips everything from the first < to the end of each line, which removes the trailing HTML tags.
Using modulo arithmetic on NR, it collects the values from 4 consecutive lines.
When NR%4==0, it prints all four values.
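
Because every record in the sample spans exactly four lines, the same grouping can also be sketched without any counters: sed strips the tags and paste joins each run of four lines. This is an alternative to the awk command above, not part of the original answer; the function name join_four is made up for illustration, and it assumes the input never deviates from four lines per record.

```shell
# join_four: hypothetical helper (not from the original answer).
# Reads tag-laden lines on stdin, deletes everything from the first "<"
# onward, then joins every four consecutive lines with commas.
join_four() {
    sed 's/<.*//' | paste -d, - - - -
}

# Sample input taken from the question.
printf '%s\n' \
    'Test User</b></a>&nbsp;</td>' \
    'user@hotmail.com</a></td>' \
    '04-17-2012</span></td>' \
    '02-13-2013</span></td>' | join_four
```

For the sample above this prints Test User,user@hotmail.com,04-17-2012,02-13-2013. On a real file it would be run as join_four < file.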

Answer 2

Step 1 is to remove the XML-ish end-tag junk. That might be:

sed 's/<.*//'

Step 2 is to collect related lines into one. For that, I'd use awk. One issue is how many lines constitute an entry. Are there always just two dates, or could there be a variable number? Does a new entry always start with a capital letter, or should we just assume any letter? Does a user name ever start with a digit?

Assuming that a line with a leading capital letter starts a new entry, this accumulates an arbitrary number of email address lines and date lines after a user name line:

awk '/^[A-Z]/ { if (line != "") print line; line = $0; next }
              { line = line "," $0 }
     END      { if (line != "") print line }'
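
Put together, the two steps might be run as one pipeline. The sketch below wraps them in a function purely for illustration (the name collect_entries is hypothetical), and it keeps the same assumption that a leading capital letter marks the start of an entry. The second sample entry has three dates to show that the number of trailing lines can vary.

```shell
# collect_entries: hypothetical wrapper around the two steps above.
# Step 1 (sed) strips the trailing tags; step 2 (awk) starts a new
# output line at each line beginning with a capital letter and appends
# every other line to the current entry, separated by commas.
collect_entries() {
    sed 's/<.*//' |
    awk '/^[A-Z]/ { if (line != "") print line; line = $0; next }
                  { line = line "," $0 }
         END      { if (line != "") print line }'
}

printf '%s\n' \
    'Test User</b></a>&nbsp;</td>' \
    'user@hotmail.com</a></td>' \
    '04-17-2012</span></td>' \
    '02-13-2013</span></td>' \
    'AdminTest</b></a>&nbsp;</td>' \
    'testing@gmail.com</a></td>' \
    '05-01-2012</span></td>' \
    '06-05-2014</span></td>' \
    '07-01-2014</span></td>' | collect_entries
```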

It's a bit messy to run two commands, though, so we can get awk to clean up its own input with:

awk '         { sub(/<.*/, "") }
     /^[A-Z]/ { if (line != "") print line; line = $0; next }
              { line = line "," $0 }
     END      { if (line != "") print line }'

If the criterion for separating blocks of input lines is different (the key is not the leading capital letter), then the code can be changed accordingly.
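
As one possible variation, here is a sketch for the case where user names may begin with a lowercase letter or a digit. It assumes, purely for illustration, that a line starts a new entry when it contains no @ and is not a date of the form NN-NN-NNNN; the function name group_by_name is made up, and a user name that itself looks like a date or contains @ would be misclassified.

```shell
# group_by_name: hypothetical variant of the awk script above.
# After stripping the tags, a line that is neither an email address
# (contains "@") nor a date (NN-NN-NNNN) is treated as a user name
# and starts a new entry; everything else is appended to the entry.
group_by_name() {
    awk '{ sub(/<.*/, "") }
         !/@/ && !/^[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]$/ {
             if (line != "") print line; line = $0; next }
         { line = line "," $0 }
         END { if (line != "") print line }'
}

printf '%s\n' \
    'jdoe</b></a>&nbsp;</td>' \
    'j@example.org</a></td>' \
    '01-01-2010</span></td>' \
    '9lives</b></a>&nbsp;</td>' \
    'n@example.org</a></td>' \
    '02-02-2011</span></td>' \
    '03-03-2012</span></td>' | group_by_name
```

The date pattern spells out each digit rather than using interval expressions such as {2}, since not every awk implementation supports those.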
