How to properly match TAB in Regex using Perl?

I have the following test.txt file:

# Example:
# Comments
# Comments

MC

Attribute 1
Attribute 2
Attribute 3


---

MC

Attribute 1
Attribute 2
Attribute 3

---

MC 

Attribute 1
Attribute 2
Attribute 3

I want to perform the following steps:

  1. Remove comments
  2. Remove empty lines
  3. Replace \n by \t
  4. Turn \t---\t into a \n

So that I achieve the following:

MC <TAB> Attribute 1 <TAB> Attribute 2 <TAB> Attribute 3
MC <TAB> Attribute 1 <TAB> Attribute 2 <TAB> Attribute 3
MC <TAB> Attribute 1 <TAB> Attribute 2 <TAB> Attribute 3

For some reason the following doesn't work:

perl -pe "s/#.*//g; s/^\n//g; s/\n/\t/g; s/\t---\t/\n/g" test.txt

It produces the output:

MC  Description --- MC  Description --- MC  Description

If I just run the following instead:

perl -pe "s/#.*//g; s/^\n//g; s/\n/\t/g;" test.txt

I also get:

MC  Description --- MC  Description --- MC  Description

It appears that the last command in s/#.*//g; s/^\n//g; s/\n/\t/g; s/\t---\t/\n/g is not working.

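You say you're replacing \t---\t, but that never appears in what the regex sees. With -p, Perl runs your code once per line, so by the time s/\t---\t/\n/g runs, $_ holds only ---\t. The tab you expect in front of it belongs to the previous line, which has already been printed.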

If you want to match a line that has only whitespace and --- on it, use ^\s*---\s*$ .

perl -pe "s/#.*//g; s/^\n//g; s/\n/\t/g; s/^\s*---\s*$/\n/g" test.txt

Note that this will leave you with no newline at the end of the file if there is no final --- .


If you want to process the whole file at once, use -0 . -0 controls the "input record separator", which Perl uses to decide what counts as a line. -0 alone sets it to the null character, so (assuming the file contains no null bytes) Perl reads the whole file as a single record.

Then your original almost works. You need to add a /m so that ^ matches the beginning of a line as well as the beginning of a string.

perl -0pe "s/#.*//g; s/^\n//mg; s/\n/\t/g; s/\t---\t/\n/g" test.txt

But we can make this simpler! The input record separator separates records. Your record separator is ---\n , so we can set it to that and process each record individually.

To set the input record separator to a string, we use $/ . And to do this in a one-liner, we put it in a BEGIN block so it is run only once when the program starts, not for every line.

Finally, we use -l both to automatically strip the record separator, which is ---\n , and to add a newline to the end of each record. That is, it effectively adds a chomp at the start and a $_ .= "\n" at the end.

# Set the input record separator to ---\n.
# -l turns on autochomp to strip the separator.
# -l also adds a newline to each line.
# Strip comments.
# Strip blank lines (again, using /m so ^ works)
# Turn newlines into tabs.
perl -lpe 'BEGIN { $/ = "---\n" } s/#.*//mg; s/^\s*\n//mg; s/\n/\t/g;' test.txt

As a bonus, we get newlines on every line, including the last.


Finally, we can instead handle this using arrays. Same basic idea as before, but we split each record back into lines and use grep to filter out unwanted lines. Then we're left with a simple join.

I'll write this one out long-hand so it's easier to read.

#!/usr/bin/perl -lp
# Adjust the path above if perl lives elsewhere; the -l and -p switches are read from that line.

BEGIN { $/ = "---\n" }

# Split into lines.
# Strip comment lines.
# Strip blank lines.
# Join back together with tabs.
$_ = join "\t",
  grep /\S/,
  grep !/^#.*/,
  split /\n/, $_;

I find this approach more maintainable; it's easier to deal with an array of lines than everything mashed together in a multi-line string.
