How to properly match TAB in Regex using Perl?

Question

I have the following test.txt file

# Example:
# Comments
# Comments

MC

Attribute 1
Attribute 2
Attribute 3


---

MC

Attribute 1
Attribute 2
Attribute 3

---

MC 

Attribute 1
Attribute 2
Attribute 3

I want to perform the following steps:

Remove comments
Remove empty lines
Replace \n with \t
Turn \t---\t into \n

So that I achieve the following

MC <TAB> Attribute 1 <TAB> Attribute 2 <TAB> Attribute 3
MC <TAB> Attribute 1 <TAB> Attribute 2 <TAB> Attribute 3
MC <TAB> Attribute 1 <TAB> Attribute 2 <TAB> Attribute 3

For some reason the following doesn't work

perl -pe "s/#.*//g; s/^\n//g; s/\n/\t/g; s/\t---\t/\n/g" test.txt

Producing the output

MC  Description --- MC  Description --- MC  Description

If I just run the following instead

perl -pe "s/#.*//g; s/^\n//g; s/\n/\t/g;" test.txt

I also get

MC  Description --- MC  Description --- MC  Description

It appears that the last command in s/#.*//g; s/^\n//g; s/\n/\t/g; s/\t---\t/\n/g is not working.

Answer 1

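You say you're replacing \t---\t, but that never appears in what the regex sees. With -p, Perl runs your code once per line, so after s/\n/\t/g the separator line is just ---\t, with no tab in front of it.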

If you want to match a line that has only whitespace and --- on it, use ^\s*---\s*$.

perl -pe "s/#.*//g; s/^\n//g; s/\n/\t/g; s/^\s*---\s*$/\n/g" test.txt

Note that this will leave you with no newline at the end of the file if there is no final ---.

If you want to process the whole file at once, use -0. -0 controls the "input record separator", which Perl uses to decide what a line is. -0 alone sets it to the null byte, so (assuming the file contains no null bytes) Perl will read the whole file as a single record.

Then your original almost works. You need to add a /m so that ^ matches the beginning of a line as well as the beginning of a string.

perl -0pe "s/#.*//g; s/^\n//mg; s/\n/\t/g; s/\t---\t/\n/g" test.txt

But we can make this simpler! The input record separator separates records. Your record separator is ---\n, so we can set it to that and process each record individually.

To set the input record separator to a string, we use $/. And to do this in a one-liner, we put it in a BEGIN block so it is run only once when the program starts, not for every line.

Finally, we use -l to both automatically strip the record separator, which is ---\n, and to add a newline to the end of each line. That is, it adds a chomp at the start and a $_ .= "\n" at the end.

# Set the input record separator to ---\n.
# -l turns on autochomp to strip the separator.
# -l also adds a newline to each line.
# Strip comments.
# Strip blank lines (again, using /m so ^ works)
# Turn newlines into tabs.
perl -lpe 'BEGIN { $/ = "---\n" } s/#.*//mg; s/^\s*\n//mg; s/\n/\t/g;' test.txt

As a bonus, we get newlines on every line, including the last.

Finally, we can instead handle this using arrays. Same basic idea as before, but we split them back into lines and use grep to filter out unwanted lines. Then we're left with a simple join.

I'll write this one out long-hand so it's easier to read.

#!/usr/bin/perl -lp

BEGIN { $/ = "---\n" }

# Split into lines.
# Strip comment lines.
# Strip blank lines.
# Join back together with tabs.
$_ = join "\t",
  grep /\S/,
  grep !/^#.*/,
  split /\n/, $_;

I find this approach more maintainable; it's easier to deal with an array of lines than everything mashed together in a multi-line string.

solution1
4 ACCEPTED 2020-06-23 18:16:43