Regex Must Match a Word (not to replace) AND a Pattern (to replace) in a Line

With a regex (PCRE or sed; Python is also fine, but please say which you are using), I want to remove every single-letter item together with its comma (/,.,/g), but only on lines that contain the word "Labels:".

So for example in these lines:

Labels: K,ltemittel,System,j,Vakuum,s
Another tags: a,b,xxx,c,yyy,z

to

Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z

What I've tried:

  • non-capturing group ("Labels:" still gets replaced as well)
  • lookahead and lookbehind (cannot use greedy quantifiers)
  • grouping with /(Labels:)*(,.,)/ (also matches on lines without "Labels:")

Using sed

$ sed '/Labels:/{s/,[A-Za-z]\>//g;s/\<[A-Za-z],//}' input_file
Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z

Explanation (Added By Tripleee)

On lines containing "Labels:", it looks for a comma followed by a letter and then an end-of-word boundary, i.e. the label after the comma is a single letter, and removes every such match. Then, by similar logic, it removes a remaining single-letter label that sits immediately before a comma (the leading one).
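
The same two-step logic can be sketched in Python with the standard re module. This is only an illustration, not part of the answer: the function name clean_labels_line is made up here, and \b stands in for sed's \> and \< (next to a letter they mean the same thing).

```python
import re

def clean_labels_line(line: str) -> str:
    """Hypothetical helper mirroring the two sed substitutions."""
    # Only touch lines that contain "Labels:", like the sed address.
    if "Labels:" not in line:
        return line
    # Step 1: remove every ",X" where X is a letter that ends a word.
    line = re.sub(r",[A-Za-z]\b", "", line)
    # Step 2: remove one remaining "X," where X is a letter that starts a word.
    line = re.sub(r"\b[A-Za-z],", "", line, count=1)
    return line

for text in ("Labels: K,ltemittel,System,j,Vakuum,s",
             "Another tags: a,b,xxx,c,yyy,z"):
    print(clean_labels_line(text))
```

Like the sed version, this relies on word boundaries, so items containing spaces or non-letter single characters are outside what it handles.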

You could potentially use:

(?i)(^(?!Labels:).*)|\b[a-z],|,[a-z]\b

See an online demo


  • (?i) - Set case-insensitive matching 'on';
  • ( - Open 1st capture group;
    • ^ - Start anchor (needs the multiline flag to match at the start of each line);
    • (?!Labels:) - Negative lookahead: assert the position is not followed by 'Labels:';
    • .* - Match (greedy) 0+ characters other than newline;
    • ) - Close 1st capture group;
  • | - Or;
  • \b[a-z], - Match a word-boundary followed by a single letter and a comma;
  • | - Or;
  • ,[a-z]\b - Match a comma followed by a single letter and a word-boundary.

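Now replace each match with your 1st capture group. Lines that do not start with 'Labels:' are captured whole and put back unchanged; on 'Labels:' lines the group is empty, so the single-letter items are deleted.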
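
As a rough sketch of how this pattern could be applied in Python (standard re module; the variable names are illustrative only), a replacement function returns group 1 when it took part in the match and an empty string otherwise:

```python
import re

# Same pattern as above; re.MULTILINE lets ^ match at each line start.
pattern = re.compile(r"(?i)(^(?!Labels:).*)|\b[a-z],|,[a-z]\b", re.MULTILINE)

text = ("Labels: K,ltemittel,System,j,Vakuum,s\n"
        "Another tags: a,b,xxx,c,yyy,z")

# Keep group 1 if it matched (a non-"Labels:" line), else delete the match.
result = pattern.sub(lambda m: m.group(1) or "", text)
print(result)
```

The callable replacement avoids depending on how a plain "\1" behaves when the group did not participate in the match.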

Another variation using gnu-awk.

For a line that starts with Labels:, replace a comma followed by a single character a-z or A-Z and a word boundary (or a word boundary followed by such a character and a comma) with an empty string.

awk '/^Labels:/{gsub(/,[a-zA-Z]\y|\y[a-zA-Z],/, "")};1' file

Output

Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z

As you have tagged Python and pcre, another option is to use the \G anchor, match Labels: at the start of the string, and capture in group 1 what you want to keep.

(?:^Labels:\h*|\G(?!^))\K(?:([^\s,]{2,}(?:,(?![a-z]$))?)|,?[a-z],?)

See a regex demo and a Python demo using the Python PyPI regex module.

Using Perl:

perl -lpe 's/(?:,[^,](?=,|$))+//g  if  s/^Labels:\s*\K(?:[^,](?:,|$))*//' file

After matching "Labels:" (which is \Kept), remove any leading single-character items. If that substitution matched, remove all other single-character items. This assumes that the "Labels:" part cannot contain single characters separated by commas.

$ cat file
Labels: K,ltemittel,a System z,j,Vakuum,s
Another tags: a,b,xxx,c,yyy,z
$ perl -lpe 's/(?:,[^,](?=,|$))+//g  if  s/^Labels:\s*\K(?:[^,](?:,|$))*//' file
Labels: ltemittel,a System z,Vakuum
Another tags: a,b,xxx,c,yyy,z

Note: System was changed to a System z in the above test. Solutions that rely on matching spaces or word boundaries may not deal with this input correctly.
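
For comparison, here is a minimal Python sketch that avoids word boundaries altogether by splitting the item list on commas. It is a different technique from the Perl one-liner above, not a translation of it, and the function name strip_single_char_items is hypothetical.

```python
import re

def strip_single_char_items(line: str) -> str:
    """Hypothetical helper: drop one-character items after 'Labels:'."""
    m = re.match(r"Labels:\s*", line)
    if m is None:
        return line  # not a Labels line, leave untouched
    prefix, rest = line[:m.end()], line[m.end():]
    # Keep every comma-separated item that is not exactly one character long.
    kept = [item for item in rest.split(",") if len(item) != 1]
    return prefix + ",".join(kept)

print(strip_single_char_items("Labels: K,ltemittel,a System z,j,Vakuum,s"))
print(strip_single_char_items("Another tags: a,b,xxx,c,yyy,z"))
```

Because items are compared by length rather than by word boundaries, "a System z" is kept intact.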

This might work for you (GNU sed):

sed -E '/Labels/{s/( )\S,|(,)\S,|,\S$/\1\2/g;s//\1\2/g}' file

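If a line contains Labels, the pattern tries three alternatives: a single non-space character and a comma after a space, a single non-space character between two commas, or a comma and a single non-space character at the end of the line. For the first two alternatives the match is replaced by the captured separator (the space or the leading comma); the third is simply deleted. The substitution is then repeated (the empty pattern // reuses the last regex) to catch matches that overlapped with the first pass.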
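
The "repeat for overlapping matches" idea can be sketched in Python as a loop that reapplies the substitution until the line stops changing. This is an assumption-laden illustration rather than a port of the sed command: the names are made up, and it loops until stable instead of running exactly two passes.

```python
import re

# Same three alternatives as the sed pattern.
PATTERN = re.compile(r"( )\S,|(,)\S,|,\S$")

def _keep_separator(m: re.Match) -> str:
    # Put back the captured space or comma; unmatched groups are None.
    return (m.group(1) or "") + (m.group(2) or "")

def drop_single_char_items(line: str) -> str:
    """Hypothetical helper: repeat the substitution until nothing changes."""
    if "Labels" not in line:
        return line
    while True:
        new_line = PATTERN.sub(_keep_separator, line)
        if new_line == line:
            return line
        line = new_line

print(drop_single_char_items("Labels: K,ltemittel,System,j,Vakuum,s"))
print(drop_single_char_items("Labels: x,a,b,c,yy"))
```

The second call shows why repetition matters: a run of adjacent single-character items needs more than one pass, because each match consumes the separator that the next item would need.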
