Remove leading and trailing numbers from string, while leaving 2 numbers, using sed or awk

I have a file containing lines like:

353451word2423157
anotherword
7412yetanother1
3262andherese123anotherline4359013
5342512354325324523andherese123anotherline45913
532453andherese123anotherline413

I'd like to strip most of the leading and trailing digits (0-9), while still leaving up to 2 leading and 2 trailing digits in place, if any.

To clarify, for the list above, the expected output would be:

51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41

Preferred tools would be sed or awk, but any other suggestions are welcome.

I've tried something like sed 's/[0-9]\+$//' | sed 's/^[0-9]\+//', but obviously this strips all leading and trailing numbers.

You may try this sed:

sed -E 's/^[0-9]+([0-9]{2})|([0-9]{2})[0-9]+$/\1\2/g' file

51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41

Command details:

  • ^[0-9]+([0-9]{2}) : Match 1+ digits at the start if they are followed by 2 digits (captured in a group), and replace the match with the 2 digits in group #1.
  • ([0-9]{2})[0-9]+$ : Match 1+ digits at the end if they are preceded by 2 digits (captured in a group), and replace the match with the 2 digits in group #2.
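
Where `sed -E` is not available, the same idea can be sketched with POSIX basic regular expressions, using one expression per side instead of an alternation. This is an illustrative variant rather than the command from the answer, and the function name `trim_digits_sed` is made up for the example.

```shell
# Hypothetical helper: keep at most 2 digits at each end of every line.
# Uses only POSIX BRE syntax (\{1,\} instead of +, \( \) for groups).
trim_digits_sed() {
    sed -e 's/^[0-9]\{1,\}\([0-9]\{2\}\)/\1/' \
        -e 's/\([0-9]\{2\}\)[0-9]\{1,\}$/\1/' "$@"
}

# Reads from stdin when no file name is given.
printf '%s\n' 353451word2423157 anotherword 7412yetanother1 | trim_digits_sed
```

Because each side has its own expression, no `g` flag is needed: each substitution can match at most once per line.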

I suggest using perl:

perl -pe 's/^\d+(?=\d{2})|(\d{2})\d+$/$1/g' file

See the online demo and the regex demo.

Regex details:

  • ^ - start of string
  • \d+ - one or more digits
  • (?=\d{2}) - on the right, there must be two digits (not added to the match, as the lookahead is a non-consuming pattern)
  • | - or
  • (\d{2}) - two digits captured into Group 1 ( $1 )
  • \d+ - one or more digits
  • $ - end of string.

Here is an awk that trims digits to a max of 2 on each side of a string:

awk '{  match($0, /^[0-9]*/); lh=RLENGTH
        s=substr($0, lh>2 ? lh-1 : 1)
        match(s, /[0-9]*$/); rh=RLENGTH
        print substr(s, 1, rh>2 ? length(s)-rh+2 : length(s))
}' file

Prints:

51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41
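
The same `match()`/`substr()` approach can be sketched as a small shell function in which the number of digits to keep is a parameter. The function name `trim_digits_awk` and the variable `keep` are assumptions made for this example. It uses `+` rather than `*` in the patterns, so a missing run of digits is reported as "no match" instead of an empty match.

```shell
# Hypothetical helper: keep at most $1 digits (default 2) at each end of a line.
trim_digits_awk() {
    awk -v keep="${1:-2}" '
    {
        s = $0
        # Length of the leading run of digits, or 0 if there is none.
        lead = match(s, /^[0-9]+/) ? RLENGTH : 0
        if (lead > keep) s = substr(s, lead - keep + 1)
        # Length of the trailing run of digits, or 0 if there is none.
        trail = match(s, /[0-9]+$/) ? RLENGTH : 0
        if (trail > keep) s = substr(s, 1, length(s) - trail + keep)
        print s
    }'
}

printf '%s\n' 353451word2423157 anotherword 7412yetanother1 | trim_digits_awk
```

Calling `trim_digits_awk 1` instead would keep a single digit on each side.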

Using the GNU awk gensub() function, with parentheses in the regexp to mark the components, which are then referenced in the replacement (here "\\2\\3"):

awk '{print gensub(/^([[:digit:]]*)([[:digit:]]{2})|([[:digit:]]{2})([[:digit:]]*)$/,"\\2\\3","g",$0)}' file
51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41

I would use GNU AWK in the following way. Let file.txt contain:

353451word2423157
anotherword
7412yetanother1
3262andherese123anotherline4359013
5342512354325324523andherese123anotherline45913
532453andherese123anotherline413

then

awk 'BEGIN{FPAT="[0-9]+|[^0-9]+";OFS=""}$1~/[0-9]+/{$1=substr($1,length($1)-1)}$NF~/[0-9]+/{$NF=substr($NF,1,2)}{print}' file.txt

Output:

51word24
anotherword
12yetanother1
62andherese123anotherline43
23andherese123anotherline45
53andherese123anotherline41

Explanation: I instruct GNU AWK to split each line into fields that consist solely of digits or solely of non-digits, using FPAT. If the first field ( $1 ) consists of digits, I slice it to keep its last 2 characters. If the last field ( $NF ) consists of digits, I slice it to keep its first 2 characters. Finally, the whole line is printed using the empty string as the output field separator ( OFS ).

(tested in gawk 4.2.1)
