
Regex with substitutions using sed|awk and groups

I have this input text

16789248,16789759,"AS24155 Asia Pacific Broadband Wireless Communications Inc"

I want this text

"AS24155","Asia Pacific Broadband Wireless Communications Inc"

This regex matches

 /(.*)(AS\d+)(\s)([^"]+).*/g

with this substitution "$2","$4"

I have to process 300k lines, and it would be best if I could use a Linux-based command-line utility like sed or awk, but I keep getting no matches or incorrect matches even though the regex seems to match elsewhere.

Should I be using something different?

sed -r can handle it with a few modifications: use [0-9] instead of \d and a literal space instead of \s. There's no real reason to capture the first and third parts, so I've removed those groups.

sed -r -e 's/.*(AS[0-9]+) ([^"]+).*/"\1","\2"/'

Or, if you want to match those character classes exactly, use [[:digit:]] for \d and [[:space:]] for \s:

sed -r -e 's/.*(AS[[:digit:]]+)[[:space:]]([^"]+).*/"\1","\2"/'
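As a quick sanity check, the first command can be run against the sample line from the question. This is only a sketch: it assumes a sed that accepts -E (GNU sed treats it as a synonym for -r), and the variable names line and result are made up for illustration.

```shell
# Sample line from the question (variable names are illustrative).
line='16789248,16789759,"AS24155 Asia Pacific Broadband Wireless Communications Inc"'

# -E enables extended regular expressions; GNU sed treats it like -r.
result=$(printf '%s\n' "$line" | sed -E 's/.*(AS[0-9]+) ([^"]+).*/"\1","\2"/')

printf '%s\n' "$result"
# prints: "AS24155","Asia Pacific Broadband Wireless Communications Inc"
```

To process a whole file, the same expression can be given a file argument and redirected, for example sed -E '...' input.txt > output.txt, where both file names are placeholders.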

Alternatively, you could use csvtool, which is better suited to parsing CSV files than sed is.

csvtool col 3 input.txt | while read -r number name; do
    printf '"%s","%s"\n' "$number" "$name"
done

sed 's/[^"]*"/"/;s/[[:space:]]/","/'

This is based on your sample and avoids the problem with groups.
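The group-free approach is two plain substitutions chained together, and it may be easier to follow when each step is run separately on the sample line. The sketch below assumes that the first double quote on the line opens the AS field and that the first whitespace character after it separates the AS number from the name; the variable names are illustrative.

```shell
line='16789248,16789759,"AS24155 Asia Pacific Broadband Wireless Communications Inc"'

# Step 1: replace everything up to and including the first double quote
# with a single double quote, which drops the two leading numeric columns.
step1=$(printf '%s\n' "$line" | sed 's/[^"]*"/"/')

# Step 2: turn the first whitespace character into the "," separator.
step2=$(printf '%s\n' "$step1" | sed 's/[[:space:]]/","/')

printf '%s\n%s\n' "$step1" "$step2"
# prints:
# "AS24155 Asia Pacific Broadband Wireless Communications Inc"
# "AS24155","Asia Pacific Broadband Wireless Communications Inc"
```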

sed is the best choice for this, but FYI, here's how you could use almost your exact regex in GNU awk to do the job:

$ awk 'match($0,/.*(AS[0-9]+)\s([^"]+).*/,a){printf "\"%s\",\"%s\"\n", a[1], a[2]}' file
"AS24155","Asia Pacific Broadband Wireless Communications Inc"
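The three-argument form of match() is a GNU awk extension, so the command above will not run in other awk implementations. If gawk is not available, one possible alternative is to split each line on the double quote and cut the quoted field at its first space. This is a sketch under the assumption that every line has exactly one quoted field shaped like the sample; the variable names are illustrative.

```shell
line='16789248,16789759,"AS24155 Asia Pacific Broadband Wireless Communications Inc"'

# With the double quote as field separator, $2 is the text inside the quotes.
# index() finds the first space, which separates the AS number from the name.
result=$(printf '%s\n' "$line" | awk -F'"' '{
    i = index($2, " ")
    printf "\"%s\",\"%s\"\n", substr($2, 1, i - 1), substr($2, i + 1)
}')

printf '%s\n' "$result"
# prints: "AS24155","Asia Pacific Broadband Wireless Communications Inc"
```

Unlike the regex versions, this does not check that the field starts with AS followed by digits, so lines with a different layout would need an extra guard.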

Your original command was probably failing because only some tools accept \s in place of [[:space:]], and almost none accept \d in place of [[:digit:]] (or [0-9]).
