How to remove alternating whitespaces from file content

I am grepping a file that occasionally contains words with a space between every letter.

For instance: h e l l o this is an e x a m p le

I would like this to become: hello this is an example

I am open to any command-line tool that solves this problem. I would accept the risk of single-character words getting squashed, since they occur very rarely in my files.

E.g., h e l l o this is a r i s k I would take. would become hello this is ariskI would take.

Something like this would work:

(?:(?<=^)|(?<= ))([^ ]) (?=[^ ] )

https://regex101.com/r/yLccGg/1

An example using node.js:

$ node -e "fs=require('fs'),fn='input.txt';fs.writeFileSync(fn,fs.readFileSync(fn,{encoding:'utf8'}).replace(/(?<=\b[a-z]) (?![A-Z]|\w\w\w)/g, ''));"

A space is removed if it follows a lower-case letter that starts a word and is not followed by a capital letter or by three consecutive word characters.

Using GNU awk for patsplit() and gensub():

awk '{
    numFlds = patsplit($0,flds,/\<([^ ] )+[^ ]\>/,seps)
    out = seps[0]
    for ( i=1; i<=numFlds; i++ ) {
        out = out gensub(/([^ ]) /,"\\1","g",flds[i]) seps[i]
    }
    print out
}' file
hello this is an examp le

Alternatively, still using GNU awk but now for the 3rd arg to match() and gensub():

$ awk '{
    while ( match($0,/\<(([^ ] )+[^ ]\>)(.*)/,a) ) {
        $0 = substr($0,1,RSTART-1) gensub(/([^ ]) /,"\\1","g",a[1]) a[3]
    }
    print
}' file
hello this is an examp le

You'd have to provide an algorithm explaining why le should be joined to the end of examp for that to happen.

I inserted a space before the last e of your examp le.
First, you want to mark the complete words.

echo "h e l l o this is an e x a m p l e"| sed -r 's/\w\w+/=&=/g'

Result:

h e l l o =this= =is= =an= e x a m p l e

Now all the isolated characters can be joined in a loop.

echo "h e l l o this is an e x a m p l e"| 
  sed -r 's/\w\w+/=&=/g;:a;s/( )([^ ])( |$)/\2\3/;ta'

Result:

hello =this= =is= =an=example

Next, replace the equals signs with spaces and squeeze repeated spaces:

echo "h e l l o this is an e x a m p l e"| 
  sed -r 's/\w\w+/=&=/g;:a;s/( )([^ ])( |$)/\2\3/;ta;s/=/ /g;s/[ ][ ]+/ /g'
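The output of this final command is not shown above. As a quick check, here is the same sed script wrapped in a function, assuming GNU sed (for -r and \w). The function name join_with_markers is hypothetical.

```shell
#!/usr/bin/env bash

# join_with_markers (hypothetical name): the sed script from above, unchanged.
#   1. wrap every word of two or more characters in '=' markers
#   2. loop: drop the space in front of any isolated character
#   3. turn the markers into spaces and squeeze runs of spaces
join_with_markers() {
  sed -r 's/\w\w+/=&=/g;:a;s/( )([^ ])( |$)/\2\3/;ta;s/=/ /g;s/[ ][ ]+/ /g'
}

printf '%s\n' 'h e l l o this is an e x a m p l e' | join_with_markers
# prints: hello this is an example

# An '=' that is already in the input is also turned into a space:
printf '%s\n' 'h i 2=2' | join_with_markers
# prints: hi 2 2
```

The second call shows why '=' is a fragile marker: any equals sign in the original text is lost.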

The equals sign can be part of your string. Using \r as the marker makes the intermediate results harder to read, but it is safer as long as the text contains no \r. And if you think an isolated I should be treated as a word, the solution is:

echo "h e l l o this is an e x a m p l e that I l i k e"|
  sed -r 's/( I |\w\w+)/\r&\r/g;:a;s/( )([^ ])( |$)/\2\3/;ta; s/\r/ /g;s/[ ][ ]+/ /g'

Result:

hello this is an example that I like

Using awk can be easier:

echo "h e l l o this is an e x a m p l e that I l i k e"|
  awk '
    BEGIN { RS="[ \n]"; FS="" }
    NF==1 { printf("%s",$0); sep=" " }
    $0=="I" { printf(" ")}
    NF>1 { printf("%s%s ", sep, $0); sep =""}
    END {print ""}
  '

Each "word" becomes its own record, and a record of length 1 is printed without a space after it. There is a special rule for an isolated I. The sep variable avoids printing two spaces between two multi-letter words such as this is.
Result:

hello this is an example that I like
