I am grepping a file that occasionally contains words with a space between every letter.
For instance: h e l l o this is an e x a m p l e
I would like this to become: hello this is an example
I am open to any command-line tool that solves this problem. I would accept the risk of single-character words getting squashed together, since they occur very seldom in my files.
E.g.: h e l l o this is a r i s k I would take.
becoming: hello this is ariskI would take.
An example using node.js:
$ node -e "fs=require('fs'),fn='input.txt';fs.writeFileSync(fn,fs.readFileSync(fn,{encoding:'utf8'}).replace(/(?<=\b[a-z]) (?![A-Z]|\w\w\w)/g, ''));"
A space is removed if it follows a single lower-case letter at the start of a word and is not followed by a capital letter or by three consecutive word characters.
Using GNU awk for patsplit() and gensub():
$ awk '{
    numFlds = patsplit($0,flds,/\<([^ ] )+[^ ]\>/,seps)
    out = seps[0]
    for ( i=1; i<=numFlds; i++ ) {
        out = out gensub(/([^ ]) /,"\\1","g",flds[i]) seps[i]
    }
    print out
}' file
hello this is an examp le
Alternatively, still using GNU awk but now for the 3rd arg to match() and gensub():
$ awk '{
    while ( match($0,/\<(([^ ] )+[^ ]\>)(.*)/,a) ) {
        $0 = substr($0,1,RSTART-1) gensub(/([^ ]) /,"\\1","g",a[1]) a[3]
    }
    print
}' file
hello this is an examp le
You'd have to provide an algorithm explaining why le should be joined to the end of examp for that to happen.
I inserted a space before the last e of your examp le.
First, mark the complete words (two or more word characters) so they can be told apart from the isolated letters.
echo "h e l l o this is an e x a m p l e"| sed -r 's/\w\w+/=&=/g'
Result:
h e l l o =this= =is= =an= e x a m p l e
Now the space in front of every isolated character can be removed in a loop.
echo "h e l l o this is an e x a m p l e"|
sed -r 's/\w\w+/=&=/g;:a;s/( )([^ ])( |$)/\2\3/;ta'
Result:
hello =this= =is= =an=example
Next, replace the equal signs with spaces and collapse the resulting runs of spaces:
echo "h e l l o this is an e x a m p l e"|
sed -r 's/\w\w+/=&=/g;:a;s/( )([^ ])( |$)/\2\3/;ta;s/=/ /g;s/[ ][ ]+/ /g'
The equal sign may already occur in your input. Using \r as the marker instead makes the intermediate results harder to read, but it is the safer choice as long as the text contains no \r. And if an isolated I should be treated as a word, the solution is:
echo "h e l l o this is an e x a m p l e that I l i k e"|
sed -r 's/( I |\w\w+)/\r&\r/g;:a;s/( )([^ ])( |$)/\2\3/;ta; s/\r/ /g;s/[ ][ ]+/ /g'
Result:
hello this is an example that I like
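To make the point about the marker character concrete, here is a small sketch that runs both variants on an input containing a literal equal sign. The function names and the sample input are made up for illustration, and the commands assume GNU sed, since -r, \w and \r are GNU extensions.

```shell
# Sketch only: the two marker variants from above wrapped in hypothetical functions.
# Assumes GNU sed.
join_with_equals() {
  sed -r 's/\w\w+/=&=/g;:a;s/( )([^ ])( |$)/\2\3/;ta;s/=/ /g;s/[ ][ ]+/ /g'
}
join_with_cr() {
  sed -r 's/( I |\w\w+)/\r&\r/g;:a;s/( )([^ ])( |$)/\2\3/;ta; s/\r/ /g;s/[ ][ ]+/ /g'
}

echo "x=1 is s e t" | join_with_equals   # prints: x 1 is set  (the literal = is lost)
echo "x=1 is s e t" | join_with_cr       # prints: x=1 is set  (the literal = survives)
```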
Using awk can be easier:
echo "h e l l o this is an e x a m p l e that I l i k e"|
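awk '
BEGIN   { RS="[ \n]"; FS="" }                   # one record per word, one field per character
$0=="I" { printf("%sI ", sep); sep=""; next }   # an isolated I stays a word of its own
NF==1   { printf("%s",$0); sep=" " }            # single letter: append without a space
NF>1    { printf("%s%s ", sep, $0); sep="" }    # longer word: surround with spaces
END     { print "" }
'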
Every "word" becomes its own record, and a record of length 1 is printed without a space after it. An isolated I gets a special rule. The sep variable avoids printing two spaces between two multi-letter words such as this is.
Result:
hello this is an example that I like
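Because the newline is one of the record separators, the version above writes the whole input as a single line. If line breaks should be kept, one possible variant loops over the fields of each line instead. This is a hedged sketch rather than part of the answer: join_letters is a hypothetical name, only plain POSIX awk features are used, and the default field splitting collapses runs of spaces.

```shell
# Sketch only: join runs of single letters per line, keeping "I" as a word.
join_letters() {
  awk '{
    out = ""; prev_single = 0
    for (i = 1; i <= NF; i++) {
      # A field counts as a stray letter if it is one character long and not "I".
      single = (length($i) == 1 && $i != "I")
      # Put a space before every field except the first and except a stray
      # letter that directly follows another stray letter.
      if (i > 1 && !(single && prev_single)) out = out " "
      out = out $i
      prev_single = single
    }
    print out
  }'
}

printf '%s\n' 'h e l l o this is an e x a m p l e that I l i k e' 'a second l i n e' | join_letters
# prints:
# hello this is an example that I like
# a second line
```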