
Remove Word smart quotes from a text file using vim

I have a large text file, originally generated in Microsoft Word, that contains these four character sequences, alongside regular text:

?~@~\
?~@~]
?~@~X
?~@~Y

From the content of what is written in the file, it appears that the sequences respectively correspond to open double quotes, close double quotes, open single quote, and close single quote. When displayed in Vim, everything in the sequences other than the question mark appears in blue.

I cannot remove them with a command such as

:.,$s/?~@~Y//

This command results in the following error from vim:

E33: No previous substitute regular expression
E476: Invalid command
Press ENTER or type command to continue

These commands also produce errors:

:.,$s/\?~@~Y//
:.,$s/\?\~\@\~Y//

Specifically,

E866: (NFA regexp) Misplaced ?
E476: Invalid command
Press ENTER or type command to continue

What would be the correct way to automatically remove or replace the sequences? Ideally, I'd like to remove the double quotes, and replace the open/close single quotes with a traditional single quote or apostrophe.

Since "everything in the sequences other than the question mark appears in blue", all characters except the question mark are probably non-printable bytes, which are awkward to type into a pattern by hand. I'd suggest this approach:

  • go to the first sequence and yank it: press v to start marking, extend the mark to the end of the sequence, then press y
  • paste the sequence from the unnamed register into the search pattern: type :%s/ then press Ctrl-r followed by " and finish with //g Enter
  • repeat for the remaining sequences.

Sorry to bump an old thread but I stumbled upon this late at night while trying to figure out how to remove the exact same characters from a bind9 configuration file that I had pasted in from a website. The aberrant characters were "~@~X", "~@~Y", " | ", and I believe another but I can't remember it at the moment. Anyway, regular expressions couldn't seem to find and replace using the above methods, but I was able to find a solution.

If you can set VIM to show the special characters in their binary representation, then you can use regex to find that. Here's how I did it:


Steps to fix

  1. Open the file with the problem characters in VIM

    • (a) original method - :set encoding=latin1|set isprint=|set display+=uhex
    • (b) easier method - :set encoding=utf-8

NOTE: either of these should display the problem characters as hex byte codes in angle brackets (e.g. <80>, <99>, ...)

  2. Then search and replace with VIM regex like so

    :%s:\%xNN:':g

    Replace NN with the byte code (e.g. 80, 99, etc.).

Let's break that command down, shall we:

  • %s: - the substitute command. The % is a range meaning every line in the file, and the 's' stands for substitute. The ':' (colon) has been used as the delimiter in this case, but you can use other symbols to delimit the command.

  • \%x - introduces a character given by its two-digit hex code, i.e. the two characters shown between the angle brackets.

  • NN - replace with the two chars inside of the <> that you're looking to replace in your file. In my case, the byte codes were <e2>, <80>, <99>, which I had to search for separately.

  • :' - then, the colon that starts the replacement text, where I'm specifying a single quote to replace the byte code; you could put whatever text you want here.

  • :g - finally, the last colon delimiter and the 'g' flag, which replaces every match on a line instead of only the first one.
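
As a concrete instance of the command above, here is a minimal sketch for the four sequences from the question. It assumes the display from step 1(a), where every byte is treated as its own character, so <e2><80><99> can be matched as three consecutive \%x items. Chaining them into one pattern is an assumption on my part rather than what the steps above did, and it will not match if Vim has decoded the file as UTF-8.

```vim
" Assumes each byte is shown separately, e.g. <e2><80><99>.
" Closing and opening single quotes become a plain apostrophe.
:%s/\%xe2\%x80\%x99/'/ge
:%s/\%xe2\%x80\%x98/'/ge
" Opening (<e2><80><9c>) and closing (<e2><80><9d>) double quotes are deleted.
:%s/\%xe2\%x80\%x9c//ge
:%s/\%xe2\%x80\%x9d//ge
```

The added 'e' flag only suppresses the "pattern not found" error, so the four commands can be run back to back even if one kind of quote is absent.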


You can do more research in VIM's help with:

:help isprint

Anyway, I hope this helps someone else in the future.


References:


  1. https://blog-en.openalfa.com/how-to-edit-non-printing-and-unicode-characters-in-vim-editor

  2. https://unix.stackexchange.com/questions/108020/can-vim-display-ascii-characters-only-and-treat-other-bytes-as-binary-data

  3. VIM How do I search for a <XX> single byte representation

If you're using a unicode-compatible encoding (such as utf-8) and your font supports it, the smart quotes will show properly.

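Additionally, the digraphs for them are 6', 6", 9', and 9", each entered by pressing Ctrl-K followed by the two characters. This makes it pretty easy to chain a couple of substitutes to swap them for straight variants (<C-k> below stands for pressing Ctrl-K):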

%s/<C-k>6'\|<C-k>9'/'/g

Etc. Wrap it in a function or command to make it easier for later.
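
One possible way to wrap it, as a sketch for a vimrc: the command name FixSmartQuotes is made up for illustration, and the patterns use \%u code points instead of typed digraphs so the line can be pasted as-is. It assumes 'encoding' is utf-8 and the file was read as UTF-8, and it follows the question's wish to drop the double quotes and turn the single ones into apostrophes.

```vim
" Hypothetical command name; assumes the buffer holds properly decoded UTF-8.
" First :s deletes the curly double quotes (U+201C, U+201D).
" Second :s turns the curly single quotes (U+2018, U+2019) into '.
" The 'e' flag keeps :s quiet when a pattern has no match.
command! -range=% FixSmartQuotes
      \ <line1>,<line2>s/\%u201c\|\%u201d//ge |
      \ <line1>,<line2>s/\%u2018\|\%u2019/'/ge
```

With that in place, :FixSmartQuotes cleans the whole file, and :.,$FixSmartQuotes limits it to the range used in the question.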
