grep for (curly|microsoft|smart) quotes

I have a huge folder filled with XML documents, some of which may break because they contain curly quotes (also known as Microsoft Word quotes or smart quotes). I just want to run a quick check to see what I'm up against. Does anybody know how to grep for them so I can easily find the offenders?

Edit

Here's a simplified example.

<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>Pretend this is a curly quote: '</item>
</items>

Curly quotes have the following Unicode code points and UTF-8 sequences:

Name                                     CodePoint     UTF-8 sequence
----                                     ---------     --------------
LEFT SINGLE QUOTATION MARK               U+2018        0xE2 0x80 0x98
RIGHT SINGLE QUOTATION MARK              U+2019        0xE2 0x80 0x99
SINGLE LOW-9 QUOTATION MARK              U+201A        0xE2 0x80 0x9A
SINGLE HIGH-REVERSED-9 QUOTATION MARK    U+201B        0xE2 0x80 0x9B 
LEFT DOUBLE QUOTATION MARK               U+201C        0xE2 0x80 0x9C
RIGHT DOUBLE QUOTATION MARK              U+201D        0xE2 0x80 0x9D
DOUBLE LOW-9 QUOTATION MARK              U+201E        0xE2 0x80 0x9E
DOUBLE HIGH-REVERSED-9 QUOTATION MARK    U+201F        0xE2 0x80 0x9F
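
The byte sequences in the table above can be double-checked by encoding each code point yourself. The following is a minimal Python sketch; the helper name `utf8_hex` is made up for illustration and is not part of any answer in this thread.

```python
import unicodedata

def utf8_hex(code_point):
    """Return the UTF-8 bytes of a code point, formatted like the table above."""
    return " ".join(f"0x{b:02X}" for b in chr(code_point).encode("utf-8"))

# Code points U+2018 through U+201F, the same eight rows as the table.
for cp in range(0x2018, 0x2020):
    print(f"{unicodedata.name(chr(cp)):<40} U+{cp:04X}        {utf8_hex(cp)}")
```

All eight characters share the prefix 0xE2 0x80 and differ only in the last byte (0x98 through 0x9F), which is what makes a single byte-range pattern possible.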

XML is usually stored in UTF-8, so you could just search directly for the byte sequence.

You can find files containing the UTF-8 sequences dalle mentioned like this:

grep -r -P "\xE2\x80\x9C" .

The -r makes it recursive and the -P tells grep to use Perl-compatible regular expressions.
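
The grep command above matches only one of the eight sequences (0xE2 0x80 0x9C, the left double quotation mark). If a grep with -P support is not available, or you want all eight at once, a byte-level scan can be sketched in Python. This is only an illustration under the assumption that the files are UTF-8; `files_with_curly_quotes` and `CURLY_QUOTE_BYTES` are hypothetical names.

```python
import os
import re

# Matches the UTF-8 bytes of U+2018..U+201F (E2 80 98 through E2 80 9F).
CURLY_QUOTE_BYTES = re.compile(rb"\xE2\x80[\x98-\x9F]")

def files_with_curly_quotes(root):
    """Return sorted paths under root whose raw bytes contain a curly quote."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    if CURLY_QUOTE_BYTES.search(fh.read()):
                        hits.append(path)
            except OSError:
                continue  # unreadable file: skip it
    return sorted(hits)
```

Calling `files_with_curly_quotes(".")` returns a list of paths, roughly what a recursive grep with -l would print.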

MIGHT BE A DUPLICATE

I had a situation where users could copy and paste strings from anywhere, and I had to accept an entry that excluded every special character except quotes, whether smart, fancy, or straight. To illustrate:

Text     | Error
----------------
O*Connor | Yes
O'Connor | No
O’Connor | No

I came up with the solution below for my CF code.

<cfif #REFind("[[:punct:],[:digit:]]",textName)# GT 0 >
    <!--- Strip everything except letters, quotes, primes, hyphens and spaces --->
    <cfset temp_name = textName.ReplaceAll(JavaCast( "string", "[^A-Za-z\u2018\u2019\u201A\u201B\u2032\u2035\'\-\ ]" ),JavaCast( "string", "" )) >
    <cfif (len(temp_name) EQ len(textName)) >
        <!--- If you find single quote or hyphen, do nothing --->
    <cfelse>
        <cfset errormsg = "The text contains a special character">
    </cfif>
</cfif>
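
For readers who do not use ColdFusion, the same allow-list idea can be sketched in Python. This is an illustrative translation under the assumption that the character class in the ReplaceAll call is the full allow-list; `has_special_character` and `DISALLOWED` are hypothetical names.

```python
import re

# Same allow-list as the ColdFusion pattern: ASCII letters, straight and curly
# single quotes (U+2018..U+201B), primes (U+2032, U+2035), hyphen, and space.
DISALLOWED = re.compile("[^A-Za-z\u2018\u2019\u201A\u201B\u2032\u2035'\\- ]")

def has_special_character(text_name):
    """Return True if text_name contains any character outside the allow-list."""
    return DISALLOWED.search(text_name) is not None

# The three rows of the example table.
for name in ["O*Connor", "O'Connor", "O\u2019Connor"]:
    print(name, "->", "error" if has_special_character(name) else "ok")
```

Searching for a disallowed character directly avoids the compare-the-lengths step, but the accepted and rejected inputs should be the same.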

Immense help from: http://axonflux.com/handy-regexes-for-smart-quotes

I am on a Mac, and the built-in grep didn't work for me right away (see neubert's answer). I ended up installing Homebrew's version of GNU grep:

brew tap homebrew/dupes
brew install homebrew/dupes/grep

Then I could run the commands in a similar fashion:

ggrep -r -P "\xE2\x80\x9C" .
etc.

I ended up combining dalle's and neubert's answers into this script, which runs all of the cases that I currently know about and prints every match.
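
The script itself is not reproduced in this post. As a rough sketch of the same idea (not the original script), the following Python walks a directory and reports each of the eight quote characters from dalle's table by name. It assumes the files are UTF-8; `report_curly_quotes` and `CURLY_QUOTES` are hypothetical names.

```python
import os
import unicodedata

# The eight quote characters from the table (U+2018..U+201F).
CURLY_QUOTES = [chr(cp) for cp in range(0x2018, 0x2020)]

def report_curly_quotes(root):
    """Yield (path, line_number, character_name) for each quote type on each line."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going on non-UTF-8 files.
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        for ch in CURLY_QUOTES:
                            if ch in line:
                                yield path, lineno, unicodedata.name(ch)
            except OSError:
                continue  # unreadable file: skip it
```

Something like `for path, lineno, name in report_curly_quotes("."): print(f"{path}:{lineno}: {name}")` would then print one line per hit.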

If they're XML documents, you could open one that you know contains the offending quotes to see exactly what they look like in the XML file (and copy them to the clipboard, if you can't reproduce them easily with your keyboard).

Assuming that your quotes look like ” or „, you could do something like sed -i.bak 's/[”„]/"/g' file1 file2 ... (on Linux, OS X, or Cygwin on Windows) to quickly replace the offending quotes with straight quotes, modifying the files in place and keeping a .bak copy of each original.
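
If sed is not available, or its -i option behaves differently on your platform, the same in-place substitution can be sketched in Python. This is a minimal illustration that assumes UTF-8 files and covers all eight characters from the table earlier in the thread; `straighten_quotes` and `STRAIGHTEN` are hypothetical names.

```python
import shutil

# Map each curly quote from the table to its straight ASCII counterpart.
STRAIGHTEN = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201A": "'", "\u201B": "'",
    "\u201C": '"', "\u201D": '"', "\u201E": '"', "\u201F": '"',
})

def straighten_quotes(path):
    """Rewrite path with straight quotes, keeping the original as path + '.bak'."""
    shutil.copy2(path, path + ".bak")
    # newline="" preserves the file's existing line endings.
    with open(path, encoding="utf-8", newline="") as fh:
        text = fh.read()
    with open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write(text.translate(STRAIGHTEN))
```

As with the sed approach, a straight double quote that lands inside a double-quoted XML attribute value would itself break the file, so it is probably worth checking where the curly quotes occur before replacing them.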
