How can I find strings that have mixed case with Perl?

I'm trying to filter thousands of files, looking for those which contain string constants with mixed case. Such strings can be padded with whitespace inside the quotes, but may not contain whitespace themselves. So the following (containing UC chars) are matches:

"  AString "   // leading and trailing spaces together allowed
"AString "     // trailing spaces allowed
"  AString"    // leading spaces allowed
"newString03"  // numeric chars allowed
"!stringBIG?"  // non-alphanumeric chars allowed
"R"            // Single UC is a match

but these are not:

"A String" // not a match because it contains an embedded space
"Foo bar baz" // does not match due to multiple whitespace interruptions
"a_string" // not a match because there are no UC chars

I still want to match on lines which contain both patterns:

"ABigString", "a sentence fragment" // need to catch so I find the first case...

I want to use Perl regexps, preferably driven by the ack command-line tool. Obviously, \w and \W are not going to work. It seems that \S should match the non-space chars. I can't seem to figure out how to embed the requirement of "at least one upper-case character per string"...

ack --match '\"\s*\S+\s*\"'

is the closest I've gotten. I need to replace the \S+ with something that captures the "at least one upper-case (ASCII) character (in any position of the non-whitespace string)" requirement.

This is straightforward to program in C/C++ (and yes, in Perl, procedurally, without resorting to regexps); I'm just trying to figure out if there is a regular expression which can do the same job.

The following pattern passes all your tests:

qr/
  "      # opening double quote

  (?!    # filter out strings with internal spaces
     [^"]*   # zero or more non-quotes
     [^"\s]  # neither a quote nor whitespace
     \s+     # internal whitespace
     [^"\s]  # another non-quote, non-whitespace character
  )

  [^"]*  # zero or more non-quote characters
  [A-Z]  # at least one uppercase letter
  [^"]*  # followed by zero or more non-quotes
  "      # and finally the trailing quote
/x
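To see what the negative lookahead rejects, here is a minimal sketch that runs only the lookahead's body against the text following an opening quote. The helper name has_internal_space is hypothetical and is not part of the answer; the sub-pattern itself is copied from the lookahead above.

```perl
use strict;
use warnings;

# Body of the negative lookahead, anchored at the start of the text that
# follows the opening quote.
my $internal_space = qr/\A[^"]*[^"\s]\s+[^"\s]/;

# Hypothetical helper: true when the string body has whitespace between two
# non-whitespace, non-quote characters.
sub has_internal_space {
    my ($after_quote) = @_;
    return $after_quote =~ $internal_space ? 1 : 0;
}

for my $tail ('  AString "', 'A String"', 'Foo bar baz"', 'R"') {
    printf "%-16s %s\n", $tail,
        has_internal_space($tail) ? "rejected" : "allowed";
}
```

Leading and trailing spaces never satisfy the sub-pattern, because the whitespace run must sit between two characters that are neither quotes nor whitespace.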

Using this test program (which uses the above pattern without /x, and therefore without whitespace and comments) as input to ack-grep (as ack is called on Ubuntu)

#! /usr/bin/perl

my @tests = (
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"A String">     => 0 ],
  [ q<"a_string">     => 0 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
  [ q<"  a String  "> => 0 ],
  [ q<"Foo bar baz">  => 0 ],
);

my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
for (@tests) {
  my($str,$expectMatch) = @$_;
  my $matched = $str =~ /$pattern/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",
        ": $str\n";
}

produces the following output:

$ ack-grep '"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' try
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",

With the C shell and derivatives, you have to escape the bang:

% ack-grep '"(?\![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' ...

I wish I could preserve the highlighted matches, but that doesn't seem to be allowed.

Note that escaped double-quotes (\") will severely confuse this pattern.
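If escaped quotes matter, one possible workaround is the procedural route the question mentions: tokenize the string literals first, then test each body separately. The sketch below assumes C-style literals in which a backslash escapes the next character. The function name mixed_case_constants is hypothetical, and an uppercase letter inside an escape sequence would still count as uppercase.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bodies of the mixed-case, whitespace-free
# string constants found on one line of source text.
sub mixed_case_constants {
    my ($line) = @_;
    my @hits;
    # A literal is a quote, then any run of escaped characters or of
    # characters that are neither a quote nor a backslash, then a quote.
    while ($line =~ /"((?:\\.|[^"\\])*)"/g) {
        my $body = $1;
        $body =~ s/\A\s+|\s+\z//g;      # leading/trailing spaces are allowed
        next if $body =~ /\s/;          # internal whitespace disqualifies
        next unless $body =~ /[A-Z]/;   # need at least one ASCII uppercase
        push @hits, $body;
    }
    return @hits;
}

print "$_\n" for mixed_case_constants('"ABigString", "a sentence fragment"');
```

Because each literal is consumed whole before it is tested, a line such as x = "say \"Hi\" now"; yields no hits, whereas the single-regex pattern would latch onto the "Hi\" fragment.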

You could add the requirement with a character class, like:

ack --match "\"\s*\S*[A-Z]\S*\s*\""

I'm assuming that ack matches one line at a time. Because \S also matches a double quote, the \S*\s*\" part can match multiple closing quotes in a row. It would match the entirety of "Alfa"" instead of just "Alfa".
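Since \S also matches a double quote, one way to keep a match inside a single pair of quotes is to use [^"\s] in its place. The following is a sketch of that variant rather than a tested replacement: it uses * rather than + around [A-Z] so that a lone uppercase letter still qualifies, and the names $variant and first_match are hypothetical.

```perl
use strict;
use warnings;

# Variant of the character-class idea: the string body may contain neither
# quotes nor whitespace, and needs at least one ASCII uppercase letter.
my $variant = qr/"\s*[^"\s]*[A-Z][^"\s]*\s*"/;

# Hypothetical helper: return the first matching quoted string, or undef.
sub first_match {
    my ($line) = @_;
    return $line =~ /($variant)/ ? $1 : undef;
}

for my $line ('"Alfa""', '"R"', '"A String"', '"a_string"') {
    my $hit = first_match($line);
    printf "%-12s %s\n", $line, defined $hit ? "matches $hit" : "no match";
}
```

Like the other single-regex approaches, this can still start at the closing quote of one literal and end at the opening quote of the next, so text such as "a b" C "d" would produce a false match on " C ".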
