
Regex to match special characters EXCEPT hyphen(s) mixed with number(s)

We are currently using [^a-zA-Z0-9] in Java's replaceAll function to strip special characters from a string. It has come to our attention that we need to allow hyphen(s) when they are mixed with number(s).

Examples for which hyphens will not be matched:

  • 1-2-3
  • -1-23-4562
  • --1---2--3---4-
  • --9--a--7
  • 425-12-3456

Examples for which hyphens will be matched:

  • --a--b--c
  • wal-mart
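To pin down the starting point, here is a minimal sketch of what the current pattern does to the examples above. The class name `CurrentBehaviour` and the method `stripAll` are made up for illustration.

```java
final class CurrentBehaviour {
    // The pattern currently in use: removes every character that is not
    // an ASCII letter or digit, hyphens and spaces included.
    static String stripAll(String input) {
        return input.replaceAll("[^a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAll("wal-mart"));    // walmart (wanted)
        System.out.println(stripAll("425-12-3456")); // 425123456 (hyphens lost, not wanted)
    }
}
```

The second call shows the problem: the hyphens in a token that contains digits are stripped as well.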

We think we formulated a regex to meet the latter criteria using this SO question as a reference but we have no idea how to combine it with the original regex [^a-zA-Z0-9] .

We want to do this to a Lucene search string because of the way Lucene's standard tokenizer works when indexing:

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

You can't do this with a single regex. (Well... maybe in Perl.)

( edit : Okay, you can do it with variable-length negative lookbehind, which it appears Java can (almost uniquely!) do; see Cyborgx37's answer. Regardless, imo, you shouldn't do this with a single regex. :))

What you can do is split the string into words and deal with each word individually. My Java is pretty terrible so here is some hopefully-sensible Python:

import re

# Example input; replace with the real search string
string = 'wal-mart 1-2-3'

# Precompile some regex
# A token counts as a product number if it contains at least one digit
looks_like_product_number = re.compile(r'[0-9]')
not_wordlike = re.compile(r'[^a-zA-Z0-9]')
not_wordlike_or_hyphen = re.compile(r'[^-a-zA-Z0-9]')

# Split on anything that's not a letter, number, or hyphen -- BUT dots
# must be followed by whitespace
words = re.split(r'(?:[^-.a-zA-Z0-9]|[.]\s)+', string)

stripped_words = []
for word in words:
    if '-' in word and not looks_like_product_number.search(word):
        # Plain word; drop hyphens along with other special characters
        stripped_word = not_wordlike.sub('', word)
    else:
        # Product number; allow dashes
        stripped_word = not_wordlike_or_hyphen.sub('', word)

    stripped_words.append(stripped_word)

# Prints 'walmart 1-2-3'; pass this string on to Lucene
print(' '.join(stripped_words))

When I run this with 'wal-mart 1-2-3' , I get back 'walmart 1-2-3' .
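Since the question is about Java, here is a rough Java sketch of the same split-then-strip idea. It treats any token containing a digit as a product number, following the examples in the question. The class name `TokenwiseStripper` and the method `strip` are assumptions of mine, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TokenwiseStripper {
    private static final Pattern HAS_DIGIT = Pattern.compile("[0-9]");
    private static final Pattern NOT_WORDLIKE = Pattern.compile("[^a-zA-Z0-9]");
    private static final Pattern NOT_WORDLIKE_OR_HYPHEN = Pattern.compile("[^-a-zA-Z0-9]");
    // Separator: a run of characters that are not letters, digits, hyphens
    // or dots, or a dot that is followed by whitespace.
    private static final Pattern SEPARATOR = Pattern.compile("(?:[^-.a-zA-Z0-9]|[.]\\s)+");

    static String strip(String input) {
        List<String> out = new ArrayList<>();
        for (String word : SEPARATOR.split(input)) {
            // Tokens with a digit keep their hyphens; all others lose them.
            boolean keepHyphens = HAS_DIGIT.matcher(word).find();
            Pattern toRemove = keepHyphens ? NOT_WORDLIKE_OR_HYPHEN : NOT_WORDLIKE;
            String cleaned = toRemove.matcher(word).replaceAll("");
            if (!cleaned.isEmpty()) {
                out.add(cleaned); // skip tokens that end up empty
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart 1-2-3")); // walmart 1-2-3
    }
}
```

Branching per token keeps each regex trivial, at the cost of a loop.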

But honestly, the above code reproduces most of what the Lucene tokenizer is already doing. I think you'd be better off just copying StandardTokenizer into your own project and modifying it to do what you want.

Have you tried this:

[^a-zA-Z0-9-]
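For reference, a quick sketch of what this character class does on its own (the class name `KeepAllHyphens` is made up). It keeps every hyphen, so `wal-mart` keeps its hyphen too, which is not what the question asks for.

```java
final class KeepAllHyphens {
    // Removes everything except ASCII letters, digits and hyphens.
    static String strip(String input) {
        return input.replaceAll("[^a-zA-Z0-9-]", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("1-2-3"));    // 1-2-3
        System.out.println(strip("wal-mart")); // wal-mart (hyphen is still there)
    }
}
```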

This question is tricky because Java does not allow unbounded repetition (such as \S*) inside a lookbehind, which is basically what you need. I've made do with a 100-character limit, as you will see, which you can increase if you expect the words to be longer.

This should work:

(?<![0-9]\S{0,100})[^a-zA-Z0-9\s](?!\S{0,100}[0-9])|[^-a-zA-Z0-9\s]

Just a simple replaceAll() with this expression should handle it.

For example, consider this input:

--9-+-a--7 wal-mart

The expression above, where the offending characters are replaced with a zero-length string, will render the following output:

--9--a--7 walmart

You can try it out here: http://fiddle.re/ynyu

Note that this expression depends on words being separated by white space (spaces, tabs, newlines, etc). Other characters, such as commas and semicolons, will cause the expression to consider the two words as one. For example '---9-a-0-,wal-mart' will be treated as a single word.

EDIT The last paragraph from my previous edit was incorrect. If you want to include other characters as delimiters, I recommend replacing them with whitespace in a first-pass (for example, replacing ',' with ' ').

I'm primarily a .NET programmer, otherwise I'd give you the complete Java code for using this pattern.
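For what it's worth, a Java version could look roughly like the sketch below. It uses a variant of the expression in which both character classes exclude digits and whitespace, so only punctuation is ever removed. Treat that variant, the class name `LookaroundStripper`, and the choice of `,` and `;` as extra delimiters as assumptions rather than tested production code.

```java
final class LookaroundStripper {
    private static final String STRIP =
        // Any character other than a letter, digit or whitespace, in a token
        // with no digit within 100 characters before or after it...
        "(?<![0-9]\\S{0,100})[^a-zA-Z0-9\\s](?!\\S{0,100}[0-9])"
        // ...or any character other than a letter, digit, whitespace or hyphen.
        + "|[^a-zA-Z0-9\\s\\-]";

    static String strip(String input) {
        return input.replaceAll(STRIP, "");
    }

    // First pass: turn other delimiters into whitespace so that the
    // look-arounds see them as token boundaries.
    static String stripWithDelimiters(String input) {
        return strip(input.replaceAll("[,;]", " "));
    }

    public static void main(String[] args) {
        System.out.println(strip("--9-+-a--7 wal-mart"));              // --9--a--7 walmart
        System.out.println(stripWithDelimiters("---9-a-0-,wal-mart")); // ---9-a-0- walmart
    }
}
```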

Forgive me posting a second answer instead of editing the first, but I am not entirely sure if the problem is to eliminate the dashes in the cases where they are immediately surrounded by letters, or if the intent is to eliminate dashes only in strings that do not contain numbers at all. This solution is for the latter case. My other solution is for the former case.

This pattern

String newValue = myString.replaceAll("[^\\sA-Za-z0-9\\-]|((?<!\\d\\S{0,100})-(?!\\S*\\d))", "");

should do it. There are two main pieces joined with an or. The first piece matches all non-alpha, non-numeric, non-dash, non-whitespace characters, since we want to strip these characters out no matter what. The second half of the or will match any dash that has no digit anywhere before it in the token and none anywhere after it in the token (i.e., no digits in the token at all, where a token is a run of non-whitespace, or \\S, characters). This is accomplished with the negative look-behind and look-ahead. We do have to leverage the fact that Java supports variable-width look-behind. Of course the replacement is just the empty string.

I have to admit, although the syntax for using regex is painful in Java (in the case where you have to use Pattern.compile, etc.), at least the engine supports some nice features. Although maybe not as nice as .NET according to Eevee.
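If the pattern is applied many times, the `Pattern.compile` route is worth the extra syntax. Here is a minimal sketch, assuming a bounded look-behind (`\\S{0,100}`) so the engine can work out a maximum length. The 100-character bound and the class name `CompiledStripper` are my own choices.

```java
import java.util.regex.Pattern;

final class CompiledStripper {
    // Compiled once and reused for every call.
    private static final Pattern STRIP = Pattern.compile(
        "[^\\sA-Za-z0-9\\-]"                      // not a letter, digit, dash or whitespace
        + "|(?<!\\d\\S{0,100})-(?!\\S{0,100}\\d)" // dash in a token without any digit
    );

    static String strip(String input) {
        return STRIP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart 425-12-3456")); // walmart 425-12-3456
    }
}
```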

I agree with others, though, in that this is not really something you typically want to do in a single regex. I don't know your exact situation, but a simple branch to detect whether or not it appears to be a product number, and then apply the correct pattern would be much more readable.


 