Given a string, generate a regex that can parse *similar* strings

For example, given the string "2009/11/12" I want to get the regex ("\\d{4}/\\d{2}/\\d{2}"), so I'll be able to match "2001/01/02" too.

Is there something that does that? Something similar? Any ideas as to how to do it?

There is text2re, a free web-based "regex by example" generator.

I don't think this is available in source code, though. I dare to say there is no automatic regex generator that gets it right without user intervention, since this would require the machine knowing what you want.


Note that text2re uses a template-based, modularized and very generalized approach to regular expression generation. The expressions it generates work, but they are much more complex than the equivalent hand-crafted expression. It is not a good tool to learn regular expressions because it does a pretty lousy job at setting examples.

For instance, the string "2009/11/12" would be recognized as a yyyymmdd pattern, which is helpful. The tool transforms it into this 125-character monster:

((?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3}))[-:\/.](?:[0]?[1-9]|[1][012])[-:\/.](?:(?:[0-2]?\d{1})|(?:[3][01]{1})))(?![\d])

The hand-made equivalent would take up less than half of that (54 characters):

([12]\d{3})[-:/.](0?\d|1[0-2])[-:/.]([0-2]?\d|3[01])\b
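
To see how the two expressions behave side by side, here is a minimal Java sketch that runs both against a few inputs. The class name and the choice of test inputs are assumptions made for illustration; the two patterns are copied from the answer above.

```java
import java.util.regex.Pattern;

class DateRegexComparison {
    // The expression produced by text2re, copied from the answer above.
    static final Pattern GENERATED = Pattern.compile(
        "((?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3}))[-:\\/.]"
        + "(?:[0]?[1-9]|[1][012])[-:\\/.]"
        + "(?:(?:[0-2]?\\d{1})|(?:[3][01]{1})))(?![\\d])");

    // The hand-written expression from the answer above.
    static final Pattern HAND_MADE = Pattern.compile(
        "([12]\\d{3})[-:/.](0?\\d|1[0-2])[-:/.]([0-2]?\\d|3[01])\\b");

    public static void main(String[] args) {
        String[] inputs = {"2009/11/12", "2001/01/02", "2009.11.12", "2312/45/67"};
        for (String input : inputs) {
            // matches() requires the whole input to fit the pattern.
            System.out.printf("%-12s generated=%b hand-made=%b%n", input,
                GENERATED.matcher(input).matches(),
                HAND_MADE.matcher(input).matches());
        }
    }
}
```

Both patterns accept the first three inputs and reject "2312/45/67", because neither allows a month of 45.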

It's not possible to write a general solution for your problem. The trouble is that any generator probably wouldn't know what you want to check for, e.g., should "2312/45/67" be allowed too? What about "2009.11.12"?

What you could do is write such a generator yourself that is suited for your exact problem, but a general solution won't be possible.
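
The ambiguity can be made concrete with a small Java sketch. The three candidate patterns below are hypothetical choices, not the output of any tool; each of them is consistent with the single sample "2009/11/12", yet they disagree on the two inputs mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class AmbiguityDemo {
    // Hypothetical candidates: all of them match the sample "2009/11/12".
    static Map<String, Pattern> candidates() {
        Map<String, Pattern> map = new LinkedHashMap<>();
        map.put("digits only", Pattern.compile("\\d{4}/\\d{2}/\\d{2}"));
        map.put("any separator", Pattern.compile("\\d{4}[-/.]\\d{2}[-/.]\\d{2}"));
        map.put("calendar-aware",
            Pattern.compile("[12]\\d{3}/(0[1-9]|1[0-2])/(0[1-9]|[12]\\d|3[01])"));
        return map;
    }

    static boolean accepts(String candidateName, String input) {
        return candidates().get(candidateName).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"2009/11/12", "2312/45/67", "2009.11.12"};
        for (String input : inputs) {
            for (String name : candidates().keySet()) {
                System.out.printf("%-12s %-15s %b%n", input, name, accepts(name, input));
            }
        }
    }
}
```

Nothing in the sample alone says which of the three the user meant, which is the point made above.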

Excuse me, but what you all call impossible is clearly an achievable task. It will not be able to give results for ALL examples, and maybe not the best results, but you can give it various hints, and it will make life easy. A few examples will follow.

Also a readable output translating the result would be very useful. Something like:

  • Search for: a word starting with a non-numeric character and ending with the string "ing".
  • or: Search for: text that has "bbb" in it, followed somewhere by "zzz".
  • or: Search for: a pattern that looks like "aa/bbbb/cccc", where "/" is a separator, "aa" is two digits, "bbbb" is a word of any length, and "cccc" is four digits between 1900 and 2020.

Maybe we could make a "back translator" with an SQL type of language to create regex, instead of creating it in geekish.
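
One way such a "back translator" might look is a small builder that records a readable description alongside each regex fragment. This is only a sketch: the class `DescribedRegex` and its methods are invented for illustration, and the "between 1900 and 2020" range check from the last bullet is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DescribedRegex {
    private final StringBuilder regex = new StringBuilder();
    private final List<String> parts = new ArrayList<>();

    // Each method appends a regex fragment and its plain-English description.
    DescribedRegex digits(int count) {
        regex.append("\\d{").append(count).append('}');
        parts.add(count + " digits");
        return this;
    }

    DescribedRegex word() {
        regex.append("[A-Za-z]+");
        parts.add("a word of any length");
        return this;
    }

    DescribedRegex separator(String text) {
        regex.append(Pattern.quote(text)); // quote so the separator is taken literally
        parts.add("the separator \"" + text + "\"");
        return this;
    }

    String regex() {
        return regex.toString();
    }

    String description() {
        return "Search for: " + String.join(", then ", parts);
    }

    public static void main(String[] args) {
        DescribedRegex pattern = new DescribedRegex()
            .digits(2).separator("/").word().separator("/").digits(4);
        System.out.println(pattern.regex());
        System.out.println(pattern.description());
        System.out.println(Pattern.matches(pattern.regex(), "11/November/1999"));
    }
}
```

Going the other way, from an arbitrary existing regex to a description, would need a regex parser and is not attempted here.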

Here are a few examples that are doable:

class Hint {
    HintType HintType;      // what kind of hint this is
    string HintString;      // the value of the hint
}
enum HintType { Separator, ParamDescription, NumberOfParameters }
enum SampleType { FreeText, DateOrTime, Formatted, ... }

public string RegexBySamples( List<string> samples,
         List<SampleType> sampleTypes,
         List<Hint> hints,
         out string generalRegExp, out string description,
         out string generalDescription)...

regex = RegexBySamples( {"11/November/1999", "2/January/2003"},
                     SampleType.DateOrTime,
                     new HintList( HintType.NumberOfParameters, 3 ));

regex = RegexBySamples( "123-aaaaJ-1444",
                         SampleType.Formatted, HintType.Separator, "-" );
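
As a rough idea of how the separator hint could be used, here is a minimal Java sketch. It is not the API proposed above: the class `SeparatorHintGenerator`, the method `regexBySamples`, and the field-classification rules (digits, ASCII letters, or a non-whitespace fallback) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SeparatorHintGenerator {

    // Hypothetical helper: derives one regex from samples that share a separator.
    static String regexBySamples(List<String> samples, String separator) {
        String quotedSeparator = Pattern.quote(separator);
        List<String[]> rows = new ArrayList<>();
        for (String sample : samples) {
            // The limit -1 keeps empty trailing fields instead of dropping them.
            rows.add(sample.split(quotedSeparator, -1));
        }
        int fieldCount = rows.get(0).length;
        for (String[] row : rows) {
            if (row.length != fieldCount) {
                throw new IllegalArgumentException("Samples disagree on the number of fields");
            }
        }

        StringBuilder regex = new StringBuilder();
        for (int field = 0; field < fieldCount; field++) {
            boolean allDigits = true;
            boolean allLetters = true;
            int min = Integer.MAX_VALUE;
            int max = 0;
            for (String[] row : rows) {
                String value = row[field];
                allDigits &= value.matches("\\d+");
                allLetters &= value.matches("[A-Za-z]+");
                min = Math.min(min, value.length());
                max = Math.max(max, value.length());
            }
            if (field > 0) {
                regex.append(quotedSeparator);
            }
            // Mixed fields fall back to "any non-whitespace character".
            regex.append(allDigits ? "\\d" : allLetters ? "[A-Za-z]" : "\\S");
            // Use the observed lengths as the allowed range.
            regex.append(min == max ? "{" + min + "}" : "{" + min + "," + max + "}");
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(regexBySamples(
            Arrays.asList("11/November/1999", "2/January/2003"), "/"));
        System.out.println(regexBySamples(
            Arrays.asList("123-aaaaJ-1444"), "-"));
    }
}
```

With only the two date samples, the month field is limited to 7 or 8 letters, so "3/May/2010" would be rejected. This kind of overfitting is why more samples or extra hints are needed.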

A GUI where you mark sample text or enter it, adding it to the regex, would be possible too. First you mark a date (the "sample") and choose whether this text is already formatted or whether you are building a format, and also what the format type is: free text, formatted text, date, GUID, or Choose... from existing formats (which you can store in a library).

Let's design a spec for this, and make it open source... Does anybody want to join?

I've tried a very naive approach:

import java.util.regex.Pattern;

class RegexpGenerator {

    // Regex metacharacters that must be escaped when they appear as literals.
    private static final String METACHARACTERS = "\\.[]{}()*+?^$|";

    public static Pattern generateRegexp(String prototype) {
        return Pattern.compile(generateRegexpFrom(prototype));
    }

    private static String generateRegexpFrom(String prototype) {
        StringBuilder stringBuilder = new StringBuilder();

        for (int i = 0; i < prototype.length(); i++) {
            char c = prototype.charAt(i);

            if (Character.isDigit(c)) {
                stringBuilder.append("\\d");
            } else if (Character.isLetter(c)) {
                stringBuilder.append("\\w");
            } else { // fall-through: literal character
                if (METACHARACTERS.indexOf(c) >= 0) {
                    // Escape it, otherwise '.' would match any character.
                    stringBuilder.append('\\');
                }
                stringBuilder.append(c);
            }
        }

        return stringBuilder.toString();
    }

    private static void test(String prototype) {
        Pattern pattern = generateRegexp(prototype);
        System.out.println(String.format("%s -> %s", prototype, pattern));

        // The generated pattern must at least match the string it was built from.
        if (!pattern.matcher(prototype).matches()) {
            throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        String[] prototypes = {
            "2009/11/12",
            "I'm a test",
            "me too!!!",
            "124.323.232.112",
            "ISBN 332212"
        };

        for (String prototype : prototypes) {
            test(prototype);
        }
    }
}

output:

2009/11/12 -> \d\d\d\d/\d\d/\d\d
I'm a test -> \w'\w \w \w\w\w\w
me too!!! -> \w\w \w\w\w!!!
124.323.232.112 -> \d\d\d\.\d\d\d\.\d\d\d\.\d\d\d
ISBN 332212 -> \w\w\w\w \d\d\d\d\d\d

As others have already outlined, a general solution to this problem is impossible. This class is applicable only in a few contexts.
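
The per-character output above can be shortened by collapsing runs of the same token into a counted quantifier, which yields the \d{4}/\d{2}/\d{2} form the question asks for. The following is a sketch of that variation, not part of the original class; the name `CollapsingGenerator` is made up, and it uses [A-Za-z] for ASCII letters instead of \w as a deliberate simplification.

```java
import java.util.regex.Pattern;

class CollapsingGenerator {
    // Regex metacharacters that must be escaped when used as literals.
    private static final String METACHARACTERS = "\\.[]{}()*+?^$|";

    // Maps one character to the regex token that stands for its class.
    static String tokenFor(char c) {
        if (c >= '0' && c <= '9') {
            return "\\d";
        }
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            return "[A-Za-z]";
        }
        return (METACHARACTERS.indexOf(c) >= 0 ? "\\" : "") + c;
    }

    static String generate(String prototype) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < prototype.length()) {
            String token = tokenFor(prototype.charAt(i));
            // Count how many following characters map to the same token.
            int run = 1;
            while (i + run < prototype.length()
                    && tokenFor(prototype.charAt(i + run)).equals(token)) {
                run++;
            }
            regex.append(token);
            if (run > 1) {
                regex.append('{').append(run).append('}');
            }
            i += run;
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = generate("2009/11/12");
        System.out.println(regex); // \d{4}/\d{2}/\d{2}
        System.out.println(Pattern.matches(regex, "2001/01/02"));
    }
}
```

The same limitation applies: the result fixes the exact lengths seen in the single prototype, so it only suits inputs with a rigid layout.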

Loreto pretty much does this. It's an open source implementation using the longest common substring(s) to generate the regular expressions. It needs multiple examples, of course.
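
To illustrate the general idea of working from a longest common substring, here is a rough Java sketch. It is not Loreto's code or API, which is not shown in this thread; the class and method names are invented, the search is brute force, and the resulting regex simply anchors on the shared text with wildcards on either side.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CommonSubstringRegex {

    // Longest substring shared by every sample (brute force, fine for short inputs).
    static String longestCommonSubstring(List<String> samples) {
        String shortest = samples.get(0);
        for (String sample : samples) {
            if (sample.length() < shortest.length()) {
                shortest = sample;
            }
        }
        // Try the longest candidates first, so the first hit is a longest one.
        for (int length = shortest.length(); length > 0; length--) {
            for (int start = 0; start + length <= shortest.length(); start++) {
                String candidate = shortest.substring(start, start + length);
                boolean inAll = true;
                for (String sample : samples) {
                    if (!sample.contains(candidate)) {
                        inAll = false;
                        break;
                    }
                }
                if (inAll) {
                    return candidate;
                }
            }
        }
        return "";
    }

    // Builds a loose regex: anything, then the shared text, then anything.
    static String regexFor(List<String> samples) {
        String common = longestCommonSubstring(samples);
        return common.isEmpty() ? ".*" : ".*" + Pattern.quote(common) + ".*";
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
            "order-1234.pdf", "order-98.pdf", "order-7.pdf");
        System.out.println(longestCommonSubstring(samples));
        System.out.println(regexFor(samples));
    }
}
```

A real tool would likely use several common substrings and tighter patterns between them; this sketch only shows why multiple examples are required before anything can be inferred.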

No, you cannot get a regex that matches what you want reliably, since the regex would not contain semantic information about the input (i.e., it would need to know it's generating a regex for dates). If the issue is with dates only, I would recommend trying multiple regular expressions and seeing if one of them matches all of the samples.

I'm not sure if this is possible, at least not without many sample strings and some learning algorithm.

There are many regexes that would match, and it's not possible for a simple algorithm to pick the 'right' one. You'd need to give it some delimiters or other things to look for, so you might as well just write the regex yourself.

Sounds like a machine learning problem. You'll have to have more than one example on hand (many more) and an indication of whether or not each example is considered a match.

In addition to feeding the learning algorithm examples of "good" input, you could feed it "bad" input so it would know what not to look for. No letters in a phone number, for example.
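
A very small version of this idea can be sketched without any learning library: take a list of candidate regexes ordered from most general to most specific and keep the first one that accepts every "good" example and rejects every "bad" one. The class name, the candidate list, and the phone-number examples below are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

class ExampleDrivenSelection {

    // Returns the first candidate that accepts all good and rejects all bad examples.
    static Optional<String> select(List<String> candidates,
                                   List<String> good, List<String> bad) {
        for (String candidate : candidates) {
            Pattern pattern = Pattern.compile(candidate);
            boolean acceptsGood = good.stream().allMatch(s -> pattern.matcher(s).matches());
            boolean rejectsBad = bad.stream().noneMatch(s -> pattern.matcher(s).matches());
            if (acceptsGood && rejectsBad) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical candidates, ordered from most general to most specific.
        List<String> candidates = Arrays.asList(
            ".+", "[\\w-]+", "[\\d-]+", "\\d{3}-\\d{4}");
        List<String> good = Arrays.asList("555-1234", "867-5309");

        // One bad example (letters in a phone number) rules out the two loosest patterns.
        System.out.println(select(candidates, good, Arrays.asList("555-ABCD")));
        // A second bad example (missing dash) forces the most specific pattern.
        System.out.println(select(candidates, good, Arrays.asList("555-ABCD", "5551234")));
    }
}
```

Each additional bad example narrows the choice, which is the role the negative input plays in the suggestion above.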

I don't remember the name, but if my computability theory serves me right, it's impossible in theory :)

I haven't found anything that does it, but since the problem domain is relatively small (you'd be surprised how many people use the weirdest date formats), I've been able to write some kind of a "date regular expression generator". Once I'm satisfied with the unit tests, I'll publish it, just in case someone ever needs something of the kind.

Thanks to everyone who answered (the guy with the (.*) excluded; jokes are great, but this one was sssssssssoooo lame :) )
