
Only 1 match in capture groups with standard C++ regex and PCRE

I have a problem I can't seem to solve: I'm trying to parse form data received through uWebSockets in C++.

I decided to use a regular expression for this. The C++ standard library regex didn't work and took about 5 minutes to run.

After trying the same pattern in several other languages, it looks to me like the problem is that C++ (and JavaScript, for that matter) does not allow backtracking in capture groups, since the pattern works fine in every other language I tried.

Switching to PCRE produced 1 match (and returned the result about 10x faster), but the remaining matches are all still empty.

You can see PCRE (v2 and v1) working as expected.

Here's an example that portrays the problem well enough:

#include <pcrecpp.h>
#include <iostream>

int main() {
    std::string contents = "--------------------------eba4d02620bdb4f6\nContent-Disposition: form-data; name=\"ZIP\"; filename=\"h.png\"\nContent-Type: image/png\n\n--------------------------8c078fed966ff6fe\nContent-Disposition: form-data; name=\"ZIP\"; filename=\"tree-pack.xml\"\nContent-Type: application/xml\n\n<?xml version=\"1.0\"?>\n<Packages>\n  <Individual name=\"Designer\">\n    <Name>Designer</Name>\n    <Description>A BrAIn-API add-on that adds routes to help people design. This makes routes to generate colour palettes, generates fonts and even send previews of those to show how they look.</Description>\n    <ID></ID>\n    <FilePath>/packages/ID</FilePath>\n  </Individual>\n</Packages>\n\n--------------------------8c078fed966ff6fe--\n\n--------------------------eba4d02620bdb4f6--\n";
    pcrecpp::RE reg("-+.+\\nContent-Disposition: form-data; name=\"(\\w+| +)\"; filename=\"(.+)\"\\nContent-Type: (\\w+\\/\\w+)\\n\\n((.|\\n)+)\\n-+.+--.+|\\n+", pcrecpp::RE_Options()
    .set_caseless(true)
    .set_multiline(true));
    pcrecpp::StringPiece input(contents);
    int count = 0;
    std::string match;

    std::cout << contents << std::endl;

    while (reg.FindAndConsume(&input, &match)) { //This while loop makes sure that it only logs the amount of matches it is able to find; giving it a defined amount of matches it needs to find has the same output.
        count++;
        std::cout << count << " " << match << std::endl;
    }
}

I compile it with g++ file.cpp -o file -lpcrecpp on Ubuntu 20.04. The output for me is:

--------------------------eba4d02620bdb4f6
Content-Disposition: form-data; name="ZIP"; filename="h.png"
Content-Type: image/png

--------------------------8c078fed966ff6fe
Content-Disposition: form-data; name="ZIP"; filename="tree-pack.xml"
Content-Type: application/xml

<?xml version="1.0"?>
<Packages>
  <Individual name="Designer">
    <Name>Designer</Name>
    <Description>A BrAIn-API add-on that adds routes to help people design. This makes routes to generate colour palettes, generates fonts and even send previews of those to show how they look.</Description>
    <ID></ID>
    <FilePath>/packages/ID</FilePath>
  </Individual>
</Packages>

--------------------------8c078fed966ff6fe--

--------------------------eba4d02620bdb4f6--

1 ZIP
2

If you have any suggestions for libraries that already parse form data, I'd love to hear them as well.

Thanks for reading and thanks in advance for any and all help I can get!

In your links, the pattern seemed to fail the same way as in your example code, so I changed the pattern to work correctly and altered the while loop for clarity.

With these changes (see below), I now find both matches:

1 Content-Disposition: form-data; ZIP h.png
2 Content-Disposition: form-data; ZIP tree-pack.xml
  • I added a group around (Content-Disposition: form-data;) for use later.
  • After "Content-Type:", I changed (.|\\n)+ to the non-greedy (.|\\n)+?.
  • I removed \\n-+.+--.+|\\n+ and replaced it with a negative lookahead on the first group: (?!\1). Note that inside a C++ string literal the backreference has to be written as \\1.
  • For demonstration purposes, I changed match to match1, match2 and match3.

Your original pattern was greedy: ((.|\\n)+) consumed everything up to the last boundary, so all later parts were swallowed by the first match.
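To see the greedy versus non-greedy difference in isolation, here is a minimal sketch. It uses std::regex instead of pcrecpp so that it builds without extra libraries, and the helper name first_groups is made up for illustration; it is not part of the original code.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect capture group 1 of every match of
// `pattern` (ECMAScript grammar, the std::regex default) in `text`.
std::vector<std::string> first_groups(const std::string& text,
                                      const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> groups;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        groups.push_back((*it)[1].str());
    }
    return groups;
}

// Two patterns that differ only in the quantifier after the header line.
// [\s\S] stands in for (.|\n): it matches any character, newlines included.
const std::string greedy_pattern = R"re(name="(\w+)"\n[\s\S]+--)re";
const std::string lazy_pattern   = R"re(name="(\w+)"\n[\s\S]+?--)re";
```

For an input such as "name=\"a\"\nbody one\n--\nname=\"b\"\nbody two\n--\n", greedy_pattern gives a single match with group a, because [\s\S]+ runs to the last "--". lazy_pattern stops at the first "--" each time and gives two matches, a and b.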

#include <pcrecpp.h>
#include <iostream>

int main() {
    std::string contents = "--------------------------eba4d02620bdb4f6\nContent-Disposition: form-data; name=\"ZIP\"; filename=\"h.png\"\nContent-Type: image/png\n\n--------------------------8c078fed966ff6fe\nContent-Disposition: form-data; name=\"ZIP\"; filename=\"tree-pack.xml\"\nContent-Type: application/xml\n\n<?xml version=\"1.0\"?>\n<Packages>\n  <Individual name=\"Designer\">\n    <Name>Designer</Name>\n    <Description>A BrAIn-API add-on that adds routes to help people design. This makes routes to generate colour palettes, generates fonts and even send previews of those to show how they look.</Description>\n    <ID></ID>\n    <FilePath>/packages/ID</FilePath>\n  </Individual>\n</Packages>\n\n--------------------------8c078fed966ff6fe--\n\n--------------------------eba4d02620bdb4f6--\n";
    pcrecpp::RE reg("-+.+\\n(Content-Disposition: form-data;) name=\"(\\w+| +)\"; filename=\"(.+)\"\\nContent-Type: (\\w+\\/\\w+)\\n\\n((.|\\n)+?)(?!\\1)", pcrecpp::RE_Options()
    .set_caseless(true)
    .set_multiline(true));
    pcrecpp::StringPiece input(contents);
    int count = 0;
    std::string match1, match2, match3;

    std::cout << contents << std::endl;

    while (reg.FindAndConsume(&input, &match1, &match2, &match3)) {
        count++;
        std::cout << count << " " << match1
            << " " << match2 << " " << match3 << std::endl;
    }
}
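On the request for a way to parse form data without one large regex: one option is to scan the body line by line and pull the fields out of the header lines with plain string searches. The following is only a sketch under assumptions. PartInfo, quoted_value and parse_part_headers are hypothetical names, header names are matched case-sensitively, each parameter is assumed to appear as ; key="value", and the part bodies are ignored. It is not a complete multipart/form-data parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record holding the header fields of one part.
struct PartInfo {
    std::string name;
    std::string filename;
    std::string content_type;
};

// Return the value of `; key="value"` in `line`, or "" if it is absent.
// The leading "; " keeps a search for name from matching inside filename.
std::string quoted_value(const std::string& line, const std::string& key) {
    const std::string needle = "; " + key + "=\"";
    std::string::size_type start = line.find(needle);
    if (start == std::string::npos) return "";
    start += needle.size();
    const std::string::size_type end = line.find('"', start);
    if (end == std::string::npos) return "";
    return line.substr(start, end - start);
}

// Walk the body line by line. Each Content-Disposition line starts a new
// part; the first Content-Type line after it fills in that part's type.
std::vector<PartInfo> parse_part_headers(const std::string& body) {
    static const std::string disposition = "Content-Disposition: form-data;";
    static const std::string type = "Content-Type: ";
    std::vector<PartInfo> parts;
    std::istringstream in(body);
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate CRLF line endings by dropping a trailing '\r'.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.compare(0, disposition.size(), disposition) == 0) {
            parts.push_back({quoted_value(line, "name"),
                             quoted_value(line, "filename"), ""});
        } else if (!parts.empty() && parts.back().content_type.empty() &&
                   line.compare(0, type.size(), type) == 0) {
            parts.back().content_type = line.substr(type.size());
        }
    }
    return parts;
}
```

Calling parse_part_headers(contents) on the string from the question should yield two entries, one for h.png and one for tree-pack.xml. A line inside a file body that happens to start with one of the header names would be misread, so real input needs proper boundary handling.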
