Python regular expression for PHP array parsing

I have a function that parses PHP array declarations from files. It returns a dictionary whose keys are the keys of the PHP array and whose values are the corresponding values from the PHP array.

Example file:

$lang['identifier_a'] = 'Welcome message';
$lang['identifier_b'] = 'Welcome message.
You can do things a,b, and c here.

Please be patient.';
$lang['identifier_c'] = 'Welcome message2.
You can do things a,b, and c here.
Please be patient.';
$lang['identifier_d'] = 'Long General Terms and Conditions with more text';
$lang['identifier_e'] = 'General Terms and Conditions';
$lang['identifier_f'] = 'Text e';

Python function

def fetch_lang_keys(filename):
    ''' fetches all the language keys for filename '''
    from re import search

    with open(filename) as fi:
        lines = fi.readlines()

    data = {}
    for line in lines:
        # raw string so that the backslashes reach the regex engine unchanged
        obj = search(r"\$lang\[[\'|\"](.{1,})[\'|\"]\] = [\'|\"](.{1,})[\'|\"];", line)
#        re.match(r'''\$lang\[[\'|\"](.{1,})[\'|\"]\] = [\'|\"](.{1,})[\'|\"];''', re.MULTILINE | re.VERBOSE);

        if obj:
            data[obj.group(1)] = obj.group(2)

    return data

This function should return a dictionary which should look like this:

data['identifier_a'] = 'Welcome message'
data['identifier_b'] = 'Welcome message.
You can do things a,b, and c here.

Please be patient.'
# and so on

The regexp used in the function works for everything except identifier_b and identifier_c, because it does not match blank lines and/or lines which do not end with ;. Using the wildcard operator with ; at the end did not work either, because it matched too much.

Do you have any idea of how to solve this? I looked into lookahead assertions, but failed to use them properly. Thanks.

Well, while my answer is not a solution to your regexp problem: why don't you use a "real PHP parser" instead of home-brewed regexps? It could be much more reliable, might even be faster, and is certainly a more maintainable solution.

A quick search gave me: https://github.com/ramen/phply . I also found this: Parse PHP file variables from Python script. Hope this helps.

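It doesn't work because the dot doesn't match newlines by default. Use the single-line modifier ( re.DOTALL ) instead of the multiline modifier, and apply the pattern to the whole file contents rather than to one line at a time, since a single line never contains a complete multi-line value. Example, where text is the result of fi.read() (use re.finditer with the same arguments to collect every entry, not just the first):

obj = re.search(r'\$lang\[[\'"](.+?)[\'"]\] = [\'"](.+?)[\'"];', text, re.DOTALL)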
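As a fuller sketch of the re.DOTALL approach, the function from the question could be reworked roughly as follows. It reads the file in one go and scans the text with re.finditer, so multi-line values are seen whole. The names PHP_SOURCE, LANG_ENTRY and parse_lang_entries are made up for this illustration, and the sample text is an excerpt of the example file from the question.

```python
import re

# Illustrative sample: an excerpt of the example file from the question.
PHP_SOURCE = """$lang['identifier_a'] = 'Welcome message';
$lang['identifier_b'] = 'Welcome message.
You can do things a,b, and c here.

Please be patient.';
$lang['identifier_f'] = 'Text e';
"""

# Same pattern as in the answer. re.DOTALL makes "." match newlines too,
# and the non-greedy ".+?" stops at the first quote followed by ";".
LANG_ENTRY = re.compile(r'\$lang\[[\'"](.+?)[\'"]\] = [\'"](.+?)[\'"];', re.DOTALL)


def parse_lang_entries(text):
    """Return {key: value} for every $lang[...] assignment found in text."""
    return {m.group(1): m.group(2) for m in LANG_ENTRY.finditer(text)}


def fetch_lang_keys(filename):
    """Read the whole file at once so that multi-line values stay intact."""
    with open(filename) as fi:
        return parse_lang_entries(fi.read())


if __name__ == "__main__":
    for key, value in parse_lang_entries(PHP_SOURCE).items():
        print(key, repr(value))
```

One caveat with this sketch: a value that itself contains a quote followed by a semicolon would be cut off at that point, which is the kind of case where a real PHP parser is the safer choice.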

This regex seems to work:

\$lang\[[\'|\"](.{1,})[\'|\"]\] = [\'|\"]((?:.|\n)+?)[\'|\"];
                                          ^^^^^^^^^^
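A minimal sketch of how this variant behaves, assuming the pattern is run against the whole file contents rather than line by line. The group (?:.|\n) matches either a non-newline character or a newline, so no re.DOTALL flag is needed and the key group still cannot run past the end of its line. SAMPLE and PATTERN are illustrative names, and the character classes are written as ['"] here because the | inside the original [\'|\"] is a literal pipe, not an alternation.

```python
import re

# Illustrative sample: two entries from the example file, one spanning several lines.
SAMPLE = (
    "$lang['identifier_c'] = 'Welcome message2.\n"
    "You can do things a,b, and c here.\n"
    "Please be patient.';\n"
    "$lang['identifier_e'] = 'General Terms and Conditions';\n"
)

# No flags: "." still excludes newlines, but the value group also accepts "\n".
PATTERN = re.compile(r'''\$lang\[['"](.{1,})['"]\] = ['"]((?:.|\n)+?)['"];''')

# findall returns (key, value) tuples, which dict() accepts directly.
data = dict(PATTERN.findall(SAMPLE))

for key, value in data.items():
    print(key, repr(value))
```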

