Python regexp find two keywords in a line

I'm having a hard time understanding this regex stuff...

I have a string like this:

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">

I want to use findall() and groups to get this:

['56242','saddelmageri']

I can match the number with something like "synset-[0-9]" and the word with something like "{(.*?)}" but how do I write it to get the above result?

And here's a follow-up question - some lines look like this:

<wn20schema:NounSynset rdf:about="&dn;synset-2589" rdfs:label="{cykel_3: trehjulet cykel; tricykel,1_1}">

In this case I want to extract the stuff between the {} with this result:

['2589', ['cykel', 'trehjulet cykel', 'tricykel']]

so that I can drop it in a dictionary later as a key(2589) : value(['cykel', 'trehjulet cykel', 'tricykel']) pair.

Any thoughts?

Please see the top answer to this question. It is generally a terrible idea to parse XML with regular expressions. XML parsers are built for this purpose.

The quickest way to do this would probably be Python's built-in minidom (xml.dom.minidom).
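A minimal sketch of the minidom route, for illustration only. The question does not show the file header, so the DOCTYPE that declares the `&dn;` entity, its placeholder value `http://example.org/dn/`, and the namespace URIs in the wrapper element are all assumptions; the real file's own header should be used instead.

```python
from xml.dom import minidom

# Assumption: the real file declares &dn; and the namespaces in its header.
# The entity value and the wn20schema URI below are placeholders.
SAMPLE = """<!DOCTYPE rdf:RDF [<!ENTITY dn "http://example.org/dn/">]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:wn20schema="http://www.w3.org/2006/03/wn/wn20/schema/">
  <wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}"/>
</rdf:RDF>"""

doc = minidom.parseString(SAMPLE)

# Collect (about, label) for every NounSynset element, looked up by qualified name.
pairs = []
for elm in doc.getElementsByTagName("wn20schema:NounSynset"):
    pairs.append((elm.getAttribute("rdf:about"), elm.getAttribute("rdfs:label")))

print(pairs)
```

Note that minidom matches on the prefixed name as written in the file, so this lookup depends on the file using the `wn20schema:` prefix.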

Since this appears to be XML data, you would be better off using an XML parser; parsing XML with regular expressions is very, very difficult to do right.

However, since you specifically asked for a regular expression...

Your specifications are a bit imprecise, and with regular expressions you need to be very precise in what constitutes a match. For example, will the rdfs:label value always have a _1 that you want to strip off? Will there always only be one of these blocks of data per line, or multiple per line? Also, is the order of the result important?

Here's a quick hack that might give you close to what you want:

import re
# Sample line from the question
data = r'<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">'

# Group 1: the digits after "synset-"; group 2: the label text between "{" and "_1}"
matches = re.findall(r'synset-([0-9]+).*label="{(.*)_1}"', data)
print("matches:", matches)

When I run the above, I get the following output, which is a list of two-tuples containing the two strings you wanted (though as a tuple inside a list rather than a flat list):

matches: [('56242', 'saddelmageri')]

If you do a lot with this data, consider using a specialized RDF library (e.g. RDFLib). If not, an XML parser is definitely the way to go!

  • What if tomorrow it won't be on a single line?
  • What if tomorrow the label comes before the about?
  • There are at least a dozen more ways in which it can remain valid XML but break your regexp!
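To make the first two points concrete, here is a small sketch that runs the regular expression from the earlier answer against three spellings of the same element. The `swapped` and `multiline` variants are made up for illustration; only the `original` line comes from the question.

```python
import re

# The pattern from the quick hack in the earlier answer.
PATTERN = r'synset-([0-9]+).*label="{(.*)_1}"'

variants = {
    # The line as posted in the question.
    "original": '<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">',
    # Same element, attributes in the other order (hypothetical).
    "swapped": '<wn20schema:NounSynset rdfs:label="{saddelmageri_1}" rdf:about="&dn;synset-56242">',
    # Same element, attributes on separate lines (hypothetical).
    "multiline": '<wn20schema:NounSynset rdf:about="&dn;synset-56242"\n    rdfs:label="{saddelmageri_1}">',
}

results = {name: re.findall(PATTERN, text) for name, text in variants.items()}
for name, found in results.items():
    print(name, found)
```

Only the original spelling matches. The swapped form fails because the pattern expects `label` after `synset-`, and the multi-line form fails because `.` does not match a newline by default.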

Anyway, I tried applying an XML parser, but I'm getting an "undefined entity error" for the &dn; there. Can you post the top of the file (doctype, namespace definitions, and the like)?

You're doing two different kinds of parsing here, and you'll need to use two different tools.

First, you're parsing XML. For that, you need an XML parser, not regular expressions, because all of these elements are functionally identical XML:

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">
</wn20schema:NounSynset>

<wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}"/>

<wn20schema:NounSynset rdfs:label="{saddelmageri_1}" rdf:about="&dn;synset-56242"/>

and conceivably even:

<NounSynset xmlns="my_wn20schema_namespace_urn" C:label='not_of_interest' A:label='{saddelmageri_1}' B:about='&dn;synset-56242'/>

To parse that element, you need to know the names of the namespaces that the element and the attributes you're interested in belong to, and then use an XML parser to find them - specifically, an XML parser that properly supports XML namespaces and XPath, like lxml.

You'll end up with something like this to find the attributes you're looking for (assuming that doc is the parsed XML document, and that variables ending in _urn are strings containing the various namespace URNs):

def find_attributes(doc):
    # lxml addresses namespaced attributes in Clark notation: "{namespace}localname"
    for elm in doc.xpath('//x:NounSynset', namespaces={'x': wn20schema_namespace_urn}):
        yield (elm.get('{%s}about' % rdf_namespace_urn), elm.get('{%s}label' % rdfs_namespace_urn))
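For readers without lxml installed, the same idea can be sketched with the standard library's `xml.etree.ElementTree`, which also looks up namespaced names in `{namespace}localname` form. This swaps XPath for `iter()`. It is not the lxml code above, and the DOCTYPE, the `&dn;` value, and the `wn20schema` URI are assumptions standing in for the real file header.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
WN = "http://www.w3.org/2006/03/wn/wn20/schema/"  # assumed; check the real file

# Two spellings of the same element: open/close tags vs. self-closing,
# and attributes in either order. The entity value is a placeholder.
SAMPLE = """<!DOCTYPE rdf:RDF [<!ENTITY dn "http://example.org/dn/">]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:wn20schema="http://www.w3.org/2006/03/wn/wn20/schema/">
  <wn20schema:NounSynset rdf:about="&dn;synset-56242" rdfs:label="{saddelmageri_1}">
  </wn20schema:NounSynset>
  <wn20schema:NounSynset rdfs:label="{saddelmageri_1}" rdf:about="&dn;synset-56242"/>
</rdf:RDF>"""


def find_attributes(root):
    """Yield (about, label) for every NounSynset element under root."""
    for elm in root.iter("{%s}NounSynset" % WN):
        yield elm.get("{%s}about" % RDF), elm.get("{%s}label" % RDFS)


root = ET.fromstring(SAMPLE)
found = list(find_attributes(root))
print(found)
```

Both spellings yield the same pair, which is the point of going through a parser: attribute order and tag style no longer matter.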

Now you can look at the second part of the problem, which is parsing the values you need out of the attribute values you have. For that, you would use regular expressions. To parse the about attribute, this might work:

re.match(r'[^\d]*(\d*)', about).groups()[0]

which returns the first series of digit characters found. And to parse the label attribute, you might use:

re.match(r'{([^_]*)', label).groups()[0]

which returns all characters in label after a leading left brace and up to but not including the first underscore. (As far as parsing the second form of label that you posted, you haven't posted enough information for me to guess what a regular expression to parse that would look like.)
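Purely as a guess at that second form, here is a sketch that handles both sample labels from the question. It assumes that words inside the braces are separated by `:` or `;`, and that a trailing `_N` or `,N_N` suffix is a sense marker to be dropped. Neither rule is confirmed by the question, and the helper names `synset_id` and `label_words` are made up for this example.

```python
import re


def synset_id(about):
    """Return the digits following 'synset-' in the about value, or None."""
    m = re.search(r"synset-(\d+)", about)
    return m.group(1) if m else None


def label_words(label):
    """Split a label such as '{cykel_3: trehjulet cykel; tricykel,1_1}' into words."""
    inner = label.strip().strip("{}")
    # Assumption: ':' and ';' separate the individual words.
    parts = re.split(r"[:;]", inner)
    # Assumption: a trailing '_N' or ',N_N' is a sense suffix, not part of the word.
    return [re.sub(r"(,\d+)?_\d+$", "", part.strip()) for part in parts]


samples = [
    ("&dn;synset-56242", "{saddelmageri_1}"),
    ("&dn;synset-2589", "{cykel_3: trehjulet cykel; tricykel,1_1}"),
]

# Build the key : value dictionary described in the question.
synsets = {synset_id(about): label_words(label) for about, label in samples}
print(synsets)
```

The keys stay as strings here; wrap `synset_id` in `int()` if numeric keys are preferred.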
