Extract string using regex

Question

How can I extract the content (how are you) from the string:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.

Can I use a regex for this? If so, what is a suitable regex?

Note: I don't want to use the split function to extract the result. Also, can you suggest some links for a beginner to learn regex?

I am using Python 2.7.2.

Answer 1

You could use a regular expression for this (as Joey demonstrates).

However, if your XML document is any bigger than this one-liner, a regex will not handle it reliably, since XML is not a regular language.

Use BeautifulSoup (or another XML parser) instead:

>>> from BeautifulSoup import BeautifulSoup
>>> xml_as_str = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>. '
>>> soup = BeautifulSoup(xml_as_str)
>>> print soup.text
how are you.

Or...

>>> for string_tag in soup.findAll('string'):
...     print string_tag.text
... 
how are you
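As one instance of "another XML parser", here is a minimal sketch using xml.etree.ElementTree from the standard library. It assumes the input is a single well-formed element, so the trailing period from the question is left out of the string; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The element from the question, without the trailing period
# (the parser rejects text after the root element).
xml_as_str = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>'

root = ET.fromstring(xml_as_str)

# .text holds the character data directly inside the element.
content = root.text
print(content)

# The namespace becomes part of the tag name, in {uri}name form.
tag_name = root.tag
```

This form runs on both Python 2.7 and Python 3, and no third-party package is needed.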

Answer 2

(?<=<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">)[^<]+(?=</string>)

would match what you want, as a trivial example.

(?<=>)[^<]+

would, too. It all depends a bit on how your input is formatted exactly.
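For Python specifically, a sketch of how the first, stricter pattern might be applied with the re module. The names opening, strict and result are illustrative only. Python's lookbehind must be fixed-width, which holds here because the opening tag is a literal string.

```python
import re

text = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.'

# Literal opening tag; re.escape keeps characters such as "." from
# being treated as regex metacharacters inside the lookbehind.
opening = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">'
strict = re.compile('(?<=' + re.escape(opening) + ')[^<]+(?=</string>)')

match = strict.search(text)
# The lookarounds are not part of the match, so group(0) is the content only.
result = match.group(0) if match else None
print(result)
```

Because the lookbehind and lookahead consume nothing, no capture group is needed to isolate the text.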

Answer 3

Try using the following regular expression:

/<[^>]*>(.*?)</
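The leading and trailing slashes are regex delimiters used in other languages and are not part of a Python pattern. A sketch of the same expression in Python, assuming the wanted text sits right after the first tag in the input:

```python
import re

text = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.'

# <[^>]*> matches the opening tag; (.*?) lazily captures everything
# up to the next "<".
match = re.search(r'<[^>]*>(.*?)<', text)

# group(1) is the captured text between the tags.
content = match.group(1) if match else None
print(content)
```

Unlike the lookaround patterns, the match here includes the tag itself, so the content has to be read from the capture group.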

Answer 4

This will match a generic HTML tag (Replace "string" with the tag you want to match):

/<string[^<]*>(.*?)<\/string>/i

(i = case-insensitive)
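Python has no /.../i literal syntax; the closest equivalent is passing re.IGNORECASE as a flag. A sketch under that assumption, wrapped in a hypothetical helper named extract_tag_text (not part of any library):

```python
import re

def extract_tag_text(tag, text):
    """Return the text inside the first <tag ...>...</tag> pair, or None."""
    name = re.escape(tag)
    # Same shape as the pattern above, with the tag name substituted in.
    pattern = '<' + name + '[^<]*>(.*?)</' + name + '>'
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else None

sample = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.'
print(extract_tag_text('string', sample))
```

With the flag set, differently cased tags such as STRING or String match as well.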

4 answers

solution1
2 ACCEPTED 2012-01-27 10:23:41

solution2
0 2012-01-27 10:24:25

solution3
0 2012-01-27 10:24:29

solution4
0 2012-01-27 10:35:48
