Regex of XML with multiple tags

Question

I'm trying to find all text that is not within the XML markup:

<transcript>
  <text start="9.75" dur="5.94">welcome to about my property here you
can learn more about how your property</text>
  <text start="15.69" dur="4.71">was assessed see the information impact
has on file and compare your property to</text>
  <text start="20.4" dur="1.3">others in your neighborhood</text>
  <text start="21.7" dur="5.32">interested in learning about market
trends in your municipality no problem</text>
  <text start="105.79" dur="6.23">I have all of this and more about life property
. see your property assessment know more</text>
  <text start="112.02" dur="0.11">about</text>
</transcript>

I am using the following regex pattern, but obviously it is not correct because it grabs all of the text between the opening and closing <transcript> tags:

<transcript>[\s\S]*?<\/transcript>

How can I modify this regex pattern to select only the text that is not within any of the markup tags?

Answer 1

Use XSLT. XSLT is a language designed specifically to transform XML into another output format: valid XML again, or something else such as (X)HTML, plain text, or any other (preferably text-based) format.

In this case the smallest XSLT necessary is just this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" >
<xsl:output method="text" indent="no" />

<xsl:template match="text">
    <!-- do NOTHING here! -->
</xsl:template>

</xsl:stylesheet>

This works because the default rule for an element is to recursively apply templates to its child nodes, and the default rule for a text node is to copy it to the output. The only element inside your <transcript> is <text>, and you process it by doing nothing, i.e. by not copying its contents to the output. The line inside that template is just a comment.

All other text nodes, i.e. text that is not enclosed in a <text> element, have no template of their own and so are copied to the output by the default rule.

Alternatively, if you have more types of elements than just <text> and you want to skip all of them, add templates for / and transcript that simply apply templates to their children, and a template for * (which matches every element not matched by a more specific template) that does nothing:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" >
<xsl:output method="text" indent="no" />

<xsl:template match="/">
    <xsl:apply-templates />
</xsl:template>

<xsl:template match="transcript">
    <xsl:apply-templates />
</xsl:template>

<xsl:template match="*">
    <!-- do NOTHING here! -->
</xsl:template>

</xsl:stylesheet>

Again, plain text that sits directly inside <transcript> is not matched by any of these templates, so the default rule copies it to the output.

Both XSLT stylesheets will output only I ha , the only part in your sample text that is not surrounded by tags.
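
If you cannot run an XSLT processor, the same distinction can be sketched in Python with the standard library. This is not XSLT, only a rough stand-in for what the second stylesheet does one level below <transcript>. The sample string, the stray "stray note" text, and the function names text_outside_children and text_inside are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "stray note" is text that sits directly inside
# <transcript>, outside every <text> element.
SAMPLE = """<transcript>
  stray note
  <text start="9.75" dur="5.94">welcome to about my property</text>
  <text start="15.69" dur="4.71">was assessed</text>
</transcript>"""


def text_outside_children(xml_source: str) -> str:
    """Return text directly under the root, skipping all child elements."""
    root = ET.fromstring(xml_source)
    # root.text is the text before the first child; child.tail is the
    # text that follows each child's closing tag.
    pieces = [root.text or ""]
    for child in root:
        pieces.append(child.tail or "")
    # Collapse runs of whitespace (indentation, newlines) to single spaces.
    return " ".join("".join(pieces).split())


def text_inside(xml_source: str, tag: str = "text") -> list[str]:
    """Return the whitespace-normalized content of every <tag> element."""
    root = ET.fromstring(xml_source)
    return [" ".join("".join(el.itertext()).split()) for el in root.iter(tag)]


print(text_outside_children(SAMPLE))
print(text_inside(SAMPLE))
```

If what you actually want is the caption text inside the <text> elements rather than the text outside them, text_inside is the function to use.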

Answer 2

Do you want to find

welcome to about my property here you can learn more about how your property

from

<text start="9.75" dur="5.94">welcome to about my property here you can learn more about how your property</text>

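?

If so, this pattern will work, provided the text between the tags is on a single line (by default . does not match a newline):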

(?<=>).+?(?=<)
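
A minimal sketch of how the pattern behaves in Python's re module, using a shortened copy of the transcript from the question. The variant with [^<]+ is my own adjustment, not part of the pattern above: it also matches across line breaks, at the cost of picking up the whitespace between tags, which then has to be filtered out. The name extract_text is hypothetical.

```python
import re

SAMPLE = '''<transcript>
  <text start="9.75" dur="5.94">welcome to about my property here you
can learn more about how your property</text>
  <text start="20.4" dur="1.3">others in your neighborhood</text>
</transcript>'''

# The pattern as given: "." stops at a newline, so the two-line
# entry is skipped entirely.
SINGLE_LINE = re.compile(r"(?<=>).+?(?=<)")

# Assumed variant: "[^<]+" matches any run of characters up to the next
# "<", newlines included.
MULTI_LINE = re.compile(r"(?<=>)[^<]+(?=<)")


def extract_text(xml_source: str) -> list[str]:
    """Return the text between tags, one normalized string per match."""
    matches = MULTI_LINE.findall(xml_source)
    # Drop whitespace-only matches (indentation between tags) and
    # collapse internal line breaks to single spaces.
    return [" ".join(m.split()) for m in matches if m.strip()]


print(SINGLE_LINE.findall(SAMPLE))
print(extract_text(SAMPLE))
```

Neither pattern understands XML: a ">" inside an attribute value, a comment, or a CDATA section will produce wrong matches, so a real parser is the safer choice for anything beyond simple input like this.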
