Regex of XML with multiple tags

I'm trying to find all text that is not within the XML markup:

<transcript>
  <text start="9.75" dur="5.94">welcome to about my property here you
can learn more about how your property</text>
  <text start="15.69" dur="4.71">was assessed see the information impact
has on file and compare your property to</text>
  <text start="20.4" dur="1.3">others in your neighborhood</text>
  <text start="21.7" dur="5.32">interested in learning about market
trends in your municipality no problem</text>
  <text start="105.79" dur="6.23">I have all of this and more about life property
. see your property assessment know more</text>
  <text start="112.02" dur="0.11">about</text>
</transcript>

I am using the following regex pattern, but obviously it is not correct because it grabs all of the text between the opening and closing <transcript> tags:

<transcript>[\s\S]*?<\/transcript>

How can I modify this regex pattern to select only the text that is not within any of the markup tags?

Use XSLT. XSLT is a language designed specifically to convert XML into another output format: valid XML again, or something else such as (X)HTML or plain text. Any output format is possible, but text-based ones are preferred.

In this case the smallest XSLT necessary is just this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" >
<xsl:output method="text" indent="no" />

<xsl:template match="text">
    <!-- do NOTHING here! -->
</xsl:template>

</xsl:stylesheet>

This works because the default rule for an XML element is to recursively apply templates to the elements it contains, and plain text is copied to the output by default. The only tag inside your <transcript> is <text>, and you process it by doing 'nothing', i.e., by not copying its contents to the output. The line inside that template is just a comment.

All remaining text "nodes", in XML terminology, are those without a surrounding <text> tag, and so they are copied to the output.
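The same walk can be sketched outside XSLT. The following is only an illustration of the default behaviour described above, written in Python with the standard xml.etree.ElementTree module; the function name copy_untagged and the word "stray" in the sample are made up for this example and are not part of the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: "stray" is an invented piece of untagged text,
# added so that there is something outside <text> to copy.
SAMPLE = (
    '<transcript>\n'
    '  <text start="9.75" dur="5.94">welcome</text>stray\n'
    '  <text start="20.4" dur="1.3">others</text>\n'
    '</transcript>'
)

def copy_untagged(elem, skip=("text",)):
    """Collect the text nodes reached by walking `elem`, emitting nothing
    for elements whose name is in `skip` (the role of the empty template)."""
    if elem.tag in skip:
        return []                       # empty template: no copy, no recursion
    parts = [elem.text or ""]           # text before the first child element
    for child in elem:
        parts.extend(copy_untagged(child, skip))
        parts.append(child.tail or "")  # text after a child belongs to the parent
    return parts

untagged = "".join(copy_untagged(ET.fromstring(SAMPLE)))
print(repr(untagged.strip()))
```

As with the stylesheet, the whitespace between the elements is copied as well, which is why the result is stripped before printing.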

Alternatively, if you have more types of tags than just <text> elements and you want to skip all of them, add templates for / and transcript that simply continue processing, plus a template for * (which matches every element not handled by a more specific template) that does nothing:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" >
<xsl:output method="text" indent="no" />

<xsl:template match="/">
    <xsl:apply-templates />
</xsl:template>

<xsl:template match="transcript">
    <xsl:apply-templates />
</xsl:template>

<xsl:template match="*">
    <!-- do NOTHING here! -->
</xsl:template>

</xsl:stylesheet>

Again, the plain untagged text matches none of these templates, so the default rule copies it to the output.

Both XSLT stylesheets will output only I ha, the only part in your sample text that is not surrounded by tags.

Do you want to find

welcome to about my property here you can learn more about how your property

from

<text start="9.75" dur="5.94">welcome to about my property here you can learn more about how your property</text>

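?

Then this will work, as long as the text between the tags sits on a single line (by default . does not match a newline):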

(?<=>).+?(?=<)
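For what it's worth, here is a small sketch of how that pattern could be tried from Python. It assumes re.DOTALL is wanted, because the entries in the question span two lines, and it drops the whitespace-only matches that fall between tags. The helper names texts_by_regex and texts_by_parser are invented for this example; the second one uses the standard xml.etree.ElementTree parser as a cross-check.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the transcript from the question.
SAMPLE = (
    '<transcript>\n'
    '  <text start="9.75" dur="5.94">welcome to about my property here you\n'
    'can learn more about how your property</text>\n'
    '  <text start="20.4" dur="1.3">others in your neighborhood</text>\n'
    '</transcript>'
)

# The pattern from the answer; DOTALL lets "." cross line breaks.
LOOKAROUND = re.compile(r"(?<=>).+?(?=<)", re.DOTALL)

def texts_by_regex(xml):
    """Text between '>' and '<', skipping whitespace-only matches."""
    return [" ".join(m.split()) for m in LOOKAROUND.findall(xml) if m.strip()]

def texts_by_parser(xml):
    """The same text, read from each <text> element with an XML parser."""
    root = ET.fromstring(xml)
    return [" ".join((t.text or "").split()) for t in root.iter("text")]

print(texts_by_regex(SAMPLE))
print(texts_by_parser(SAMPLE))
```

Both helpers return the same two strings for this sample. The regex is more fragile, though: it knows nothing about entities or CDATA, and it can pick up a tag when two tags sit directly next to each other, so the parser version is probably the safer choice for real input.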
