Word Frequency Counter in XSLT

I am trying to make a word frequency counter in XSLT that ignores stop words. I got started with Michael Kay's book, but I have trouble getting the stop words to work.

This code will work on any source XML file.

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="/">   
    <xsl:variable name="stopwords" select="'a about an are as at be by for from how I in is it of on or that the this to was what when where who will with'"/>
     <wordcount>
        <xsl:for-each-group group-by="." select="
            for $w in //text()/tokenize(., '\W+')[not(.=$stopwords)] return $w">
            <word word="{current-grouping-key()}" frequency="{count(current-group())}"/>
        </xsl:for-each-group>
     </wordcount>
</xsl:template>

</xsl:stylesheet>

I think the not(.=$stopwords) is where my problem is. But I'm not sure what to do about it.

I would also welcome hints on how to load the stop words from an external file.

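Your $stopwords variable is currently a single string, so .=$stopwords compares each word against the whole string and never matches. You want it to be a sequence of strings, so that the comparison is true when the word equals any one item. You can do this in any of the following ways: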

  • Change its declaration to

     <xsl:variable name="stopwords" select="('a', 'about', 'an', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'how', 'I', 'in', 'is', 'it', 'of', 'on', 'or', 'that', 'the', 'this', 'to', 'was', 'what', 'when', 'where', 'who', 'will', 'with')"/> 
  • Change its declaration to

     <xsl:variable name="stopwords" select="tokenize('a about an are as at be by for from how I in is it of on or that the this to was what when where who will with', '\s+')"/> 
  • Read it from an external XML document named (e.g.) stoplist.xml, of the form

     <stop-list>
       <p>This is a sample stop list [further description ...]</p>
       <w>a</w>
       <w>about</w>
       ...
     </stop-list>

    and then load it, e.g. with

     <xsl:variable name="stopwords" select="document('stoplist.xml')//w/string()"/>
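
Putting the second option into the original stylesheet might look like the sketch below. It assumes an XSLT 2.0 processor. The extra [. ne ''] predicate is not part of the question or the answer; it is added here on the assumption that the empty tokens tokenize() produces when a text node starts or ends with non-word characters should not be counted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<!-- Stop words as a sequence of strings: one item per word. -->
<xsl:variable name="stopwords" select="tokenize('a about an are as at be by for from how I in is it of on or that the this to was what when where who will with', '\s+')"/>

<xsl:template match="/">
     <wordcount>
        <!-- [. ne ''] drops empty tokens (an assumption, see above).
             not(. = $stopwords) is true only when the token equals
             none of the items in the sequence. -->
        <xsl:for-each-group group-by="." select="
            //text()/tokenize(., '\W+')[. ne ''][not(. = $stopwords)]">
            <word word="{current-grouping-key()}" frequency="{count(current-group())}"/>
        </xsl:for-each-group>
     </wordcount>
</xsl:template>

</xsl:stylesheet>
```

The comparison is case-sensitive, so 'The' at the start of a sentence is still counted even though 'the' is a stop word. To load the list from a file instead, the variable declaration can be swapped for the document() form shown in the third option.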

You are comparing the current word with the entire string of stop words. Instead, you should check whether the current word is contained in that string:

not(contains(concat(' ',$stopwords,' '),concat(' ',.,' ')))

The space on each side is needed to avoid partial matches, e.g. to prevent 'abo' or 'bout' from matching 'about'.
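
For context, here is a sketch of how the string-containment test could sit in the original stylesheet while keeping $stopwords as a single space-separated string. The variable name padded is hypothetical, and the [. ne ''] predicate is an added assumption to skip empty tokens. Both the list and the word are wrapped in spaces on both sides so that only whole words match.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<!-- Stop words kept as one space-separated string, as in the question. -->
<xsl:variable name="stopwords" select="'a about an are as at be by for from how I in is it of on or that the this to was what when where who will with'"/>

<!-- Hypothetical helper: the same string with a space added at each end. -->
<xsl:variable name="padded" select="concat(' ', $stopwords, ' ')"/>

<xsl:template match="/">
     <wordcount>
        <!-- A token is kept only if ' token ' does not occur in the padded list. -->
        <xsl:for-each-group group-by="." select="
            //text()/tokenize(., '\W+')[. ne ''][not(contains($padded, concat(' ', ., ' ')))]">
            <word word="{current-grouping-key()}" frequency="{count(current-group())}"/>
        </xsl:for-each-group>
     </wordcount>
</xsl:template>

</xsl:stylesheet>
```

This relies on the stop words being separated by exactly one space each, which holds for the string in the question.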
