How to scrape the first paragraph from a Wikipedia page?

Question

Let's say I want to grab the first paragraph of this Wikipedia page. How do I get the main text between the title and the contents box using XPath or DOM with PHP, or something similar?

Is there a PHP library for that? I don't want to use the API because it's a bit complex.

Note: I just need this to add a widget under my pages that displays related info from Wikipedia.

Answer 1

Use the following XPath expression:

/*/h:body//h:h1
  |
   /*/h:body//h:h1/following::node()
      [count(. | //h:table[@id='toc']
                  /preceding::node()
             )
      =
       count(//h:table[@id='toc']
                  /preceding::node()
             )
       ]

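Here the prefix h: is bound to the XHTML namespace ("http://www.w3.org/1999/xhtml"). The predicate keeps a node only if adding it to the set of nodes that precede the table of contents (the table with id 'toc') does not change the size of that set, that is, only if the node itself precedes the table. The result is therefore the h1 title plus every node after it and before the contents box.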
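To make the selection logic concrete, here is a minimal Python sketch of the same idea: keep the title plus everything that follows it and precedes the contents box. It is not the answer's method, only an illustration. The standard-library ElementTree module does not support the following:: and preceding:: axes, so the sketch walks the elements in document order instead, and it covers element nodes only (not the text and comment nodes that node() also matches). The SAMPLE document and the function name nodes_from_h1_to_toc are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical, heavily reduced stand-in for the article's XHTML.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="content">
      <h1>Title</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
      <table id="toc"><tr><td>Contents</td></tr></table>
      <p>After the contents box.</p>
    </div>
  </body>
</html>"""


def nodes_from_h1_to_toc(root):
    """Return the h1 plus every element after it and before table#toc."""
    h1 = root.find(".//h:h1", NS)
    toc = root.find(".//h:table[@id='toc']", NS)

    order = list(root.iter())                       # elements in document order
    pos = {id(e): i for i, e in enumerate(order)}   # element -> position
    parent = {id(c): p for p in order for c in p}   # child -> parent

    # following:: excludes descendants of the context node (the h1).
    h1_descendants = {id(e) for e in h1.iter()} - {id(h1)}

    # preceding:: excludes ancestors of the context node (the table).
    toc_ancestors = set()
    node = toc
    while id(node) in parent:
        node = parent[id(node)]
        toc_ancestors.add(id(node))

    between = [
        e for e in order
        if pos[id(h1)] < pos[id(e)] < pos[id(toc)]
        and id(e) not in h1_descendants
        and id(e) not in toc_ancestors
    ]
    return [h1] + between


result = nodes_from_h1_to_toc(ET.fromstring(SAMPLE))
for element in result:
    print(element.tag, element.text)
```

On the sample above this yields the h1 and the two paragraphs that come before the table, and skips the paragraph after it.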

The following XSLT transformation confirms that the expression selects the wanted nodes:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:h="http://www.w3.org/1999/xhtml"
 >
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/h:body//h:h1
  |
   /*/h:body//h:h1/following::node()
      [count(. | //h:table[@id='toc']
                  /preceding::node()
             )
      =
       count(//h:table[@id='toc']
                  /preceding::node()
             )
       ]
  "/>
 </xsl:template>
</xsl:stylesheet>

When run on the XHTML document of the Wikipedia article (you also need to define the two entities &nbsp; and &reg; for this document), the wanted result is produced.
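One possible way to define the two entities is an internal DTD subset placed in front of the document, sketched here in Python with the standard library. This assumes the entities in question are nbsp and reg, and that the XHTML text carries no XML declaration or DOCTYPE of its own (otherwise the existing DOCTYPE would have to be replaced rather than a new one prepended). The helper name parse_xhtml_with_entities and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring the two entities (assumed: nbsp and reg).
DOCTYPE = (
    "<!DOCTYPE html [\n"
    '  <!ENTITY nbsp "&#160;">\n'
    '  <!ENTITY reg "&#174;">\n'
    "]>\n"
)


def parse_xhtml_with_entities(xhtml_text):
    """Parse XHTML text that uses &nbsp; and &reg; without declaring them.

    Assumes xhtml_text has no XML declaration or DOCTYPE of its own.
    """
    return ET.fromstring(DOCTYPE + xhtml_text)


# Hypothetical fragment; a plain XML parser would reject it without the DTD.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Acme&reg;&nbsp;Inc.</p></body></html>"
)

root = parse_xhtml_with_entities(sample)
p = root.find(".//{http://www.w3.org/1999/xhtml}p")
print(repr(p.text))
```

Without the declarations, the parser raises an error for the undefined entities, which is why the answer calls for defining them before running the transformation.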

1 answers

solution1
0 2010-05-10 02:25:17
