
How to scrape the first paragraph from a Wikipedia page?

Let's say I want to grab the first paragraph of this Wikipedia page. How do I get the main text between the title and the contents box using XPath, or DOM and PHP, or something similar?

Is there a PHP library for that? I don't want to use the API because it's a bit complex.

Note: I just need this to add a widget at the bottom of my pages that displays related info from Wikipedia.

Use the following XPath expression:

/*/h:body//h:h1
  |
   /*/h:body//h:h1/following::node()
      [count(. | //h:table[@id='toc']
                  /preceding::node()
             )
      =
       count(//h:table[@id='toc']
                  /preceding::node()
             )
       ]

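Here the prefix h: is bound to the XHTML namespace ("http://www.w3.org/1999/xhtml"). The predicate keeps a node only if adding it to the set of nodes that precede the contents table (the table with id 'toc') does not enlarge that set, that is, only if the node itself precedes the table. The expression therefore selects the h1 title plus every node that follows the title and precedes the contents table.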
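
To make the selection logic easier to follow, here is a rough Python sketch of the same idea using only the standard library. It is not the XPath engine's implementation: xml.etree.ElementTree does not support the following:: and preceding:: axes, so the helper functions `following`, `preceding` and `h1_to_toc` are hypothetical names that emulate them for element nodes only (text and comment nodes are ignored). The `SAMPLE` document is an invented, heavily simplified stand-in for the article's XHTML.

```python
import xml.etree.ElementTree as ET

NS = {"h": "http://www.w3.org/1999/xhtml"}

# Invented miniature page; the real article is far larger.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Title</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <table id="toc"><tr><td>Contents</td></tr></table>
    <p>After the contents box.</p>
  </body>
</html>"""


def following(root, node):
    """Elements after `node` in document order, excluding its descendants."""
    order = list(root.iter())
    subtree = set(node.iter())  # the node itself plus its descendants
    return [e for e in order[order.index(node) + 1:] if e not in subtree]


def preceding(root, node):
    """Elements before `node` in document order, excluding its ancestors."""
    parent = {child: p for p in root.iter() for child in p}
    ancestors = set()
    current = node
    while current in parent:
        current = parent[current]
        ancestors.add(current)
    order = list(root.iter())
    return [e for e in order[:order.index(node)] if e not in ancestors]


def h1_to_toc(root):
    """The h1, plus the elements that follow it and precede table#toc."""
    h1 = root.find(".//h:body//h:h1", NS)
    toc = root.find(".//h:table[@id='toc']", NS)
    before_toc = set(preceding(root, toc))
    # Membership in `before_toc` plays the role of the count(. | S) = count(S) test.
    return [h1] + [e for e in following(root, h1) if e in before_toc]


root = ET.fromstring(SAMPLE)
print([e.tag.split("}")[1] for e in h1_to_toc(root)])  # ['h1', 'p', 'p']
```

The paragraph after the contents table is dropped because it is not in the "preceding the table" set, which is exactly what the count() comparison checks in the XPath version.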

This transformation shows that the expression really produces the wanted result:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:h="http://www.w3.org/1999/xhtml"
 >
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/h:body//h:h1
  |
   /*/h:body//h:h1/following::node()
      [count(. | //h:table[@id='toc']
                  /preceding::node()
             )
      =
       count(//h:table[@id='toc']
                  /preceding::node()
             )
       ]
  "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is run on the XHTML document of the Wikipedia article (you also need to define the two entities &nbsp; and &reg; for this document), the wanted result is produced.
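
One possible way to define those two entities, sketched in Python with the standard library, is to put an internal DTD subset in front of the markup before parsing. This assumes the saved document has no DOCTYPE of its own (a second DOCTYPE would be a parse error); `ENTITY_DOCTYPE`, `parse_with_entities` and `SNIPPET` are hypothetical names and data used only for illustration.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring the two entities mentioned above.
ENTITY_DOCTYPE = """<!DOCTYPE html [
  <!ENTITY nbsp "&#160;">
  <!ENTITY reg "&#174;">
]>
"""

# Invented fragment standing in for the article's XHTML.
SNIPPET = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Foo&nbsp;Bar&reg;</p></body></html>"
)


def parse_with_entities(xhtml_text):
    """Parse XHTML text (assumed to carry no DOCTYPE) with nbsp and reg declared."""
    return ET.fromstring(ENTITY_DOCTYPE + xhtml_text)


root = parse_with_entities(SNIPPET)
paragraph = root.find(".//{http://www.w3.org/1999/xhtml}p")
print(repr(paragraph.text))  # the entities arrive as U+00A0 and U+00AE
```

The same DOCTYPE block can be pasted at the top of the XHTML file by hand if you run the stylesheet with an XSLT processor instead.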


 