
How do I collect the h1 headings of a number of web pages?

I would like to go through a couple of web pages

 theURLs := #('url1' 'url2' 'url3')

and get the content of the first h1 heading

theURLs collect: [ :anURL |  page := HTTPClient httpGetDocument: anURL.
                             page firstH1heading].

Question

What do I need to put in place of #firstH1heading?

Answers for Squeak / Pharo / Cuis are welcome.

Note

In Squeak

HTTPClient httpGetDocument: 'http://pharo.org/'

gives back a

MIMEDocument

So I would expect to do something like

theURLs collect: [ :anURL |  page := HTMLDocument on: 
                                     (HTTPClient httpGetDocument: anURL).
                             page firstH1heading].

But in Squeak 4.6 there is no HTMLDocument class, though it seems there used to be one (http://wiki.squeak.org/squeak/2249). The Wiki says that I should load a package Network-HTML. The SqueakMap catalog of Squeak 4.6 has a package 'XMLParser-HTML'. Can this be used instead?

In Pharo, you can use the Soup package. Install it via the Configuration Browser.

You retrieve a document from a URL with Zinc, and find the first <h1> tag with Soup like this:

|contents soup body|
contents := ZnClient new get: 'http://zn.stfx.eu/zn/small.html'.
soup := Soup fromString: contents.
body := soup body.
body findTag: 'h1'
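
To tie this back to the loop in the question, the same two calls can be wrapped in collect:. What follows is a minimal, untested sketch rather than a definitive answer. It assumes that Soup is loaded in a Pharo image, that findTag: answers nil when a page has no <h1>, and that the found tag responds to text with its textual content. The two URLs are simply the ones mentioned in this thread.

```smalltalk
"Sketch only: assumes Soup is loaded and both URLs are reachable."
| theURLs headings |
theURLs := #('http://zn.stfx.eu/zn/small.html' 'http://pharo.org/').
headings := theURLs collect: [ :anURL |
	| contents heading |
	"Fetch the page source as a String with Zinc."
	contents := ZnClient new get: anURL.
	"Parse it and look for the first <h1> anywhere in the document."
	heading := (Soup fromString: contents) findTag: 'h1'.
	"Assumption: findTag: answers nil if there is no <h1>,
	 and #text answers the text inside the tag."
	heading ifNil: [ nil ] ifNotNil: [ :tag | tag text ] ].
headings
```

Evaluating this with "Inspect it" should show one entry per URL, with nil for any page that has no h1 heading, provided the assumptions above hold in your Soup version.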

I've updated the configuration. You might need to refresh the catalog.

Name: ConfigurationOfSoup-StephanEggermont.75
Author: StephanEggermont
Time: 14 December 2015, 1:39:52.307715 pm
UUID: 6c11fb83-5299-4852-9563-73ecc34992a0
Ancestors: ConfigurationOfSoup-FrancoisStephany.74

Adopted bug fix to stable 1.7.1 , added Pharo 5 versions
