How to download txt web page content using Perl

Question

I am trying to download data from this data page. I have tried a number of scripts I found through Google. On the data page I have to select the countries I want, one at a time. The script that comes closest to what I want is:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
my $file = 'Zamb.txt';
getstore($url, $file);

However, this script gives me the page, not the data. I would appreciate help downloading the data, if this is possible. I would also be happy to do it in PHP if that is an easier alternative.

Answer 1

The link returns text wrapped in HTML. The simplest approach would be to use HTML::FormatText and HTML::TreeBuilder to get the text-only version.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::FormatText;


my $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
my $text = HTML::FormatText->new(leftmargin=>0, rightmargin=>100000000000)->format(HTML::TreeBuilder->new_from_url($url));

my $file = 'Zamb.txt';
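# Write the extracted text to disk, failing loudly on I/O errors
open(my $fh, '>', $file) or die "Cannot open $file: $!";
print $fh $text;
close($fh) or die "Cannot close $file: $!";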

HTML::TreeBuilder->new_from_url($url) - download and parse the HTML
HTML::FormatText->new(leftmargin=>0, rightmargin=>100000000000) - initialize the HTML formatter; the right margin is set to a large value to prevent wrapping
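
To try the two calls described above without hitting the network, here is a minimal sketch that runs the same formatter on an in-memory string. The sample HTML (including the assumption that the report sits inside a pre element) and the helper name html_to_text are hypothetical and not taken from the real page; new_from_content is used in place of new_from_url, and a smaller right margin is used for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical sample standing in for the downloaded page.
my $html = <<'HTML';
<html><body>
<pre>
# latest SYNOP reports
202002291200 AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201
</pre>
</body></html>
HTML

# Parse an HTML string and return its text-only rendering.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 1000)
                               ->format($tree);
    $tree->delete;    # release the parse tree
    return $text;
}

print html_to_text($html);
```

The same helper should work on the real response body if it is fetched separately, for example with LWP::Simple's get.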

This is the content of Zamb.txt afterwards.

 $ cat Zamb.txt
##########################################################
# Query made at 02/29/2020 18:15:54 UTC
##########################################################

##########################################################
# latest SYNOP reports from Zambia before 02/29/2020 18:15:54 UTC
##########################################################
202002291200 AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201
                   333 5//// 85850 83080=

My PHP fu isn't up to date, but for PHP, I think you can use the following:

<?php
$url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
$content = strip_tags(file_get_contents($url));
echo substr($content, strpos($content, '###############'));

Note: I seem to recall that there are some configuration options that might disable fetching URLs via file_get_contents, so YMMV.

However, on the same page there is a note:

NOTE: If you want to get simply files with synop reports in CSV format without HTML tags consider to use the binary getsynop

This would get you the same data in an easy-to-use format:

$ wget "https://www.ogimet.com/cgi-bin/getsynop?begin=$(date +%Y%m%d0000)&state=Zambia" -o /dev/null -O - | tail -1
67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201 333 5//// 85850 83080=
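
Since the question asked for Perl, here is a minimal sketch of splitting one such CSV line into fields. The field meanings (station, year, month, day, hour, minute, report) are an assumption inferred from the sample line above, not from any getsynop documentation, and parse_synop_line is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one getsynop CSV line into named fields.
# Field names are assumptions inferred from the sample output.
sub parse_synop_line {
    my ($line) = @_;
    chomp $line;
    # Limit the split to 7 fields so the report text stays in one piece
    # even if it happens to contain a comma.
    my ($station, $year, $month, $day, $hour, $minute, $report) =
        split /,/, $line, 7;
    return {
        station => $station,
        time    => sprintf('%04d-%02d-%02d %02d:%02d',
                           $year, $month, $day, $hour, $minute),
        report  => $report,
    };
}

my $sample = '67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201 333 5//// 85850 83080=';
my $rec = parse_synop_line($sample);
print "$rec->{station} $rec->{time}\n$rec->{report}\n";
```

In practice the lines could be read from the wget output on standard input instead of the hard-coded sample.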

Answer 1 (accepted), 2020-02-29 19:01:38
