
Get time in HTML tags using curl and grep/sed/awk

I'm trying to extract just the arrival times from this web page. I'm running this in Terminal on OS X 10.9.5.

http://www.flyokc.com/Arrivals.aspx

I've come as far as isolating just the tags:

curl 'www.flyokc.com/arrivals.aspx' | grep 'labelTime'

However, I'm terrible at regex, so I haven't figured out how to grab just the times from these tags. Any thoughts on how I can do that?

Eventually, I'd like to group them by hour of the day and display the number of arrivals per hour, in descending order.

Parsing HTML/XML with regex is bad. That being said, this seems to work at the moment for your use case:

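gawk '
BEGIN {
    # Traverse the array in ascending numeric key order (gawk extension).
    PROCINFO["sorted_in"] = "@ind_num_asc"
    # Split on HTML delimiters, spaces, and the HH:MM colon.
    FS = "[<>: ]+"
}
/labelTime/ && /ContentPlaceHolderMain/ {
    h = $4 % 12                 # 12 AM -> 0; 12 PM becomes 12 below
    if ($6 == "PM") h += 12     # PM hours map to 12..23
    a[h] += 1
}
END {
    for (h in a)
        print h, a[h]
}' <(curl 'www.flyokc.com/arrivals.aspx' 2>/dev/null)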

Edit: An account of what works and why:

  • Set the field separator to the HTML delimiters, spaces, and the HH:MM separator.

  • Then grab the fourth field, the hour (so this is only loosely the regex you asked for...).

  • If the sixth field is "PM", add 12 to the hour (you want to sort numerically in the end), then increment the count for that hour.

  • After processing the input, display the results. Because the array traversal order has been set to sort numerically on the keys, no external sort command is necessary.
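
The field splitting described above can be checked on a few lines without touching the network. What follows is a minimal sketch, not the answer's own method: it assumes the markup looks like the hypothetical sample lines below (the real page's `id` attributes and indentation may differ, which would shift the field numbers), and `count_by_hour` is an illustrative name. It avoids the gawk-only `PROCINFO["sorted_in"]` setting and pipes to `sort` instead, so it should also run with a stock POSIX awk, and it lists the busiest hours first, as the question eventually wants.

```shell
# count_by_hour: read HTML on stdin, print "<count> <hour24>" lines,
# busiest hour first. Assumes times look like ">12:47 PM<" inside a
# labelTime span (hypothetical markup; adjust the field numbers if not).
count_by_hour() {
    awk -F'[<>: ]+' '
    /labelTime/ {
        h = $4 % 12                  # 12 AM -> 0; 12 PM becomes 12 below
        if ($6 == "PM") h += 12      # 1 PM -> 13, 12 PM -> 12
        count[h]++
    }
    END { for (h in count) print count[h], h }' |
    sort -k1,1nr -k2,2n              # by count descending, then hour ascending
}

# Hypothetical sample input standing in for the curl output.
sample='  <span id="ContentPlaceHolderMain_labelTime_0">12:47 PM</span>
  <span id="ContentPlaceHolderMain_labelTime_1">1:05 PM</span>
  <span id="ContentPlaceHolderMain_labelTime_2">1:30 PM</span>
  <span id="ContentPlaceHolderMain_labelTime_3">12:10 AM</span>'

printf '%s\n' "$sample" | count_by_hour
```

With the sample above, hour 13 comes first with a count of 2, followed by hours 0 and 12 with one arrival each. To try it on the live page, replace the `printf` with the `curl` command from the answer.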

If you're simply looking to grab the arrival times, such as 12:00 PM, awk with curl should work:

curl -s 'http://flyokc.com/arrivals.aspx' | awk '/labelTime/{print substr($2,68,5),substr($3,1,2)}'

Output:

12:47 PM
...

How it works:

curl silently fetches the source of the web page, then awk takes the output and uses "labelTime" to pick out the lines that contain the arrival times. Since awk sees the entire <span> in which the time sits, substr is used to take 5 characters starting at position 68 of the second field (the HH:MM part) plus the first 2 characters of the third field (AM or PM), and the result is printed.
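
Because the `substr` offsets depend on the exact length of the markup before the time, a small change to the page would break them. As a hedged alternative (a swapped-in technique, not what the answer above does), the time can be matched by its shape with `grep -Eo`. The sample input and the name `extract_times` are hypothetical; the sketch assumes times are always written as `H:MM AM` or `HH:MM PM`.

```shell
# extract_times: read HTML on stdin, print one "H:MM AM/PM" time per line.
# Matches the time by shape instead of by character position.
extract_times() {
    grep 'labelTime' | grep -Eo '[0-9]{1,2}:[0-9]{2} [AP]M'
}

# Hypothetical sample input standing in for the curl output.
sample='<span id="ContentPlaceHolderMain_labelTime_0">12:47 PM</span>
<span id="ContentPlaceHolderMain_labelFlight_0">AA 1234</span>
    <span id="ContentPlaceHolderMain_labelTime_1">1:05 AM</span>'

printf '%s\n' "$sample" | extract_times
```

Against the live page this would be used as `curl -s 'http://flyokc.com/arrivals.aspx' | extract_times`, assuming the page still serves the times in that format.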
