Data scraping using wget and regex

Question

I'm just learning bash scripting, and I was trying to scrape some data from a site, mostly Wiktionary. This is what I'm trying on the command line right now, but it is not returning any results:

wget -qO- http://en.wiktionary.org/wiki/robust | egrep '<ol>{[a-zA-Z]*[0-9]*}*</ol>'

What I'm trying to do is get the data between the tags; I just want it to be displayed. Can you please help me find out what I'm doing wrong?

Thanks

Answer 1

You need to send the output to stdout:

wget -q http://en.wiktionary.org/wiki/robust -O - | ...

To get all <ol> tags with grep, you can do:

wget -q http://en.wiktionary.org/wiki/robust -O - | tr '\n' ' ' | grep -o '<ol>.*</ol>'
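A small offline sketch of this pipeline may help when experimenting. The sample HTML and the function names sample_html and extract_ol are made up for illustration; the real Wiktionary markup is considerably more complex.

```shell
# Hypothetical sample page; it stands in for the wget output.
sample_html() {
  printf '%s\n' '<html><body>' '<ol>' '<li>strong</li>' '<li>sturdy</li>' '</ol>' '</body></html>'
}

# Replace newlines with spaces so that <ol> and </ol> end up on one line,
# then print only the part of the line that matches the pattern.
extract_ol() {
  tr '\n' ' ' | grep -o '<ol>.*</ol>'
}

sample_html | extract_ol
```

Note that .* is greedy: if the page contains several lists, the match runs from the first <ol> to the last </ol>, including everything in between.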

Answer 2

At a minimum, you need to:

activate extended regular expressions by adding the -E switch.
send the output from wget to stdout instead of to disk by adding the -O - option.

Honestly, I'd say grep is the wrong tool for this task, since grep works on a per-line basis and your expression stretches over several lines.
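A quick offline check illustrates the per-line behaviour. The two-line input and the names multi_line, count_multi and count_flat are invented for this demonstration:

```shell
# Two-line input: the opening and closing tags sit on different lines.
multi_line() { printf '<ol>\n</ol>\n'; }

# grep tests each line separately, so a pattern that needs both tags on
# one line matches nothing here. grep exits non-zero when nothing matches,
# hence the "|| true" to keep the script going.
count_multi=$(multi_line | grep -c '<ol>.*</ol>' || true)

# Once the newlines are replaced by spaces, the same pattern matches.
count_flat=$(multi_line | tr '\n' ' ' | grep -c '<ol>.*</ol>')

echo "multi-line: $count_multi, flattened: $count_flat"
# prints "multi-line: 0, flattened: 1"
```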

I think sed or awk would be a better fit for this task.

With sed it would look like this:

wget -O - -q http://en.wiktionary.org/wiki/robust | sed -n "/<ol>/,/<\/ol>/p"

If you want to get rid of the extra <ol> and </ol> lines, you could append:

... | grep -v -E "</?ol>"
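The sed range and the grep filter can be tried offline on a small sample, as sketched below. The sample page and the function names are hypothetical, and the awk variant is only one possible way to write the awk alternative mentioned above; it is not taken from the original answer.

```shell
# Hypothetical sample page; it stands in for the wget output.
sample_page() {
  printf '%s\n' '<p>intro</p>' '<ol>' '<li>strong</li>' '<li>sturdy</li>' '</ol>' '<p>outro</p>'
}

# Print the lines from a line containing <ol> through the next line
# containing </ol>, then drop the lines that contain the tags themselves.
list_items() {
  sed -n '/<ol>/,/<\/ol>/p' | grep -v -E '</?ol>'
}

# awk sketch with the same effect: set a flag on the <ol> line, clear it
# on the </ol> line, and print only the lines seen while the flag is set.
list_items_awk() {
  awk '/<ol>/ { inside = 1; next } /<\/ol>/ { inside = 0 } inside'
}

sample_page | list_items
sample_page | list_items_awk
```

Both variants work on whole lines, so they assume the <ol> and </ol> tags sit on lines of their own; any list content sharing a line with a tag would be dropped together with it.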


Answer 3

If I understand the question correctly, the goal is to extract the visible text content from within the ol sections. I would do it this way:

wget -qO- http://en.wiktionary.org/wiki/robust | 
  hxnormalize -x | 
  hxselect "ol" | 
  lynx -stdin -dump -nolist

[source: "Using the Linux Shell for Web Scraping"]

hxnormalize preprocesses the HTML for hxselect, which applies the CSS selector "ol". Lynx then renders the result and reduces it to the text that would be visible in a browser.

3 answers

Answer 1
4, accepted, 2011-09-09 11:55:56

Answer 2
2, 2011-09-09 11:54:49

Answer 3
1, 2014-03-02 11:51:04
