
Data scraping with wget and regex

I'm just learning Bash scripting, and I was trying to scrape some data out of a site, mostly Wiktionary. This is what I'm trying on the command line right now, but it returns no results:

wget -qO- http://en.wiktionary.org/wiki/robust | egrep '<ol>{[a-zA-Z]*[0-9]*}*</ol>'

What I'm trying to do is get the data between the tags and just have it displayed. Can you please help me figure out what I'm doing wrong?

Thanks

You need to send wget's output to stdout:

wget -q http://en.wiktionary.org/wiki/robust -O - | ...

To get all the <ol> blocks with grep, you can join the lines first:

wget -q http://en.wiktionary.org/wiki/robust -O - | tr '\n' ' ' | grep -o '<ol>.*</ol>'
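
Note that .* is greedy, so if the page contains more than one <ol> block this matches everything from the first <ol> to the last </ol>. Assuming a GNU grep built with PCRE support, the -P switch allows a non-greedy quantifier so that each block is matched separately:

wget -q http://en.wiktionary.org/wiki/robust -O - | tr '\n' ' ' | grep -oP '<ol>.*?</ol>'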

At a minimum, you need to

  • enable extended regular expressions by adding the -E switch, and
  • send wget's output to stdout instead of to disk by adding the -O - option,

as shown in the sketch after this list.
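
Putting both fixes together gives something like this (a sketch only; the pattern is deliberately simplified, since matching HTML with regular expressions is fragile):

wget -q -O - http://en.wiktionary.org/wiki/robust | grep -E -o '<ol>.*</ol>'

This still only matches when <ol> and </ol> happen to land on the same line, which is exactly the limitation discussed next.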

Honestly, I'd say grep is the wrong tool for this task, since grep works on a per-line basis, and your expression stretches over several lines.

I think sed or awk would be a better fit for this task.

With sed it would look like this:

wget -O - -q http://en.wiktionary.org/wiki/robust | sed -n "/<ol>/,/<\/ol>/p"

If you want to get rid of the extra <ol> and </ol> lines, you could append

... | grep -v -E "</?ol>"
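
Since awk was mentioned as an alternative, here is a minimal sketch of the same idea with an awk range pattern, which prints every line between a match of /<ol>/ and a match of /<\/ol>/:

wget -O - -q http://en.wiktionary.org/wiki/robust | awk '/<ol>/,/<\/ol>/'

The same grep -v filter can be appended to strip the tag lines themselves.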


If I understand the question correctly, the goal is to extract the visible text content from within the <ol> sections. I would do it this way:

wget -qO- http://en.wiktionary.org/wiki/robust | 
  hxnormalize -x | 
  hxselect "ol" | 
  lynx -stdin -dump -nolist

[source: "Using the Linux Shell for Web Scraping"]

hxnormalize preprocesses the HTML code for hxselect, which applies the CSS selector "ol". Lynx then renders the result and reduces it to what would be visible in a browser.
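
If you also want to drop the <ol> wrapper tags before rendering, hxselect's -c flag prints only the contents of the matched elements (a minimal variant, assuming the tools come from the html-xml-utils package, e.g. apt-get install html-xml-utils on Debian/Ubuntu):

wget -qO- http://en.wiktionary.org/wiki/robust | 
  hxnormalize -x | 
  hxselect -c "ol" | 
  lynx -stdin -dump -nolist

Without the surrounding <ol>, Lynx may no longer number the list items, so keep the wrapper if the numbering matters.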
