
Filter out HTML code with grep

I am working on a project using a bash shell script. The idea is to grep a wget-retrieved page in order to pick out a certain paragraph on the web page. The area I would like to copy usually starts with a

<p><b>

but the paragraph also contains other bits of HTML code, such as anchor tags, that I don't want in the output of the grep.
I have tried

cat page.html | grep "<p><b>" > grep.txt

and then I grep the output file, which now contains the paragraph I want

cat grep.txt | grep -v '<p>|<b>|<a>' > grep.txt

but all that does is clear everything from the file without reading anything. How can I get it to exclude only the HTML code?
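(Two things go wrong here: `> grep.txt` truncates the very file the pipeline is still reading from, and plain grep treats `|` as a literal character rather than alternation, which would need grep -E. Even then, grep -v removes whole lines, not tags within a line. A minimal sketch of one way to strip the tags instead, assuming the tags never span lines:

grep '<p><b>' page.html | sed 's/<[^>]*>//g' > text.txt

This deletes every complete <...> tag from the matching lines while keeping their text.)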

I am also trying to follow the links that are in the paragraph that I grep, in order to do the same thing with those pages. Only 2 levels deep, so the main page and then whatever sub-page(s) stem from the first paragraph of the main page. I know this is a difficult idea; hopefully I explained it well enough to get some help. If you have any ideas, any help is appreciated.

Do you have to do this in bash? It seems to me that Python would lend itself to this problem, in particular a library called Beautiful Soup.

I've used this for parsing HTML in the past and it's the easiest tool I could find. It has good documentation for dealing with HTML.

Perhaps you could write a standalone Python script that extracts the text and then echoes the string you're after. The Python script could then be called from inside your bash script if you have some bash functions you want to perform on the string.
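As a minimal sketch of that idea, assuming the target is the first <p> that contains a <b> tag (a looser test than "starts with", but close), that the page URL is passed as an argument, and that the requests and beautifulsoup4 packages are installed (the script name extract.py is made up):

# extract.py (hypothetical name): print the first bold-led paragraph's text
import sys
import requests
from bs4 import BeautifulSoup

url = sys.argv[1]
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Find the first <p> that contains a <b> tag
for p in soup.find_all("p"):
    if p.find("b") is not None:
        # get_text() drops all markup (<b>, anchors, ...) and keeps the text
        print(p.get_text())
        # Print the paragraph's links on stderr so the calling bash script
        # can feed them into a second, one-level-deeper pass
        for a in p.find_all("a", href=True):
            print(a["href"], file=sys.stderr)
        break

The bash script could then do something like text=$(python3 extract.py "$url" 2> links.txt) and loop over links.txt to repeat the extraction one level down, which would cover the two-level crawl from the question.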

I know this is 7 years old, but I'm just posting the solution I have with bash:

https://api.jquery.com/jquery.grep/
