
Parsing a site using Regex in Python

I am trying to use regex to parse a site for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah 

(there are many of these, and I want all of them in some tokenized form). The problem is that "a  href" actually has TWO spaces, not just one (there are some that are "a href" with one space that I do NOT want to retrieve), so using lxml has proven to be quite a pain and I do not want to use BeautifulSoup (for other reasons). Does anyone know how I might go about doing this?

Thanks!

Depending on the level of robustness you want, you can fetch the tag in a first pass and store it, then replace "  " (two spaces) with " " (one space) for as long as your string contains "  ". This effectively collapses any run of multiple spaces in your string.
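A minimal sketch of that idea, assuming the page is already in a string; the sample HTML and the helper name collapse_spaces are made up for illustration:

```python
import re

def collapse_spaces(text):
    # Keep replacing two spaces with one until no double space is left.
    while "  " in text:
        text = text.replace("  ", " ")
    return text

html = 'blah <a  href="THIS IS WHAT I WANT" title="NOT THIS">text</a> blah'

# First pass: grab and store each opening tag that has two spaces after "<a".
tags = re.findall(r'<a  href="[^"]*"[^>]*>', html)

# Second pass: normalize the spacing inside each stored tag.
normalized = [collapse_spaces(t) for t in tags]
print(normalized)
```

Selecting the two-space tags before normalizing matters here: collapsing spaces over the whole page first would make them indistinguishable from the one-space anchors you want to skip.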

Note that using regex to parse HTML is not recommended =)

Don't be intimidated by the answer that gets linked every time someone asks this kind of question. It seems to be treated as a page of catechism, cited semi-automatically by plenty of people. However, in programming as in everyday life, there is the catechism, and there is what we actually do in practice.
Personally, while I don't think HTML can be parsed entirely with regex, I do think that limited analysis of certain parts of HTML can be done with regex. That's a pragmatic point of view.
And I do perform such analyses of web pages with regex. There are sometimes problems, but a developer can manage them. Regexes are fast. I once measured Beautiful Soup as 10 times slower than a regex, and lxml as around 50 times slower.
I'm fairly experienced at fetching web data with regexes; if you would like some hints, I can give some. My email is on my page.

I believe this answers your question. It is just a couple of regular expressions that get all of the hrefs in anchors where exactly two spaces separate the opening '<a' from 'href'.

import re

with open("index.html", "r") as fh:
    rawString = fh.read()   # read entire file to string

# match only anchors with two spaces between '<a' and 'href'
temp = re.findall(r'<a  href=".*?"', rawString)
if temp:
    for i in range(len(temp)): # process each match
        temp[i] = re.search(r'".*?"', temp[i]).group(0) # drop '<a  href=', keep the quoted value
    print(temp)
else:
    print("Not found")

For your example the output is:

['"THIS IS WHAT I WANT"']
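If you would rather get the values without the surrounding quotes, a capture group can do both steps in one pattern. This is only a sketch: the sample string and the name HREF_TWO_SPACES are invented for the example, and it assumes the href value is always double-quoted.

```python
import re

html = (
    'blahblahblah\n'
    '<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE</a>\n'
    '<a href="ONE SPACE, SKIP ME">nope</a>\n'
    'blahblahblah\n'
)

# " {2}" matches exactly two spaces; the parentheses capture only the href value.
HREF_TWO_SPACES = re.compile(r'<a {2}href="([^"]*)"')

links = HREF_TWO_SPACES.findall(html)
print(links)  # ['THIS IS WHAT I WANT']
```

When the pattern has a single group, findall returns just the captured text, so the second re.search pass is not needed.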
