
Extract a specific line from an HTML request into a variable

Here's what I'm trying to do:

  • Create a GET request to load the HTML source

  • Search the source for a string and, if it is found, extract the whole line containing it into a variable

I've searched everywhere for how to do this, but the answers I found only explain how to extract the whole source or how to use a dictionary.

For example, using the WWE Page:

Source: view-source: http://network.wwe.com/video/v2525697583?contextType=wwe-show&contextId=wwe_nxt_uk&contentId=300687284&watchlistAltButtonContext=series

I want to extract the line that contains this string:

http://thumbs.media.net.wwe.com/wwe/

Code:

def extract(url):
    html = requests.get(url)
    text = html.text
    word = None
    for line in text:
        if 'http://thumbs.media.net.wwe.com/wwe/' in line:
            word = line
    return word

When I run the function, it returns None, the value initially assigned to word.

NOTE: I only need the first match in the variable, not every match.

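Iterating over response.text directly yields one character at a time, so the search string is never found in line. Split the text into lines first; this should work:

import requests

def extract(url):
    response = requests.get(url)
    searchstr = 'http://thumbs.media.net.wwe.com/wwe/'
    # splitlines() yields whole lines and also handles "\r\n" line endings
    for line in response.text.splitlines():
        if searchstr in line:
            return line  # first match only
    return None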
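
The reason the original loop returned None can be checked offline. The sketch below is only an illustration: sample_html is a made-up stand-in for response.text, and the image path in it is invented.

```python
# Hypothetical stand-in for response.text; the image path is invented.
sample_html = (
    "<html>\n"
    '<meta property="og:image" content="http://thumbs.media.net.wwe.com/wwe/example.jpg">\n'
    "</html>\n"
)
searchstr = "http://thumbs.media.net.wwe.com/wwe/"

# Iterating over a str yields single characters, so a multi-character
# search string can never be "in" any of them.
char_hits = [ch for ch in sample_html if searchstr in ch]

# Iterating over splitlines() yields whole lines.
line_hits = [line for line in sample_html.splitlines() if searchstr in line]

print(char_hits)     # []
print(line_hits[0])  # the <meta ...> line
```

The first list stays empty for any search string longer than one character, which is why word was never reassigned.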

Or, shorter:

import requests

def extract(url, searchstr):
    lines = requests.get(url).text.split("\n")
    # next() returns the first matching line, or None if there is none
    return next((line for line in lines if searchstr in line), None)

print(extract('http://www.url.com', 'http://thumbs.media.net.wwe.com/wwe/'))
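
Since only the first match is needed, the search could also be written against any iterable of lines, so that it stops reading as soon as one line matches. This is a sketch under that assumption; first_matching_line is a hypothetical helper name, not part of requests, and the sample lines are made up.

```python
from typing import Iterable, Optional


def first_matching_line(lines: Iterable[str], searchstr: str) -> Optional[str]:
    """Return the first line containing searchstr, or None if no line does."""
    for line in lines:
        if searchstr in line:
            return line  # stop at the first match; later lines are never read
    return None


# Offline demonstration with made-up lines standing in for a page source.
lines = iter([
    "<html>",
    "first http://thumbs.media.net.wwe.com/wwe/a.jpg",
    "second http://thumbs.media.net.wwe.com/wwe/b.jpg",
])
found = first_matching_line(lines, "http://thumbs.media.net.wwe.com/wwe/")
remaining = list(lines)  # whatever the search left unread

print(found)
print(remaining)
```

With requests, the helper could be fed response.text.splitlines(). It could probably also take response.iter_lines(decode_unicode=True) from a request made with stream=True, although that yields bytes rather than str when the response declares no encoding.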

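Or with a regex (the search string must be escaped, otherwise each dot in it matches any character):

import re
import requests

def extract(url, searchstr):
    # re.escape() makes the dots in searchstr match literally
    pattern = rf"^(.*{re.escape(searchstr)}.*)$"
    match = re.search(pattern, requests.get(url).text, re.MULTILINE)
    return match.group(1) if match else None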

print(extract('http://www.url.com', 'http://thumbs.media.net.wwe.com/wwe/'))
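
One caveat with the regex approach is that the search string contains dots, which a regex treats as wildcards. The sketch below compares an unescaped and an escaped pattern on made-up text; the "decoy" line is invented to show the difference.

```python
import re

searchstr = "http://thumbs.media.net.wwe.com/wwe/"

# Made-up text: the first line only resembles the search string,
# with "X" where the dots should be.
text = (
    "decoy http://thumbsXmediaXnetXwweXcom/wwe/ line\n"
    "real http://thumbs.media.net.wwe.com/wwe/ line\n"
)

# Unescaped: each "." in searchstr matches any character, including "X".
loose = re.search(rf"^(.*{searchstr}.*)$", text, re.MULTILINE)

# Escaped: re.escape() makes every character in searchstr literal.
strict = re.search(rf"^(.*{re.escape(searchstr)}.*)$", text, re.MULTILINE)

print(loose.group(1))   # the decoy line
print(strict.group(1))  # the real line
```

On a real page the unescaped pattern will usually find the same line, but escaping removes the chance of a false match.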


 