
Manually extracting portions of strings contained in a list (parsing)

I'm aware that there are modules that would make this much simpler, but assuming I am running a base install of Python (standard modules only), how would I extract the following?

I have a list. This list is the contents, line by line, of a webpage. Here is a mock-up list (unformatted) for illustration:

<script>
    link = "/scripts/playlists/1/" + a.id + "/0-5417069212.asx";
</script>

"<a href="/apps/audio/?feedId=11065"><span class="px13">Eastern Metro Area Fire</span>"

From the above strings, I need the following extracted: the feedId (11065), which is incidentally a.id in the code above, "/scripts/playlists/1/", and "/0-5417069212.asx". Bearing in mind that each of these lines is just one item in a list, how would I go about extracting that data?

Here is the full list:

contents = urllib2.urlopen("http://www.radioreference.com/apps/audio/?ctid=5586")

Pseudocode:

from urllib2 import urlopen as getpage
page_contents = getpage("http://www.radioreference.com/apps/audio/?ctid=5586")

feedID        = % in (page_contents.search() for "/apps/audio/?feedId=%")
titleID       = % in (page_contents.search() for "<span class="px13">%</span>")
playlistID    = % in (page_contents.search() for "link = "%" + a.id + "*.asx";")
asxID         = * in (page_contents.search() for "link = "*" + a.id + "%.asx";")

streamURL     = "http://www.radioreference.com/" + playlistID + feedID + asxID + ".asx"

I plan to format it so that streamURL ends up as:

http://www.radioreference.com/scripts/playlists/1/11065/0-5417067072.asx

I'd do this with regular expressions. Python's re module is great!

However, it's easier (and faster) to search a single string holding all of the page's text than to do repeated searches line by line. If you can, call read() on the file-like object you get when you open the URL, rather than readlines() (or iterating directly over the file object). If you can't do that, you can use "\n".join(list_of_strings) to get the lines back into a single string (or "".join(list_of_strings) if the lines still carry their trailing newlines, as the ones from readlines() do).
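As a small offline illustration of the joining step, here is a sketch that uses the mock-up lines from the question instead of the live page. The names list_of_strings and page_text are placeholders, and the pattern is a simplified one chosen only to show a single search spanning several lines; it is not meant as the final pattern.

```python
import re

# Hypothetical stand-in for the list of page lines (the question's mock-up).
list_of_strings = [
    '<script>',
    '    link = "/scripts/playlists/1/" + a.id + "/0-5417069212.asx";',
    '</script>',
    '<a href="/apps/audio/?feedId=11065"><span class="px13">Eastern Metro Area Fire</span>',
]

# Join the lines back into one string so one search can cross line breaks.
page_text = "\n".join(list_of_strings)

# re.DOTALL makes "." match newlines too, so ".*?" can skip from the
# script line to the anchor line, which sit on different list items.
match = re.search(r'link = "([^"]+)".*?feedId=(\d+)', page_text, re.DOTALL)
if match:
    playlist, feed = match.groups()
    print(playlist)
    print(feed)
```

Searching the items one at a time could not find this match, because the two pieces sit on different lines.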

Here's some code that works for me on your example URL:

from urllib2 import urlopen
import re

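# Read the whole page into one string instead of a list of lines.
contents = urlopen("http://www.radioreference.com/apps/audio/?ctid=5586").read()

# Groups: the playlist path before a.id, and the .asx file name after it.
playlist_pattern = r'link = "([^"]+)" \+ a\.id \+ "([^"]+\.asx)'
# Groups: the numeric feedId and the feed title inside the span.
feed_pattern = r'href="/apps/audio/\?feedId=(\d+)"><span class="px13">([^<]+)'
pattern = playlist_pattern + ".*" + feed_pattern

# re.DOTALL lets ".*" cross the newlines between the script block and the link.
playlist, asx, feed, title = re.search(pattern, contents, re.DOTALL).groups()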

streamURL = "http://www.radioreference.com" + playlist + feed + asx

print title
print streamURL

Output:

Eastern Metro Area Fire
http://www.radioreference.com/scripts/playlists/1/11065/0-5417090148.asx

It's not strictly necessary to do all the matching in one pass. You can use playlist_pattern and feed_pattern to get two parts each, if you want. It is a little more difficult to split either of the halves up though, since you'll start running into extra matches for some of the pieces (there are several identical link = "stuff" sections, for instance).
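A rough sketch of that two-pass variant, run against the question's mock-up text rather than the live page: build_stream_urls is a hypothetical helper name, and it assumes the first link = "..." match is the one wanted and that every feed on the page shares it, which may not hold on the real site.

```python
import re

playlist_pattern = r'link = "([^"]+)" \+ a\.id \+ "([^"]+\.asx)'
feed_pattern = r'href="/apps/audio/\?feedId=(\d+)"><span class="px13">([^<]+)'

def build_stream_urls(page_text):
    """Return (title, stream URL) pairs found in page_text (hypothetical helper)."""
    # Pass 1: take the first playlist prefix / .asx suffix pair.
    playlist_match = re.search(playlist_pattern, page_text)
    if playlist_match is None:
        return []
    playlist, asx = playlist_match.groups()
    # Pass 2: collect every (feedId, title) pair on the page.
    results = []
    for feed, title in re.findall(feed_pattern, page_text):
        results.append((title, "http://www.radioreference.com" + playlist + feed + asx))
    return results

# Mock-up page text taken from the question, not fetched from the site.
page_text = (
    '<script>\n'
    '    link = "/scripts/playlists/1/" + a.id + "/0-5417069212.asx";\n'
    '</script>\n'
    '<a href="/apps/audio/?feedId=11065"><span class="px13">Eastern Metro Area Fire</span>\n'
)

for title, stream_url in build_stream_urls(page_text):
    print(title)
    print(stream_url)
```

Returning an empty list when the playlist pattern is missing avoids the AttributeError that calling .groups() on a failed re.search would raise.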

