
How do I replace a specific part of a string in Python?

I am trying to scrape Good.is. The code below currently gives me the regular image (set actually_download to True to download it), but I want the higher-resolution picture. How can I replace part of a string so that I can download the high-res version? I want to change the URL http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html to http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html (only the end is different). My code is:

import os, urllib, urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser

parser = HTMLParser.HTMLParser()

# make folder.
folderName = 'Good.is'
if not os.path.exists(folderName):
    os.makedirs(folderName)


list = [] 
# Python ranges start from the first argument and iterate up to one
# less than the second argument, so we need 36 + 1 = 37
for i in range(1, 37):
    list.append("http://www.good.is/infographics/page:" + str(i) + "/sort:recent/range:all")


listIterator1 = []
# The list holds 36 page URLs (indices 0..35); range(0,37) would overrun it.
listIterator1[:] = range(0, len(list))
counter = 0


for x in listIterator1:


    soup = BeautifulSoup(urllib2.urlopen(list[x]).read())

    body = soup.findAll("ul", attrs = {'id': 'gallery_list_elements'})

    number = len(body[0].findAll("p"))
    listIterator = []
    listIterator[:] = range(0,number)        

    for i in listIterator:
        paragraphs = body[0].findAll("p")
        nextArticle = body[0].findAll("a")[2]
        text = body[0].findAll("p")[i]

        if len(paragraphs) > 0:
            #print image['src']
            counter += 1
            print counter
            print parser.unescape(text.getText())
            print "http://www.good.is" + nextArticle['href']
            originalArticle = "http://www.good.is" + nextArticle['href']
            article = BeautifulSoup(urllib2.urlopen(originalArticle).read())
            title = article.findAll("div", attrs = {'class': 'title_and_image'})
            getTitle = title[0].findAll("h1") 
            article1 = article.findAll("div", attrs = {'class': 'body'})
            articleImage = article1[0].find("p")
            betterImage = articleImage.find("a")
            articleImage1 = articleImage.find("img")
            paragraphsWithinSection = article1[0].findAll("p")
            print betterImage['href']
            if len(paragraphsWithinSection) > 1:
                articleText = article1[0].findAll("p")[1]
            else:
                articleText = article1[0].findAll("p")[0]
            print articleImage1['src']
            # getTitle is a list of <h1> tags; print the text of the first one.
            print parser.unescape(getTitle[0].getText())
            if not articleText is None:
                print parser.unescape(articleText.getText())
            print '\n'
            link = articleImage1['src']
            x += 1


            actually_download = False
            if actually_download:
                filename = link.split('/')[-1]
                urllib.urlretrieve(link, filename)

Have a look at str.replace. If that isn't general enough to get the job done, you'll need to use a regular expression (the re module, probably re.sub).

>>> str1="http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
>>> str1.replace("flash","flat")
'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html'

I think the safest and easiest way is to use a regular expression:

import re
url = 'http://www.google.com/this/is/sample/url/flash.html'
newUrl = re.sub(r'flash\.html$', 'flat.html', url)

The "$" anchors the match to the end of the string. This solution behaves correctly even in the (admittedly unlikely) event that your URL contains the substring "flash.html" somewhere other than the end, and it leaves the string unchanged (which I assume is the correct behavior) if it does not end with "flash.html".
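To make the three cases above concrete, here is a small sketch. The helper name to_flat and the example.com URLs are made up for illustration; the pattern is written as a raw string so the backslash reaches the regex engine unchanged.

```python
import re

# Hypothetical helper: swap a trailing 'flash.html' for 'flat.html'.
def to_flat(url):
    # r'...' keeps the backslash literal; '$' anchors the match at the end.
    return re.sub(r'flash\.html$', 'flat.html', url)

# Ends with flash.html: rewritten.
a = to_flat('http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html')
# 'flash.html' also appears earlier in the path: only the trailing one changes.
b = to_flat('http://example.com/flash.html/archive/flash.html')
# Does not end with flash.html: returned unchanged.
c = to_flat('http://example.com/gallery/index.html')
```

Note that "$" also matches just before a trailing newline; if that matters for your input, "\Z" anchors strictly at the end of the string.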

See: http://docs.python.org/library/re.html#re.sub

@mgilson has a good solution, but the problem is that str.replace replaces every occurrence of the string; so if the word "flash" appears elsewhere in the URL (and not just in the trailing file name), you'll get multiple replacements:

>>> s = 'hello there hello'
>>> s.replace('hello','world')
'world there world'
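Plain string methods can limit the replacement, too. The sketch below is only illustrative: replace_last is a made-up helper name, and the sample URL is invented so that it contains "flash" twice. It shows that the optional count argument of str.replace works from the left, so it does not help here, while splitting from the right with rsplit targets the last occurrence.

```python
url = 'http://example.com/flash-games/flash.html'  # invented URL with two "flash"

# count=1 replaces only the FIRST occurrence, which is the wrong one here.
first_only = url.replace('flash', 'flat', 1)

# Hypothetical helper: replace only the LAST occurrence of old in s.
def replace_last(s, old, new):
    # rsplit(old, 1) cuts s once, at the rightmost old; join re-glues with new.
    return new.join(s.rsplit(old, 1))

last_only = replace_last(url, 'flash', 'flat')
```

If old does not occur at all, rsplit returns the whole string as a single piece and the helper hands it back unchanged.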

An alternative is to replace everything after the last / with flat.html:

>>> url = 'http://www.google.com/this/is/sample/url/flash.html'
>>> url[:url.rfind('/')+1]+'flat.html'
'http://www.google.com/this/is/sample/url/flat.html'
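The slice above swaps the last path segment whatever it is called. Where that is too permissive, a guarded variant can check the ending first. This is a sketch, not part of the original answer; swap_suffix and its default arguments are names chosen here for illustration.

```python
# Hypothetical helper: only rewrite URLs that really end with old_suffix.
def swap_suffix(url, old_suffix='flash.html', new_suffix='flat.html'):
    if url.endswith(old_suffix):
        # Drop the old ending by length, then append the new one.
        return url[:-len(old_suffix)] + new_suffix
    return url  # anything else is left untouched

changed = swap_suffix('http://www.google.com/this/is/sample/url/flash.html')
unchanged = swap_suffix('http://www.google.com/this/is/sample/url/index.html')
```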

Using urlparse you can do a few bits and bobs:

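from urlparse import urlsplit, urlunsplit

s = 'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html'

url = urlsplit(s)
# Split off the last path segment ('flash.html') and keep the directory part.
head, tail = url.path.rsplit('/', 1)
# Build the path by hand: urljoin(head, 'flat.html') would drop the last
# directory, because head has no trailing slash.
new_path = head + '/flat.html'
print urlunsplit(url._replace(path=new_path))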
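One thing the URL-parsing route buys you is that a query string or fragment is left alone. The sketch below assumes Python 3, where these functions live in urllib.parse rather than urlparse; the function name with_filename and the "?ref=home" query are invented for the example.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helper: replace the last path segment of a URL.
def with_filename(url, filename):
    parts = urlsplit(url)
    # dirname() drops the last segment; join() adds the new file name.
    new_path = posixpath.join(posixpath.dirname(parts.path), filename)
    # SplitResult is a namedtuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(path=new_path))

plain = with_filename(
    'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html',
    'flat.html')
with_query = with_filename('http://example.com/a/flash.html?ref=home', 'flat.html')
```

Here only the path component is touched, so the query in the second call survives the rewrite.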
