
Retrieve audio/mp3 file from URL and save to blobstore

I am trying to save a file (audio/mp3 in this case) to the App Engine blobstore, with mixed success. Everything seems to work: a file of the right type is saved in the blobstore, but it is essentially empty (1.5kB vs. the expected 6.5kB) and so won't play. The URL in question is http://translate.google.com/translate_tts?ie=UTF-8&tl=en&q=revenues+in+new+york+were+56+million

The App Engine logs do not show anything unusual; all parts execute as expected. Any pointers would be appreciated!

import hashlib
import logging
import urllib

import webapp2
from google.appengine.api import files

class Dictation(webapp2.RequestHandler):
  def post(self):
    sentence = self.request.get('words')

    # Google Translate API cannot handle strings > 100 characters
    sentence = sentence[:100]

    # URL-encode the sentence as a q=... query parameter
    # (spaces in the sentence become the plus symbol)
    sentence = urllib.urlencode({'q': sentence})

    # Name of the MP3 file generated using the MD5 hash
    mp3_file = hashlib.md5(sentence).hexdigest()

    # Append the .mp3 extension to the hash to form the file name
    mp3_file = mp3_file + ".mp3"

    # Create the full URL
    url = 'http://translate.google.com/translate_tts?ie=UTF-8&tl=en&' + sentence

    # upload to blobstore
    mp3_file = files.blobstore.create(mime_type = 'audio/mp3', _blobinfo_uploaded_filename = mp3_file)
    mp3 = urllib.urlopen(url).read()

    with files.open(mp3_file, 'a') as f:
      f.write(mp3)

    files.finalize(mp3_file)

    blob_key = files.blobstore.get_blob_key(mp3_file)
    logging.info('blob_key identified as %s', blob_key)

The problem has nothing to do with your code; it is correctly retrieving the data from the URL you gave.

For example, if I try this at the command line:

$ curl -O 'http://translate.google.com/translate_tts?ie=UTF-8&tl=en&q=revenues+in+new+york+were+56+million'

I get a 1.5kB 403 error page, whose contents say:

403. That's an error.

Your client does not have permission to get URL /translate_tts?ie=UTF-8&tl=en&q=revenues+in+new+york+were+56+million from this server. (Client IP address: 1.2.3.4)

That's all we know.

And your code does the exact same thing, whether run in GAE or directly in the interactive interpreter.
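
One way to make this failure visible, rather than silently storing the error page, is to check the response before writing anything to the blobstore. The following is only a minimal sketch, assuming Python 3's urllib.request; the helper names check_audio_response and fetch_mp3 are hypothetical and not part of any library. In the Python 2 code from the question, urllib.urlopen(url) returns the 403 page without raising, and calling getcode() on the returned object should give the same status information.

```python
import urllib.error
import urllib.request


def check_audio_response(status, content_type, body):
    """Return body if the response looks like audio; raise ValueError otherwise."""
    if status != 200:
        raise ValueError('unexpected HTTP status %d' % status)
    # An HTML error page arrives as text/html, not as an audio type.
    if not (content_type or '').lower().startswith('audio/'):
        raise ValueError('unexpected content type %r' % content_type)
    if not body:
        raise ValueError('empty response body')
    return body


def fetch_mp3(url):
    """Fetch url and return its bytes only if the server actually sent audio."""
    try:
        response = urllib.request.urlopen(url)
    except urllib.error.HTTPError as err:
        # Python 3's urlopen raises on 4xx/5xx responses, so a 403 lands here.
        raise ValueError('unexpected HTTP status %d' % err.code)
    with response:
        return check_audio_response(
            response.getcode(),
            response.headers.get('Content-Type'),
            response.read(),
        )
```

With a check like this in place of the bare read() call, the handler would fail with an error naming the 403 instead of finalizing a 1.5kB blob that is labelled audio/mp3 but contains HTML.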

Most likely, the reason it works in your browser is that your browser does have permission. So, what does that mean? It could mean that you have a valid SID cookie from google.com in your browser, but not in your script. Or it could mean that your browser's user agent is recognized as something that can play HTML5 audio, but your script's isn't. Or…

Well, you can try to reverse-engineer what differs between your browser and your script (cookies, headers, and so on), narrow it down to the relevant difference, and then send explicit headers or cookies, or whatever else turns out to be needed, to work around the problem.

But it will just break the next time Google changes anything.

And Google will probably not be happy with you if you try this. They offer a Google Translate API service that they want you to use, and they got rid of all of the free options for that API because of "substantial economic burden caused by extensive abuse." Trying to publish a Google App Engine web service that evades Google's API pricing by scraping their pages is probably not the kind of thing they enjoy their customers doing.
