
Having trouble splitting a string

I'm scraping some data from Google Translate like so:

import urllib
import mechanize

get_url=("https://translate.google.ie/translate_a/single?client=t&sl=auto&tl=es&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&source=btn&ssel=0&tsel=3&kc=0&tk=520887|911740&q=Hellow%20World")

browser=mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders=[('User-agent','Chrome')]

translate_text=urllib.urlopen(get_url).read()
print translate_text

Which gives me the following output:

[["Hellow Mundial", "Hellow World"]]
undefined
"en"
undefined
undefined
[["Hellow", 1,…], ["World", 2,…]]
0.022165652
undefined
[["en"], undefined, [0.022165652]]

Which can be seen here:

[screenshot of the output above]

So I tried to split the data on ]] so that my output would be only:

[["Hellow Mundial", "Hellow World"]]

I'm splitting the data like so:

translate_text=translate_text.split("]]")
print translate_text[0]

However, when I run this I get the page markup. Before the split, I got the query result. How come the split is causing this and not splitting the string as intended?

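Google is returning something that looks like JSON but is not valid JSON, because empty values are written as consecutive commas. Once a simple regex replaces each run of consecutive commas with a single comma, the text can be parsed easily. Be aware that collapsing commas shifts the positions of later elements, and it would also alter any translated text that itself contains consecutive commas.

Try: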

import json
import re

# Replace each run of consecutive commas with a single comma
translate_text = re.sub(',+', ',', translate_text).strip()
arr = json.loads(translate_text)
print arr[0][0][0]  # prints "Hellow Mundial"
print arr[0][0][1]  # prints "Hellow World"

Note that translate_text is a string and arr is a Python list. json.loads parses the text into native Python objects, so you can use ordinary list and dictionary look-ups.
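
As a rough offline sketch of the approach above, the snippet below runs the comma-collapsing step on a hand-written sample instead of a live response. The SAMPLE string is hypothetical (shaped like the output in the question, not a real capture), and both function names are made up for illustration. fill_empty_slots is an alternative that is not part of the answer: it writes null into each empty slot so that later elements keep their positions. The code is written for Python 3, unlike the Python 2 snippets in this thread.

```python
import json
import re

# Hypothetical sample shaped like the response in the question (not a real capture).
SAMPLE = '[[["Hellow Mundial","Hellow World",,,0]],,"en",,,0.022165652]'

def collapse_commas(raw):
    """Collapse each run of commas into one comma, as the answer above does."""
    return re.sub(r',+', ',', raw).strip()

def fill_empty_slots(raw):
    """Alternative (an assumption, not from the answer): insert null into each
    empty slot so that element positions are preserved."""
    return re.sub(r'(?<=[\[,])(?=,)|(?<=,)(?=\])', 'null', raw)

collapsed = json.loads(collapse_commas(SAMPLE))
filled = json.loads(fill_empty_slots(SAMPLE))

print(collapsed[0][0][0])  # first string of the first pair
print(collapsed[1])        # "en" has moved up to index 1
print(filled[2])           # "en" stays at its original index 2
```

Both variants operate on the raw text, so neither is safe if the translated strings themselves contain consecutive commas.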

Those ]] you see are not part of the actual string. They are placed there by Python to indicate that the items inside the [] and separated by , are elements of an array.

In your case, the first element of the array is a 2D array whose first dimension only contains one element. That element is itself an array containing two strings.

If I understand your question correctly, you don't need to split anything at all. Try simply typing:

print translate_text[0]

without the split.
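
Whether indexing alone is enough depends on whether translate_text is a string or an already-parsed list. In the question's code it comes from read(), which returns a string, so it may be worth checking both cases. The sketch below (Python 3, with a hand-written sample value that is only an assumption) shows how indexing and split behave on each.

```python
import json

# Hand-written sample standing in for the response body (an assumption).
raw = '[["Hellow Mundial", "Hellow World"]]'  # a str, like the result of read()
parsed = json.loads(raw)                       # a list containing one list

first_char = raw[0]                  # indexing a str gives a single character
first_item = parsed[0]               # indexing the list gives the inner list
before_split = raw.split("]]")[0]    # in a str, ]] are ordinary characters

print(type(raw).__name__, repr(first_char))
print(type(parsed).__name__, first_item)
print(before_split)
```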

I think the string you want to use is in JSON format, so I suggest parsing it with the json library:

>>> import json
>>> json.loads('[["Hellow Mundial", "Hellow World"]]')
[[u'Hellow Mundial', u'Hellow World']]

The JSON is converted into Python objects (here, a list of lists):

>>> l = json.loads('[["Hellow Mundial", "Hellow World"]]')
>>> l[0]
[u'Hellow Mundial', u'Hellow World']
>>> l[0][0]
u'Hellow Mundial'

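json.loads only accepts strictly valid JSON, and another answer in this thread points out that the raw response is not. A minimal sketch of guarding against that is shown below, assuming Python 3; parse_json_or_none is a hypothetical helper name, and both input strings are hand-written samples rather than real responses.

```python
import json

def parse_json_or_none(text):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

# Valid JSON parses into a list of lists.
valid = parse_json_or_none('[["Hellow Mundial", "Hellow World"]]')
# Empty slots (consecutive commas) are not valid JSON, so this yields None.
invalid = parse_json_or_none('[["Hellow Mundial","Hellow World",,,0]]')

print(valid)
print(invalid)
```
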
You could extract the first list with a regex:

get_url=("https://translate.google.ie/translate_a/single?client=t&sl=auto&tl=es&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&source=btn&ssel=0&tsel=3&kc=0&tk=520887|911740&q=Hellow%20World")

import re
import requests

r = requests.get(get_url)

# group(1) is the text between the first pair of square brackets
print(re.search(r'\[("(.*?)")\]', r.text).group(1))

"Hello World como estas","Hello World how are you"

If you want the two strings in separate variables (this only works when neither string contains a comma):

a, b = re.search(r'\[("(.*?)")\]', r.text).group(1).split(",")
print(a, b)
"Hello World como estas" "Hello World how are you"

If you really want a list you can use ast.literal_eval after getting the first list with re:

import re
from ast import literal_eval

# group(0) is the whole match, including the square brackets
print(literal_eval(re.search(r'\[("(.*?)")\]', r.text).group(0)))
['Hello World como estas', 'Hello World how are you']
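
The same idea can be tried offline. The sketch below (Python 3) wraps the regex in a small helper and parses the whole match instead of splitting on commas, so strings that contain commas stay intact. It uses json.loads in place of literal_eval, which is a swapped-in choice rather than what the answer does. first_string_list and the sample text are hypothetical names and data used only for illustration.

```python
import json
import re

# Matches the first bracketed list that starts and ends with a quoted string.
FIRST_LIST = re.compile(r'\["(.*?)"\]')

def first_string_list(text):
    """Return the first ["...", "..."] list found in text, or None."""
    match = FIRST_LIST.search(text)
    if match is None:
        return None
    # group(0) includes the square brackets, so it parses as a JSON list.
    return json.loads(match.group(0))

# Hand-written sample (an assumption); note the commas inside the strings.
sample = '[["Hola, mundo", "Hello, world"]]\n"en"'
pair = first_string_list(sample)
missing = first_string_list("no list here")

print(pair)
print(missing)
```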

If you open the URL in your browser, the response is actually downloaded as a .txt file.
