
Web scraping with page_soup.findAll: I need to extract specific data from a webpage but don't know how to do it

I am trying to do some web scraping and need to extract the keywords from a webpage. I am trying to use page_soup.findAll() to extract them, but I don't know what to pass inside the parentheses to get what I need.

The code of the page is the following:

var kv = {"seccion": "otros","nivel": "home","nota": "","id_nota": "","tipo": "noticias","keywords" : "IMPUESTOS,  SII,  EXCEDENTES ISAPRES,  INCENDIOS,  COLUSION CONFORT,  COMPENSACION,  PERMISOS DE CIRCULACION,  REVISION TECNICA"};

And I need this data:

"IMPUESTOS, SII, EXCEDENTES ISAPRES, INCENDIOS, COLUSION CONFORT, COMPENSACION, PERMISOS DE CIRCULACION, REVISION TECNICA"

Thanks

This is not HTML but JavaScript, so findAll() is useless for this.

You have it as a string, so use string functions to get it, e.g. slicing [start:end], split(), replace(), etc.
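As a rough sketch of the string-function approach, assuming the JavaScript line is already in a Python string and the key is written exactly as `"keywords" : "` (the names `marker`, `start`, `end` and `cleaned` are only illustrative):

```python
# The JavaScript line from the page, held as a plain Python string.
text = 'var kv = {"seccion": "otros","nivel": "home","nota": "","id_nota": "","tipo": "noticias","keywords" : "IMPUESTOS,  SII,  EXCEDENTES ISAPRES,  INCENDIOS,  COLUSION CONFORT,  COMPENSACION,  PERMISOS DE CIRCULACION,  REVISION TECNICA"};'

# Assumption: the key appears exactly once and with this exact spacing.
marker = '"keywords" : "'
start = text.index(marker) + len(marker)  # first character of the value
end = text.index('"', start)              # closing quote of the value

keywords = text[start:end]

# Collapse the double spaces after each comma into a single space.
cleaned = ", ".join(part.strip() for part in keywords.split(","))
print(cleaned)
```

This breaks if the site changes the spacing around the colon or puts an escaped quote inside the value, so it is only suitable when the format is stable.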

Or you can remove var kv = and ; from this string, which leaves a JSON string. You can convert it to a Python dictionary with the json module and then read the value from the dictionary with dictionary["keywords"].

import json

text = 'var kv = {"seccion": "otros","nivel": "home","nota": "","id_nota": "","tipo": "noticias","keywords" : "IMPUESTOS,  SII,  EXCEDENTES ISAPRES,  INCENDIOS,  COLUSION CONFORT,  COMPENSACION,  PERMISOS DE CIRCULACION,  REVISION TECNICA"};'

# remove `var kv = ` (9 characters) from the start and `;` from the end
text = text[9:-1]

# parse the remaining JSON string into a Python dictionary
d = json.loads(text)

print(d['keywords'])
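On a real page the var kv line sits somewhere inside the full HTML, so the fixed slice [9:-1] will not apply directly. A minimal sketch of one way to locate it first, assuming the page source is already downloaded into a string (the `html` sample below is a shortened, made-up stand-in) and that the object is flat, with no nested braces:

```python
import json
import re

# Hypothetical, shortened stand-in for the downloaded page source.
html = '''<html><head><script>
var kv = {"seccion": "otros", "keywords" : "IMPUESTOS,  SII"};
</script></head><body></body></html>'''

# Capture the {...} object assigned to kv.
# Assumption: no nested braces and no "};" inside any string value.
match = re.search(r'var\s+kv\s*=\s*(\{.*?\})\s*;', html, re.DOTALL)
if match is None:
    raise ValueError("var kv not found in page source")

data = json.loads(match.group(1))

# Split the comma-separated value into a clean list of keywords.
keyword_list = [k.strip() for k in data["keywords"].split(",")]
print(keyword_list)
```

The regular expression is only a convenience for this simple shape; a more complex object would need a sturdier way of finding where it ends.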


 