简体   繁体   中英

How do I extract the data from the URL using Regex (Know the variable name)?

I am trying to extract data from a website https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited using Scrapy and Beautiful Soup. However, both scrapers return empty when I use the class 'list-nw' .

I tried different parsers using BS but the same. On closer look, I noticed the view source has the data I need. Thus I get the page content in text which has the data. (rather than the class).

How do I extract the entire array using Regex for the key "LstrationaleDetails" inside variable var Model . (Line number 793)?

I tried several Regex but was unable to. Is Regex the only option or I can use Scrapy or BS? Also confused as after extracting how will I store it? If it was a JSON I could de-serialize it. I was thinking of something in lines of split and eval .

I tried this for BS.

page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html5lib.parser')
print(soup)

Thanks for the help.

Attributable to @tmadam

You can use the following regex to extract from source html. Use the DOTALL flag to allow for newlines. User-Agent is required in headers.

import requests
import re
import json

url = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
headers = {    
    'User-Agent' : 'Mozilla/5.0'
}
r = requests.get(url, headers = headers)
data = re.search('var Model =(.*?);\s+Ratinoal', r.text, flags=re.DOTALL).group(1)
result = json.loads(data)
for item in result['LstrationaleDetails']:
    print(item)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM