I am trying to extract data from the website https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited using Scrapy and Beautiful Soup. However, both scrapers return an empty result when I select on the class 'list-nw'.
I tried different parsers with BS, with the same result. On closer inspection, I noticed that the page source (View Source) does contain the data I need, so I can fetch the page content as text and the data is in there, just not under that class.
How do I extract the entire array for the key "LstrationaleDetails" inside the variable var Model (line 793 of the page source) using a regex?
I tried several regexes without success. Is regex the only option, or can I use Scrapy or BS? I am also unsure how to store the data once it is extracted. If it were JSON I could deserialize it; I was thinking of something along the lines of split and eval.
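This is what I tried with BS:
import urllib.request
from bs4 import BeautifulSoup
quote_page = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html5lib')  # the parser name is 'html5lib', not 'html5lib.parser'
print(soup)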
Thanks for the help.
Attributable to @tmadam
You can use the following regex to extract the object from the page source. Use the DOTALL flag so that . also matches newlines. The request needs a User-Agent header.
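import json
import re

import requests

url = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
headers = {
    'User-Agent': 'Mozilla/5.0'
}
r = requests.get(url, headers=headers)
# Capture everything between "var Model =" and the ";" that precedes "Ratinoal" in the page's script
data = re.search(r'var Model =(.*?);\s+Ratinoal', r.text, flags=re.DOTALL).group(1)
# The captured text is valid JSON, so it can be deserialized directly (no need for split or eval)
result = json.loads(data)
for item in result['LstrationaleDetails']:
    print(item)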
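If you would rather not depend on the text that follows the object in the page's script, one possible variation is to locate the "var Model =" assignment and let json.JSONDecoder.raw_decode parse a single JSON value starting at that position. The sketch below runs offline against a small made-up HTML string: SAMPLE_HTML, the helper name extract_model, and the fields "Id" and "Title" are hypothetical and only stand in for the real response text. It assumes, as the code above does, that the object literal is valid JSON.

```python
import json
import re

# Hypothetical stand-in for r.text; the real Model object is much larger
# and its item fields will differ from the made-up "Id" and "Title".
SAMPLE_HTML = """
<script>
    var Model = {"CompanyName": "20 Microns Limited",
        "LstrationaleDetails": [
            {"Id": 1, "Title": "first"},
            {"Id": 2, "Title": "second"}
        ]};
    var other = 1;
</script>
"""


def extract_model(html, var_name="Model"):
    """Return the JSON value assigned to `var <var_name>` as a Python object."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"'var {var_name}' not found in the page source")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever comes after it, so no trailing anchor is needed.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


model = extract_model(SAMPLE_HTML)
details = model["LstrationaleDetails"]
for item in details:
    print(item)
```

With the real page you would pass r.text instead of SAMPLE_HTML. The result is an ordinary list of dicts, so it can be stored with json.dump or loaded into whatever structure you need.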