How do I extract the data from the URL using Regex (knowing the variable name)?

I am trying to extract data from the website https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited using Scrapy and Beautiful Soup. However, both scrapers return an empty result when I select the class 'list-nw'.

I tried different parsers with BS, with the same result. On closer inspection, I noticed that the page source (View Source) contains the data I need. So the data is present in the page content as text, rather than in elements with that class.

How do I extract the entire array for the key "LstrationaleDetails" inside the variable var Model using Regex? (It is at line number 793 of the page source.)

I tried several regexes but none of them worked. Is Regex the only option, or can I use Scrapy or BS? I am also unsure how to store the data once it is extracted. If it were JSON, I could deserialize it. I was thinking of something along the lines of split and eval.

I tried this with BS.

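import urllib.request
from bs4 import BeautifulSoup

quote_page = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html5lib')  # the parser name is 'html5lib', not 'html5lib.parser'
print(soup)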

Thanks for the help.

Attributable to @tmadam

You can use the following regex to extract the data from the page source HTML. Use the DOTALL flag so that . also matches newlines. A User-Agent is required in the headers.

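import json
import re

import requests

url = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
# A User-Agent header is required.
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)

# Capture everything between "var Model =" and the ";" that precedes
# "Ratinoal" (spelling kept from the original answer, because the pattern
# has to match the page source literally). re.DOTALL lets "." span newlines.
# A raw string keeps "\s" from being treated as a string escape.
data = re.search(r'var Model =(.*?);\s+Ratinoal', r.text, flags=re.DOTALL).group(1)
result = json.loads(data)  # parse the captured text as JSON
for item in result['LstrationaleDetails']:
    print(item)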
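
A possible variation, not part of the original answer: instead of anchoring the regex on the text that follows the object, you can locate the `var Model =` assignment and let `json.JSONDecoder.raw_decode` parse a single JSON value from that position. This is only a sketch. The helper name `extract_js_object` and the `sample_html` string are hypothetical stand-ins for illustration (the real page content would come from `r.text`), and it assumes the assigned value is valid JSON, as the answer above implies.

```python
import json
import re


def extract_js_object(html, var_name):
    """Return the JSON value assigned by `var <var_name> = ...` in html."""
    # Consume the whitespace after "=" so the match ends on the first
    # character of the JSON value; raw_decode does not skip whitespace.
    match = re.search(r'var\s+' + re.escape(var_name) + r'\s*=\s*', html)
    if match is None:
        raise ValueError(f'no assignment to {var_name!r} found')
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever comes after it (here, the rest of the script).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Hypothetical sample standing in for r.text; field names are made up.
sample_html = '''<script>
var Model = {"LstrationaleDetails": [
    {"Id": 1, "Name": "first; item"},
    {"Id": 2, "Name": "second item"}
]};
var Other = 1;
</script>'''

model = extract_js_object(sample_html, 'Model')
details = model['LstrationaleDetails']
for item in details:
    print(item)
```

Because the JSON parser decides where the value ends, a `;` inside a string value does not cut the match short. The resulting list is ordinary Python data, so it could be stored with `json.dump` if needed.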


