
How can I extract URLs from within JavaScript code? - Python

A site of mine went offline a while ago and I need to recover the images. I've managed to write some Python that extracts the code from a script tag with Beautiful Soup. I now need to parse some URLs from the extracted text. The URLs I need are the ones stored under the "large" key. I'm unsure how to loop over all the images rather than just the first, and how to remove the quotation marks. Any help would be greatly appreciated.

Extracted Text:

var gallery_items = [{
    "type": "image",
    "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg",
    "medium-height": 267,
    "medium-width": 400,
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "large-height": 450,
    "large-width": 675,
    "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg",
    "caption": ""
}, {
    "type": "image",
    "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg",
    "medium-height": 267,
    "medium-width": 400,
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "large-height": 450,
    "large-width": 675,
    "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg",
    "caption": ""
}];

Python Script

from bs4 import BeautifulSoup
import urllib.request as request
import re

folder = r'./gallery'
URL = 'https://web.archive.org/web/20180324152250/http://www.example.com:80/project/test-museum-visitors-center/'
response = request.urlopen(URL)
soup = BeautifulSoup(response, 'html.parser')

scriptCnt = soup.find('div', {'class': 'posts-wrapper'})
script = scriptCnt.find('script').text

try:
    found = re.search('"large":(.+?)"', script).group(1)
except AttributeError:
    found = 'None Found!'


print(found)

Output

"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg
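The output shows both problems at once. `re.search` stops at the first match, and the capture group begins right after the colon, so it includes the opening quotation mark. A minimal sketch of a regex-only fix follows, assuming the served script is compact (no space after the colon, as the output above suggests). The shortened `script` sample and the name `large_urls` are illustrative and not part of the original script.

```python
import re

# Shortened sample of the script text in compact form (an assumption based
# on the output above); only the "large" entries from the post are kept.
script = r'var gallery_items = [{"type":"image","large":"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg","large-height":450},{"type":"image","large":"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg","large-height":450}];'

# The quotation marks sit outside the capture group, so they are not part
# of the result; \s* tolerates optional whitespace around the colon.
pattern = re.compile(r'"large"\s*:\s*"(.*?)"')

# findall returns the captured group for every match, not just the first.
# The replace call turns the JSON-escaped "\/" back into a plain "/".
large_urls = [url.replace('\\/', '/') for url in pattern.findall(script)]

for url in large_urls:
    print(url)
```

The pattern does not match the "large-height" or "large-width" keys, because it requires the closing quotation mark immediately after `large`.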

The value assigned to gallery_items is valid JSON, so it is easy to parse with Python's json module. All you need to do is carefully isolate the JSON array from the surrounding JavaScript and pass it to the JSON parser. The code might look something like this:

import json

# Raw string (r'''...''') so the JSON escape sequence \/ reaches the parser unchanged.
script_str = r'''var gallery_items = [{ "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg", "caption": "" }, { "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg", "caption": "" }];'''
# Slice out the array literal: from the first '[' through the last ']'.
json_str = script_str[script_str.find('['):script_str.rfind(']') + 1]
gallery_items = json.loads(json_str)  # the parser decodes \/ to /
for item in gallery_items:
    print(item['large'])
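In the question's script the text comes from the fetched page rather than from a literal, so the same idea can be wrapped in a small helper. This is only a sketch: the function name `extract_large_urls` and the shortened `sample` string are made up for illustration, and it assumes the array is assigned with `var gallery_items = [...];` as in the extracted text.

```python
import json
import re


def extract_large_urls(script_text):
    """Return the "large" URL of every item in the gallery_items array."""
    # Capture the array literal assigned to gallery_items. DOTALL lets '.'
    # span newlines in case the script is pretty-printed.
    match = re.search(
        r'var\s+gallery_items\s*=\s*(\[.*?\])\s*;', script_text, re.DOTALL
    )
    if match is None:
        return []
    items = json.loads(match.group(1))
    # Items without a "large" key are skipped instead of raising KeyError.
    return [item['large'] for item in items if 'large' in item]


# Shortened sample built from the extracted text in the question.
sample = r'''var gallery_items = [{
    "type": "image",
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "caption": ""
}, {
    "type": "image",
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "caption": ""
}];'''

for url in extract_large_urls(sample):
    print(url)
```

With the question's script, the call would be `extract_large_urls(script)`. The lazy match ends at the first `]` followed by `;`, so a caption containing that sequence would cut the array short and make `json.loads` raise an error.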

Hope this helps! Cheers!
