Unable to retrieve values from a dictionary after web scraping

Question

I was hoping people on here would be able to answer what I believe to be a simple question. I'm a complete newbie and have been attempting to create an image web scraper for the site Archdaily. Below is my code so far, after numerous attempts to debug it:

#### - Webscraping 0.1 alpha -
#### - Archdaily - 

import requests
from bs4 import BeautifulSoup

# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'

# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content

# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)

These lines here:

img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']

attempt to isolate the data-images attribute, shown here:

My GitHub upload of this portion (very long)

As you can see (or maybe I'm completely wrong here), my attempt to read the 'url_large' values from this final list of dictionaries results in a TypeError, shown below:

Traceback (most recent call last):
  File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 23, in <module>
    for k, v in img_list():
TypeError: 'str' object is not callable

I believe my error lies in how I isolate 'data-images', which to me looks like dicts within a list, since the contents are wrapped in square brackets and curly braces. I'm completely out of my element here because I basically jumped into this project blind (I haven't even read past chapter 4 of Guttag's book yet).

I also looked everywhere for ideas and tried to mimic what I found. Others have previously suggested converting the data to JSON, so I tried the code below:

jsonData = json.loads(img.attrs['data-images'])
print(jsonData['url_large'])

But that was a bust, shown here:

Traceback (most recent call last):
  File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 29, in <module>
    print(jsonData['url_large'])
TypeError: list indices must be integers or slices, not str

There is a step I'm missing in converting these string values, but I'm not sure where to do it. I'm hoping someone can help me resolve this issue, thanks!

Answer 1

It's all about the types.

img_list is actually not a list, but a string. You try to call it with img_list(), which results in an error.

You had the right idea in parsing it with json.loads. The error here is pretty straightforward: jsonData is a list, not a dictionary, because the page has more than one image.

You can loop through the list. Each item in the list is a dictionary, and you'll find the url_large key in each one:

images_json = img.attrs['data-images']
for image_properties in json.loads(images_json):
    print(image_properties['url_large'])
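
To see the type change without hitting the network, here is a minimal sketch. It assumes the attribute value is a JSON array of objects, as the answer describes; the sample string and the example.com URLs are made up for illustration and are not the real Archdaily data.

```python
import json

# Hypothetical stand-in for img.attrs['data-images']; the real value is much longer.
images_json = (
    '[{"url_large": "https://example.com/a_large.jpg",'
    ' "url_small": "https://example.com/a_small.jpg"},'
    ' {"url_large": "https://example.com/b_large.jpg"}]'
)

# Before parsing, the attribute is plain text.
type_before = type(images_json).__name__

# After parsing, it is a list whose items are dicts.
images = json.loads(images_json)
type_after = type(images).__name__
item_type = type(images[0]).__name__

# Index each dict by key, not the list itself.
large_urls = [image["url_large"] for image in images]
print(type_before, type_after, item_type)
print(large_urls)
```

Indexing the list itself with a string key, as in jsonData['url_large'], is what raises "list indices must be integers or slices, not str".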

Answer 2

@infinity & @simic0de are both right, but I wanted to more explicitly address what I see in your code as well.

In this particular block:

img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)

There are a couple of errors. Even if img_list truly were a dictionary, you could not iterate through it this way. You would need to use img_list.items() (Python 3) or img_list.iteritems() (Python 2) in the second line.

When you use parentheses like that, it means you're calling a function. But here you're trying to iterate, and img_list is not a function. That is why you get the 'is not callable' error.

The other main issue is the type issue. simic0de & Infinity address that, but ultimately you need to check the type of img_list and convert it as needed so you can iterate through it.
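
A small sketch of both points, using made-up values rather than the scraped data (the names sample and attr_value are hypothetical): calling a string raises the same TypeError, and .items() is what yields key/value pairs from a dict.

```python
# Hypothetical values for illustration only.
sample = {"url_large": "https://example.com/large.jpg", "caption": "hypothetical"}
attr_value = "just a string"

# Parentheses after a name mean "call it"; a str is not callable.
try:
    attr_value()
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Iterating over a dict directly yields only its keys.
keys_only = list(sample)

# .items() yields (key, value) pairs, which is what "for k, v in ..." needs.
pairs = list(sample.items())
for k, v in sample.items():
    if k == "url_large":
        print(v)

# Checking the type first tells you which conversion, if any, is needed.
print(type(attr_value).__name__, type(sample).__name__)
```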

Answer 3

Source of error: img_list is a string. You have to convert it to a list using json.loads; it then becomes a list of dicts that you have to loop over.

Working solution:

import json
import requests
from bs4 import BeautifulSoup

# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'

# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content

# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for img in json.loads(img_list):
    for k, v in img.items():
        if k == 'url_large':
            print(v)
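
If the page layout changes, the div or the attribute may be missing, or an entry may lack 'url_large'. The following is a hedged, slightly more defensive sketch of the same loop; large_urls_from_attr is a hypothetical helper name, and the sample input is invented.

```python
import json


def large_urls_from_attr(attr_value):
    """Return every 'url_large' value found in a JSON-array attribute string."""
    if not attr_value:  # the div or its attribute was not found
        return []
    urls = []
    for image in json.loads(attr_value):
        url = image.get("url_large")  # None when this entry has no large URL
        if url is not None:
            urls.append(url)
    return urls


# Invented sample: the second entry has no 'url_large' key.
sample_attr = (
    '[{"url_large": "https://example.com/1.jpg"},'
    ' {"url_small": "https://example.com/2_small.jpg"}]'
)
print(large_urls_from_attr(sample_attr))
print(large_urls_from_attr(None))
```

With the scraper above, this could be called as large_urls_from_attr(img.get('data-images') if img else None), since soup.find returns None when nothing matches.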

3 answers

Answer 1: 1, accepted, 2020-06-10 02:22:12
Answer 2: 0, 2020-06-10 12:24:18
Answer 3: -1, 2020-06-10 02:23:33
