How would I go about getting info from a list of links and then dumping it into a JSON object?

I'm new to Python and BeautifulSoup. Any help is highly appreciated.

I have an idea of how to build a list of one company's info, but only after clicking on one link.

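import requests
from bs4 import BeautifulSoup


url = "http://data-interview.enigmalabs.org/companies/"
r = requests.get(url)

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")

link_list = []

# Print each link's target and its visible text
for link in links:
    print(link.get("href"), link.text)

g_data = soup.find_all("div", {"class": "table-responsive"})

# Collect the link targets (list.append returns None, so there is nothing to print)
for link in links:
    link_list.append(link.get("href"))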

Can anyone give me an idea of how to first scrape the links and then build a JSON of all the company listings data for the site?

I attached sample images for a better visualization as well.

How would I scrape the site and build a JSON like my example below without having to click on each individual link?

Example Expected Output:

all_listing = [{'Dickens-Tillman': {'Company Detail':
 {'Company Name': 'Dickens-Tillman',
  'Address Line 1': '7147 Guilford Turnpike Suit816',
  'Address Line 2': 'Suite 708',
  'City': 'Connfurt',
  'State': 'Iowa',
  'Zipcode': '22598',
  'Phone': '00866539483',
  'Company Website': 'lockman.com',
  'Company Description': 'enable robust paradigms'}}},
 {'Klein-Powlowski': {'Company Detail':
 {'Company Name': 'Klein-Powlowski',
  'Address Line 1': '32746 Gaylord Harbors',
  'Address Line 2': 'Suite 866',
  'City': 'Lake Mario',
  'State': 'Kentucky',
  'Zipcode': '45517',
  'Phone': '1-299-479-5649',
  'Company Website': 'marquardt.biz',
  'Company Description': 'monetize scalable paradigms'}}}]

print(all_listing)
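
A note on the target shape: the expected output above is a Python literal, and once it goes through `json.dumps` the single quotes come out as double quotes. Here is a minimal sketch of the nesting, assuming each company's details have already been scraped into (label, value) pairs. The helper name `build_listing` and the sample rows are hypothetical and only illustrate the structure.

```python
import json


def build_listing(name, rows):
    """Wrap (label, value) pairs in the nested shape used in the expected output."""
    # Strip stray whitespace that scraped table cells often carry
    detail = {label.strip(): value.strip() for label, value in rows}
    return {name: {"Company Detail": detail}}


# Hypothetical scraped rows, borrowing a few values from the first sample entry
rows = [
    ("Company Name", "Dickens-Tillman"),
    ("City", "Connfurt"),
    ("State", "Iowa"),
    ("Zipcode  ", "22598"),
]

all_listing = [build_listing("Dickens-Tillman", rows)]
json_text = json.dumps(all_listing, indent=4)
print(json_text)
```

Keeping the list-of-single-key-dicts shape matches the example, although a single dict keyed by company name would probably be easier to look things up in.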

Here is my final solution to the question I asked.

import json
from os.path import basename as bn
from urllib.parse import urljoin

import bs4
import requests

links = []
data = {}
base = 'http://data-interview.enigmalabs.org/'

# Approach
# 1. On each listing page, collect the company links.
# 2. Iterate over each link in the list.
# 3. Before moving on, check that the list of links is correct; if not, review step 2, then step 1.
# 4. Write the collected data to a JSON file.


def bs(r):
    # Fetch a URL relative to base and return the first <table> on that page
    return bs4.BeautifulSoup(requests.get(urljoin(base, r)).content, 'html.parser').find('table')


# Collect every "a" found in the table on each of the 10 listing pages
for i in range(1, 11):
    print('Collecting page %d' % i)
    links += [a['href'] for a in bs('companies?page=%d' % i).find_all('a')]

# Now that all links are collected into a list, iterate over each link.
# All the info is within an HTML table, so collect each company's rows into data.
for link in links:
    print('Processing %s' % link)
    name = bn(link)
    data[name] = {}
    for row in bs(link).find_all('tr'):
        desc, cont = row.find_all('td')
        data[name][desc.text] = cont.text

print(json.dumps(data))

# Final step: format the data and write it to a file
json_data = json.dumps(data, indent=4)
with open("solution.json", "w") as f:
    f.write(json_data)
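
The row-parsing step can be tried offline before hitting the site. This is a minimal sketch, assuming each detail page holds a table whose rows contain two `td` cells (label, then value). The inline HTML, the sample href, and the helper names `parse_company_table` and `company_key` are made up for illustration. Unlike the loop above, it skips rows that do not have exactly two cells (for example a header row) instead of failing on unpacking.

```python
import posixpath
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

BASE = "http://data-interview.enigmalabs.org/"

# Hypothetical stand-in for one company detail page
SAMPLE_HTML = """
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Company Name</td><td>Dickens-Tillman</td></tr>
  <tr><td>City</td><td> Connfurt </td></tr>
</table>
"""


def company_key(href):
    """Use the last path segment of a link as the dictionary key."""
    return posixpath.basename(urlparse(href).path.rstrip("/"))


def parse_company_table(html):
    """Turn two-cell table rows into a {label: value} dict."""
    table = BeautifulSoup(html, "html.parser").find("table")
    details = {}
    if table is None:
        return details
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # header or malformed row, nothing to record
        label, value = cells
        details[label.get_text(strip=True)] = value.get_text(strip=True)
    return details


href = "/companies/Dickens-Tillman"  # hypothetical link taken from a listing page
full_url = urljoin(BASE, href)
data = {company_key(href): parse_company_table(SAMPLE_HTML)}
print(full_url)
print(data)
```

Using `posixpath.basename` on the parsed URL path, rather than `os.path.basename` on the raw href, keeps the key extraction independent of the operating system and of any query string on the link.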
