Scraping a list of URLs using BeautifulSoup and converting the data to CSV

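I'm new to Python and have the following questions:

  1. I have a list of URLs that I want to scrape data from. I don't know what is wrong with my code, but I can't retrieve results from all of the URLs. The code only scrapes the first URL and not the rest. How can I successfully scrape the data (title, info, description, application) from every URL in the list?

  2. If question 1 works, how can I convert the data to a CSV file?

Here is the code: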

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

urlList = ["url1","url2","url3"...lists of urls.......]

for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        soup = BeautifulSoup(html.read(),"html5lib")
# Scraping
def getTitle():
    for title in soup.find('h2', class_="xx").text:
            print(title)

def getInfo():
   for info in soup.find('ul', class_="j-k-i").text:
        print(info)

def getDescription():
    for description in soup.find('div', class_="b-d").text:
        print(description)

def getApplication():
    for application in soup.find('div', class_="g-b bm-b-30").text:
       print(application)

for soups in soup():
    getTitle()
    getInfo()
    getDescription()
    getApplication()

Try the following:

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError


# Each helper takes the soup for the page currently being processed.
# .text is a plain string, so looping over it would print one character
# per line; print the whole string instead.
def getTitle(soup):
    print(soup.find('h2', class_="xx").text)

def getInfo(soup):
    print(soup.find('ul', class_="j-k-i").text)

def getDescription(soup):
    print(soup.find('div', class_="b-d").text)

def getApplication(soup):
    print(soup.find('div', class_="g-b bm-b-30").text)

urlList = ["url1","url2","url3"...lists of urls.......]

for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        soup = BeautifulSoup(html.read(),"html5lib")

        getTitle(soup)
        getInfo(soup)
        getDescription(soup)
        getApplication(soup)

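This passes the current soup to each function. In the question's code, the functions are defined and called only after the URL loop has finished, so they only ever see a single soup object. Calling them inside the loop's else branch runs them once for every page that was fetched successfully.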
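For the second question (writing the results to a CSV file), one option is to have the helpers return text instead of printing it, collect one row per URL, and write all rows at the end. The following is only a minimal sketch under some assumptions: the names `FIELDS`, `text_or_empty`, `scrape_row`, `write_csv` and the file name `output.csv` are made up for illustration, and the tag and class selectors are the ones from the question. It uses the standard `csv` module and expects a soup object that was already created in the loop.

```python
import csv

# Column order for the CSV file (hypothetical names chosen for this sketch).
FIELDS = ["url", "title", "info", "description", "application"]


def text_or_empty(soup, name, css_class):
    """Return the stripped text of the first matching tag, or "" if there is none."""
    tag = soup.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else ""


def scrape_row(url, soup):
    """Build one CSV row (a dict) from an already parsed page."""
    return {
        "url": url,
        "title": text_or_empty(soup, "h2", "xx"),
        "info": text_or_empty(soup, "ul", "j-k-i"),
        "description": text_or_empty(soup, "div", "b-d"),
        "application": text_or_empty(soup, "div", "g-b bm-b-30"),
    }


def write_csv(rows, path):
    """Write the collected rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical usage with the loop from the answer:
#     rows = []                            # before the loop
#     rows.append(scrape_row(url, soup))   # in the else: branch
#     write_csv(rows, "output.csv")        # after the loop
```

Returning an empty string for a missing tag keeps one page with a different layout from stopping the whole run. Since the question already imports pandas, `pd.DataFrame(rows).to_csv("output.csv", index=False)` should produce an equivalent file from the same list of rows.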

