Unable to write HTML content in an Excel file using openpyxl

I've created a tiny script in Python to scrape the first title and its description from a website and write both to an Excel file using the openpyxl library. The important thing to notice here is that I wish to save the title as text but the description as raw HTML content, not text.

Here is what I've tried:

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

link = "https://stackoverflow.com/questions/tagged/web-scraping"
wb = Workbook()
wb.remove(wb['Sheet'])

def fetch_content(link):
    req = requests.get(link)
    soup = BeautifulSoup(req.text,"lxml")
    title = soup.select_one("#questions .summary .question-hyperlink").get_text(strip=True)
    desc = soup.select_one("#questions .summary")

    ws.append([title,desc])
    print(title,desc)

if __name__ == '__main__':
    ws = wb.create_sheet("output")
    ws.append(['Title','Description'])
    fetch_content(link)
    wb.save("SO.xlsx")

When I run the script, I get the following error:

raise ValueError("Cannot convert {0!r} to Excel".format(value))
ValueError: Cannot convert <div class="summary"> -----so on

Expected output in that Excel file (both truncated):

How to scrape data   <div class="summary">

stovfl and robot.txt came up with the perfect solution: openpyxl cannot write a BeautifulSoup Tag object to a cell, so cast desc to str before appending it. I took the liberty of posting the answer since I often forget this approach.

def fetch_content(link):
    req = requests.get(link)
    soup = BeautifulSoup(req.text,"lxml")
    title = soup.select_one("#questions .summary .question-hyperlink").get_text(strip=True)
    desc = soup.select_one("#questions .summary")

    ws.append([title,str(desc)]) #cast desc to str
    print(title,desc)
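
If more than one column may end up holding a non-scalar object, the same cast can be wrapped in a small helper. This is only a sketch: to_cell_value, to_row and SAFE_TYPES are names made up for illustration, the list of "safe" types is my approximation rather than openpyxl's own list, and FakeTag is a hypothetical stand-in for a bs4 Tag so that the snippet runs without any third-party packages.

```python
import datetime as dt
from decimal import Decimal

# Assumption: types that can be handed to a worksheet cell unchanged.
# This is an approximation for illustration, not openpyxl's own list.
SAFE_TYPES = (str, int, float, bool, Decimal,
              dt.datetime, dt.date, dt.time, dt.timedelta, type(None))


def to_cell_value(value):
    """Return plain scalars unchanged; convert anything else with str()."""
    if isinstance(value, SAFE_TYPES):
        return value
    return str(value)


def to_row(values):
    """Build a list that should be safe to pass to ws.append()."""
    return [to_cell_value(v) for v in values]


class FakeTag:
    """Hypothetical stand-in for a bs4 Tag: str() yields its markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


row = to_row(["How to scrape data", FakeTag('<div class="summary">...</div>')])
print(row)
```

In the script above, the call would then become ws.append(to_row([title, desc])), which for a real Tag does the same thing as str(desc).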
