Download a txt file from EDGAR

Question

I want to download this file to my local drive: https://www.sec.gov/Archives/edgar/data/1556179/0001104659-20-000861.txt

Here is my code:

import requests
import urllib
from bs4 import BeautifulSoup
import re
  
path=r"https://www.sec.gov/Archives/edgar/data/1556179/0001104659-20-000861.txt" 
r=requests.get(path, headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
soup=str(soup)
lines=soup.split("\\n")

dest_url=r"C://Users/YL/Downloads/a.txt"
fx=open(dest_url,'w')
for line in lines:
    fx.write(line + '\n')

Here is the error message:

How should I download the file then? Thanks a lot!

Answer 1

Your file downloaded fine; the problem seems to be with BeautifulSoup's parsing. If you change the parser, then instead of

content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
soup=str(soup)

use

stuff=r.text
soup=BeautifulSoup(stuff, "html.parser")
soup

and you'll see the file is there.
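To make the suggestion concrete, here is a minimal sketch of the parser change followed by writing the result to disk. It makes a few assumptions that are not in the answer: the helper names extract_text and save_text are hypothetical, the sample string is a made-up stand-in for r.text (not the real filing), and get_text() is used to pull out the text. It also uses splitlines(), because split("\\n") in the question splits on a literal backslash followed by n rather than on newlines.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_text(markup: str) -> str:
    """Parse markup with the built-in html.parser and return its text content."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.get_text()


def save_text(text: str, dest: Path) -> None:
    """Write text to dest; the with-block closes the file even on error."""
    with open(dest, "w", encoding="utf-8") as fx:
        fx.write(text)


# Made-up stand-in for r.text from the question, NOT the real filing content.
sample = "<DOCUMENT>\n<TYPE>EX-99\n<TEXT>\nFirst line\nSecond line\n</TEXT>\n</DOCUMENT>\n"

text = extract_text(sample)
# splitlines() splits on real newlines, unlike split("\\n") in the question.
lines = text.splitlines()
```

In the original script, extract_text(r.text) would replace the three html5lib lines, and save_text(text, Path("a.txt")) would replace the open/write loop (the destination name here is only an example).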

Answer 2

The download is fine. The problem is that str(soup) is not well-defined and throws html5lib into an endless loop. You probably meant

soup = soup.text

which (crudely) extracts the actual readable text from the BeautifulSoup object.
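Both answers agree the download itself succeeds, so if the goal is only to get the file onto disk, the parsing step may not be needed at all. The sketch below is one way to do that, not something either answer proposed: it swaps requests for the standard library's urllib.request and copies the response bytes to a file unchanged. The function name download_file and its parameters are hypothetical. The User-Agent header is kept because the question sends one; SEC is understood to expect a descriptive User-Agent on automated requests, so the value to use there is an assumption to check against their current guidance.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path, user_agent: str) -> Path:
    """Stream the body at url to dest unchanged and return dest."""
    # The question sends a custom User-Agent header, so this sketch does too.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Copy raw bytes in chunks: no decoding, no HTML parsing, no line splitting.
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

With the names from the question, the call would look like download_file(path, Path(dest_url), user_agent="b2g"). Writing in binary mode also sidesteps the decode('utf8') step, so the saved file is byte-for-byte what the server sent.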

Answer 1
0 2022-01-16 16:23:18

Answer 2
0 2022-01-17 12:30:40
