How to get a specific chunk of text from a URL of a text file?
for i in range(len(file)):
    a = file.loc[i, "SECFNAME"]
    url = ('https://www.sec.gov/Archives/' + a)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    txt = str(soup)
    text = txt.lower()
    doc_lenght = len(text)
    for line in urllib.request.urlopen(url):
        print(line.decode('utf-8'))

def mdaa(text, doc_lenght):
    if elem in text.find("ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS "):
        print (elem)
    else :
        pass
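Each linked document contains a section called Management's Discussion and Analysis, with a description or chunk of text that needs to be pulled out. With the code above I can only print the whole document, not that particular section.
Every row of the dataset (file) holds a URL that needs to be processed. So, in Python, given the URL of a text file, what is the simplest way to access its contents and print them locally line by line without saving a local copy of the file?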
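You can use Pandas to read your .xlsx or .csv file and use the apply function on the SECFNAME column. Use the requests library to fetch the text, which avoids saving a local copy to a file. Apply a regular expression similar to the text you already pass to find in your function; the caveat is that ITEM 8 must also be present, because the match runs from the ITEM 7 heading up to the ITEM 8 heading. From there you can print to the screen or save to a file. From what I checked, not all of the text links contain ITEM 7, which is why some items in the list come back as None.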
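import pandas as pd
import requests
import re

URL_PREFIX = "https://www.sec.gov/Archives/"
# Match from the ITEM 7 heading up to (not including) the ITEM 8 heading.
REGEX = r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)"

def get_section(url):
    # URL_PREFIX already ends with "/", so join without adding another one.
    source = requests.get(f'{URL_PREFIX}{url}').text
    r = re.findall(REGEX, source, re.M | re.DOTALL)
    if r:
        return ''.join(r)
    return None  # no ITEM 7 ... ITEM 8 span found in this document

# df is assumed to be the DataFrame already read from your .xlsx or .csv file.
df['has_ITEM7'] = df.SECFNAME.apply(get_section)
hasITEM7_list = df['has_ITEM7'].to_list()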
Output from hasITEM7_list
['\nITEM 7. MANAGEMENT\'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS\n OF OPERATION\n\n\nYEAR ENDED DECEMBER 28, 1997 COMPARED TO THE YEAR ENDED DECEMBER 29, 1996\n\n\n In November 1996, the Company initiated a major restructuring and growth\nplan designed to substantially reduce its cost structure and grow the business\nin order to restore higher levels of profitability for the Company. By July\n1997, the Company completed the major phases of the restructuring plan. The\n$225.0 million of annualized cost savings anticipated from the restructuring\nresults primarily from the consolidation of administrative functions within the\nCompany, the rationalization
...
...
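To see why some rows come back as None, the answer's pattern can be tried on small in-memory strings instead of real filings. This is only a minimal sketch: the helper name extract_item7 and the two sample fragments are made up for illustration and are not taken from any SEC document.

```python
import re

# Same pattern as in the answer: lazily match from the ITEM 7 heading up to
# (but not including) the next line that starts with "ITEM 8.".
REGEX = r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)"


def extract_item7(source):
    """Return the ITEM 7 section of `source`, or None if the pattern does not match."""
    matches = re.findall(REGEX, source, re.DOTALL)
    return "".join(matches) if matches else None


# Hypothetical filing fragments, for illustration only.
with_both = (
    "PART II\n"
    "ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION\n"
    "Revenue grew during the year.\n"
    "ITEM 8. FINANCIAL STATEMENTS\n"
    "Balance sheet follows.\n"
)
without_item8 = (
    "PART II\n"
    "ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION\n"
    "Revenue grew during the year.\n"
)

section = extract_item7(with_both)
print(section)                        # the ITEM 7 heading and its body text
print(extract_item7(without_item8))   # None: there is no ITEM 8 heading to stop at
```

The pattern is written in upper case, so it would also return None on text that has first been passed through lower(), as the code in the question does. Matching against the original text, or passing re.IGNORECASE, is one possible way around that, though the answer itself does not do either explicitly.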