
How to get a specific chunk of text from a URL of a text file?

for i in range(len(file)):
    a = file.loc[i, "SECFNAME"]
    url = ('https://www.sec.gov/Archives/' + a)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    txt = str(soup)
    text = txt.lower()
    doc_lenght = len(text)

for line in urllib.request.urlopen(url):
    print(line.decode('utf-8'))
    def mdaa(text, doc_lenght):
        if elem in text.find("ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS "):
            print (elem)
        else :
            pass

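Further down, the linked document has a section called Management's Discussion and Analysis. Its description is the block of text that needs to be extracted. With the code above I can only print the whole document, not that specific section.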

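Every row of the dataset (file) contains a URL that needs to be processed this way. So, in Python, given the URL of a text file, what is the simplest way to access its contents and print them line by line locally, without saving a local copy of the text file?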
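For the last part of the question (reading a remote text file line by line without saving it), here is a minimal sketch using only the standard library. The helper name `iter_url_lines` and the `data:` demo URL are assumptions made for illustration, so that the example runs without network access; a real call would pass the full `https://www.sec.gov/Archives/...` URL instead.

```python
import urllib.request


def iter_url_lines(url, encoding="utf-8"):
    """Yield the decoded lines of the text resource at `url`, one at a time.

    Hypothetical helper: the response is read as a stream and nothing is
    written to disk.
    """
    with urllib.request.urlopen(url) as response:
        for raw_line in response:
            # Undecodable bytes are replaced rather than raising an error.
            yield raw_line.decode(encoding, errors="replace").rstrip("\r\n")


# Demo with an inline data: URL (an assumption, used only to stay offline).
demo_url = "data:text/plain;charset=utf-8,first%20line%0Asecond%20line%0A"
for line in iter_url_lines(demo_url):
    print(line)
```

`urlopen` also accepts a `urllib.request.Request` object, which may be needed if the server expects particular headers such as a descriptive `User-Agent`.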

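You can use Pandas to read your .xlsx or .csv file and use the apply function on the SECFNAME column. Use the requests library to fetch the text, which avoids saving a local copy to a file. Apply a regular expression similar to the text you already pass to find. The caveat is that ITEM 8 must be present, because the pattern uses it to mark where the section ends. From there you can print the result to the screen or save it to a file. From what I checked, not all of the text links contain ITEM 7, which is why some items in the list come back as None.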

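import pandas as pd
import requests
import re

URL_PREFIX = "https://www.sec.gov/Archives/"
REGEX = r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)"

# Placeholder file name: load your own file here (pd.read_excel for .xlsx).
df = pd.read_csv("your_file.csv")

def get_section(url):
    # URL_PREFIX already ends with "/", so append the path directly
    # to avoid a double slash in the request URL.
    source = requests.get(f'{URL_PREFIX}{url}').text

    # DOTALL lets ".*?" span lines; the lookahead stops just before ITEM 8.
    r = re.findall(REGEX, source, re.M | re.DOTALL)
    if r:
        return ''.join(r)
    # Otherwise falls through and returns None (no ITEM 7 ... ITEM 8 span).

df['has_ITEM7'] = df.SECFNAME.apply(get_section)

hasITEM7_list = df['has_ITEM7'].to_list()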

Output from hasITEM7_list

['\nITEM 7. MANAGEMENT\'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS\n        OF OPERATION\n\n\nYEAR ENDED DECEMBER 28, 1997 COMPARED TO THE YEAR ENDED DECEMBER 29, 1996\n\n\n     In November 1996, the Company initiated a major restructuring and growth\nplan designed to substantially reduce its cost structure and grow the business\nin order to restore higher levels of profitability for the Company. By July\n1997, the Company completed the major phases of the restructuring plan. The\n$225.0 million of annualized cost savings anticipated from the restructuring\nresults primarily from the consolidation of administrative functions within the\nCompany, the rationalization
...
...
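To see in isolation why some rows come back as None, the regular expression can be tried on small in-memory strings. This is only a sketch under stated assumptions: `extract_item7` and the sample text are made up for illustration, `re.IGNORECASE` is an addition that the answer's pattern does not use, and `search` returns only the first match, whereas the answer joins all matches from `findall`.

```python
import re

# Same pattern as in the answer; re.IGNORECASE is an added assumption so that
# headings written as "Item 7." also match.
SECTION_RE = re.compile(
    r"\nITEM 7\.\s*MANAGEMENT'S DISCUSSION AND ANALYSIS.*?(?=\nITEM 8\.\s)",
    re.DOTALL | re.IGNORECASE,
)


def extract_item7(text):
    """Return the first ITEM 7 section in `text`, or None if it is not delimited."""
    match = SECTION_RE.search(text)
    return match.group(0).strip() if match else None


# Made-up sample filing text, for illustration only.
SAMPLE = (
    "ITEM 6. SELECTED FINANCIAL DATA\nSome numbers.\n"
    "\nItem 7. Management's Discussion and Analysis\nRevenue grew.\n"
    "\nITEM 8. FINANCIAL STATEMENTS\nBalance sheet.\n"
)

print(extract_item7(SAMPLE))
# No ITEM 7 heading at all: the function returns None.
print(extract_item7("ITEM 1. BUSINESS\nNo MD&A here.\n"))
# ITEM 7 present but no ITEM 8 after it: the lookahead never succeeds, so None.
print(extract_item7("\nITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS\nText only.\n"))
```

The third call shows the caveat mentioned in the answer: without a following ITEM 8 heading the pattern has no end marker, so nothing is returned even though ITEM 7 exists.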
