Extracting data from txt using URL in Python

Question

I have a URL that contains txt data. From this URL I want to extract a particular section of the data:

Data here:

I have added a screenshot of the txt file. In the screenshot you can see "Table of Contents". From this table of contents I want to extract the textual data of a particular item number. For example, I want to extract the data for Part 2, Item 5, which is on page 12. Can anyone help me extract this particular data using Python?

Answer 1

There are a few ways you could go about this. The first, and probably the simplest, is the string find() method (str.find). Of course, that assumes you already know what you are looking for and just want a program to fetch it instead of doing it manually.
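As a rough illustration of the find() approach, the sketch below slices out the text between two known headings. The sample string and the helper name slice_between are made up for this example; they stand in for the downloaded filing and are not part of the original answer.

```python
# Hypothetical sample standing in for the downloaded filing text.
sample = (
    "PART II\n"
    "ITEM 5. MARKET FOR REGISTRANT'S COMMON EQUITY\n"
    "The Common Stock is listed on the New York Stock Exchange.\n"
    "ITEM 6. SELECTED FINANCIAL DATA\n"
    "Five-year summary follows.\n"
)


def slice_between(text, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker."""
    start = text.find(start_marker)
    if start == -1:
        return None  # start marker not present
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        end = len(text)  # no end marker: take everything to the end
    return text[start:end]


item_5 = slice_between(sample, "ITEM 5.", "ITEM 6.")
print(item_5)
```

On a real filing the same heading probably also appears in the table of contents, so the first hit may be the contents entry rather than the section itself; you might need to search again starting after that first match.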

Second, after looking at how the document is formatted, you could possibly pass it to something like BeautifulSoup, although I really wouldn't recommend feeding that type of document into BS4; you'd likely get a large number of errors.

The third option, possibly the easiest as far as I can tell, is to construct a regex that searches the document for a string matching what you want.

For your example of Part 2, Item 5, you could write a simple program like the following:

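import re
import requests

# Download the filing as plain text
r = requests.get("https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt")

str_to_check = r.text

# Case-insensitive search; returns a list with every occurrence of the heading text
what_i_want = re.findall(r"(?i)(Part 2,? Item 5)", str_to_check)

print(what_i_want)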

A site like regex101 can be very helpful for learning how to construct the regex you need.
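To capture the body of a section rather than just its heading, one option is a lazy match that stops at the next item heading. This is only a sketch: the sample text is invented, and the assumption that headings start a line as "ITEM 5." and "ITEM 6." is taken from the output shown in Answer 4, not verified against the whole filing.

```python
import re

# Hypothetical sample standing in for the downloaded filing text.
sample_text = (
    "ITEM 4. SUBMISSION OF MATTERS TO A VOTE\n"
    "None.\n"
    "ITEM 5. MARKET FOR REGISTRANT'S COMMON EQUITY\n"
    "The Company has paid quarterly cash dividends.\n"
    "ITEM 6. SELECTED FINANCIAL DATA\n"
    "Five-year summary follows.\n"
)

# Match from a line starting with "ITEM 5." up to (not including)
# the next line starting with "ITEM 6.", or the end of the text.
pattern = re.compile(
    r"^ITEM\s+5\..*?(?=^ITEM\s+6\.|\Z)",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

match = pattern.search(sample_text)
section = match.group(0) if match else None
print(section)
```

re.MULTILINE lets ^ match at the start of each line, and re.DOTALL lets the dot span line breaks, so the lazy .*? grows until the lookahead finds the next heading.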

Answer 2

You can split the text into pages and print the content of the page you want:

import requests

r = requests.get("https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt")

r_text = r.text

text_pages = r_text.split("<PAGE>")

# get content of page 12 (the +1 offset depends on how many unnumbered chunks precede page 1 in this filing)
page = 12
text_page_12 = text_pages[page+1]
print(text_page_12)
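If you would rather not rely on a fixed index offset, a possible variation is to key each chunk by the page number printed at its end. The output in Answer 4 shows the number 12 on its own line just before a <PAGE> marker, and the sketch below assumes every numbered page follows that layout. The sample text and the function name pages_by_number are invented for illustration.

```python
# Hypothetical sample: each numbered page ends with its number on its own line.
sample = (
    "Cover page text\n"
    "<PAGE>\n"
    "Table of contents\n"
    "                  i\n"
    "<PAGE>\n"
    "Business overview\n"
    "                  1\n"
    "<PAGE>\n"
    "Market information\n"
    "                  2\n"
    "<PAGE>\n"
    "Trailing text\n"
)


def pages_by_number(text):
    """Map printed page number -> chunk text, for chunks ending in a numeric footer."""
    pages = {}
    for chunk in text.split("<PAGE>"):
        # Ignore blank lines so the footer is the last non-empty line
        lines = [ln for ln in chunk.splitlines() if ln.strip()]
        if lines and lines[-1].strip().isdigit():
            pages[int(lines[-1].strip())] = chunk
    return pages


pages = pages_by_number(sample)
print(pages[2])
```

Chunks without a numeric footer (a cover page, or contents pages numbered with roman numerals) are simply left out of the mapping.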

Answer 3

Get the table data with Requests and BeautifulSoup; then you can do other things with it, such as saving it to a txt file.

# https://stackoverflow.com/questions/64618978/extracting-data-from-txt-using-url-in-python
import requests
from bs4 import BeautifulSoup


def get_data(url):
    r = requests.get(url)
    r.raise_for_status()  # fail loudly instead of silently returning None
    return r.content


url = "https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt"
data = get_data(url)

soup =  BeautifulSoup(data, "lxml")

table = soup.find("table")
print(table)

Answer 4

This code prints the desired table in the terminal and also writes it to a CSV file (table_output.csv). You can select the PART, the ITEM, and the TABLE (in case a section contains more than one table, as in PART I, Item 2), or you can select the text of a section by PART and ITEM.

Note on the CSV file: you have to do a bit of work yourself (see the code), because the way elements are grouped is not necessarily unique, and I don't see how to do that completely mechanically for every table.

The code is not heavily commented, but I hope that's enough. Don't hesitate to ask if you have questions.

Best regards,

Stéphane

The code:

import requests
import re

def Encode_digit(digit, one, five, nine):
# from https://codereview.stackexchange.com/questions/141402/converting-from-roman-numerals-to-arabic-numbers

    return (
        nine                     if digit == 9 else
        five + one * (digit - 5) if digit >= 5 else
        one + five               if digit == 4 else
        one * digit              
    )

def Encode_roman_numeral(num): 
# from same source as above
    num = int(num)
    return (
        'M' * (num // 1000) +
        Encode_digit((num // 100) % 10, 'C', 'D', 'CM') +
        Encode_digit((num //  10) % 10, 'X', 'L', 'XC') +
        Encode_digit( num         % 10, 'I', 'V', 'IX') 
    )

def LookForPart(r_text, Part):
    Part=Encode_roman_numeral(Part)
    l_tags=[]
    s='PART '+Part
    for i in range(0,len(r_text)):#len(r_text)):
        if re.search('PART '+Part+'$',r_text[i]) and r_text[i].find('    ')>-1:
            l_tags.append(i)
    return l_tags

def LookForItem(r_text, Item):
    l_tags=[]
    # Escape the pattern so the trailing '.' matches a literal dot, not any character
    s=re.escape('ITEM '+str(Item)+'.')
    for i in range(0,len(r_text)):
        if re.search(s,r_text[i]):
            l_tags.append(i)
    return l_tags

def SelectSection(r_text,Part,Item):
    start=LookForPart(r_text,Part)[0]
    try:
        end=LookForPart(r_text,Part+1)[0]
    except IndexError:
        # No following PART: the section runs to the end of the text
        end=len(r_text)
    r_text=r_text[start:end]
    start=LookForItem(r_text,Item)[0]
    try:
        end=LookForItem(r_text,Item+1)[0]
    except IndexError:
        # No following ITEM: the section runs to the end of the PART
        end=len(r_text)
    r_text=r_text[start:end]
    return r_text

def RawTable(Section):
    tables=[]
    nb_tables=0
    for i in range(0,len(Section)):
        if Section[i].find('<TABLE>')>-1:
            Start=i
        if Section[i].find('</TABLE>')>-1:
            End=i
            nb_tables+=1
            tables.append(Section[Start:End+1])

    return tables

def get_table(r_text, Part, Item, Table_Nb):
    Section=SelectSection(r_text,Part,Item)
    table=[]
    ThisTable=RawTable(Section)[Table_Nb]
    for i in range(len(ThisTable)):
        line=(ThisTable[i].replace('<S>','   ').replace('<C>','   ')
              .replace('<TABLE>','')
              .replace('</TABLE>','')
              .replace('<CAPTION>',''))
        line=re.sub('-','  ',line)
        # Replace each run of three or more dots with the same number of spaces
        line=re.sub(r'\.{3,}',lambda m: ' '*len(m.group()),line)
        if len(line)>0:
            #print(line,file=f)
            if len(line.strip(' '))>0:
                table.append(line)    
    return table

##################################################################################
#
# Main:
# 
##################################################################################

# Retrieve data:

url='https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'
r = requests.get(url)
r_text=r.text
r_text=r_text.splitlines()


# Get the first table in Part II, Item 5 (the section may contain more than one table)

Result=get_table(r_text,2,5,0) # 2,5,0 = Part II, Item 5, first table
print("<Extracted table>")
for i in range(0,len(Result)):
    print(Result[i])
print('<End of table>')


# The rest really depends on the intrinsic variability of the data. If we assume that this format is strictly
# valid in every case, then we can extract the data as follows:

# Let's output the result to a comma-separated csv file:

import csv

Result=Result[2:]  # We skip the first two rows


fields=['DATE','$ HIGH','FRAC_HIGH','$ LOW','FRAC_LOW'] 
dump=[]
dump.append(fields)
for i in range(0,len(Result)):
    dump.append([Result[i][8:22],Result[i][30:38].strip(' '),Result[i][38:44].strip(' '),
                 Result[i][45:51].strip(' '),Result[i][51:].strip(' ')])

# newline='' avoids blank lines between rows on Windows; 'with' closes the file for us
with open('table_output.csv', 'w', newline='') as f:
    write = csv.writer(f)
    for row in dump:
        write.writerow(row)

# Another example: select the full text of PART II, Item 5.
print()
print('<Selection: PART 2 Item 5.>')
sel=SelectSection(r_text, 2,5)
for i in range(0, len(sel)):
    print(sel[i])

The output from the terminal:

<Extracted table>
                                         MARKET PRICE
                                       HIGH          LOW
        1997:
        Fourth Quarter             $50 7/16     $37 
        Third Quarter              $45 3/4      $35 3/8
        Second Quarter             $40 3/4      $29 3/4
        First Quarter              $34 1/2      $24 5/8
        1996:
        Fourth Quarter             $29 1/2      $22 3/4
        Third Quarter              $24 3/4      $12 1/4
        Second Quarter             $17 1/8      $13 1/2
        First Quarter              $19 3/4      $15 1/8
<End of table>

<Selection: PART 2 Item 5.>
ITEM 5. MARKET FOR REGISTRANT'S COMMON EQUITY AND RELATED STOCKHOLDER MATTERS


     The Common Stock is listed and traded on the New York Stock Exchange under
the symbol "SOC". The Company has paid quarterly cash dividends of $.01 per
share since December 15, 1992. The Company presently intends to continue to pay
cash dividends at a quarterly rate of $.01 per share; however, future payments
of cash dividends will be at the discretion of the Company's Board of Directors
and dependent upon the Company's results of operations, financial condition and
other relevant factors.


     The following table sets forth the high and low sale prices for the Common
Stock for the calendar quarters indicated as reported by the New York Stock
Exchange Composite Tape:



<TABLE>
<CAPTION>
                                         MARKET PRICE
                                   ------------------------
                                       HIGH          LOW
                                   -----------   ----------
<S>                                <C>           <C>
        1997:
        Fourth Quarter .........   $50 7/16     $37 
        Third Quarter ..........   $45 3/4      $35 3/8
        Second Quarter .........   $40 3/4      $29 3/4
        First Quarter ..........   $34 1/2      $24 5/8
        1996:
        Fourth Quarter .........   $29 1/2      $22 3/4
        Third Quarter ..........   $24 3/4      $12 1/4
        Second Quarter .........   $17 1/8      $13 1/2
        First Quarter ..........   $19 3/4      $15 1/8
</TABLE>

     On February 27, 1998 there were approximately 1,311 record holders of the
Company's Common Stock.


                                       12
<PAGE>
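The price rows in the extracted table (for example "$50 7/16") could also be parsed with a regex instead of fixed column offsets, and the fractions converted to decimals. The sketch below is only one possible way to do it; the two sample rows are copied from the output above, while the names PRICE and parse_row are invented for this example.

```python
import re
from fractions import Fraction

# Two rows copied from the extracted table in the output.
rows = [
    "        Fourth Quarter             $50 7/16     $37 ",
    "        Third Quarter              $45 3/4      $35 3/8",
]

# A dollar amount with an optional fraction, e.g. "$50 7/16" or "$37"
PRICE = re.compile(r"\$(\d+)(?:\s+(\d+)/(\d+))?")


def parse_row(row):
    """Return (label, high, low) with prices as floats, or None if not a data row."""
    prices = PRICE.findall(row)
    if len(prices) != 2:
        return None  # e.g. a year heading such as "1997:"
    label = row.split("$", 1)[0].strip()
    values = []
    for whole, num, den in prices:
        fraction = Fraction(int(num), int(den)) if num else 0
        values.append(float(int(whole) + fraction))
    return (label, values[0], values[1])


for row in rows:
    print(parse_row(row))
```

Because this matches on the dollar sign rather than on character positions, it should tolerate small shifts in column alignment, at the cost of assuming every data row has exactly two prices.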

4 answers

Answer 1
0 2020-10-31 06:42:49

Answer 2
0 2020-10-31 06:49:10

Answer 3
0 2020-10-31 06:58:25

Answer 4
0 2020-11-01 16:09:15
