
Text scraping (from EDGAR 10-K, Amazon) code not working

I have the code below to count occurrences of a specific word list in the text file of a financial statement (US SEC EDGAR 10-K). I have manually cross-checked and found the words in the document, but my code does not find any of them. I am using Python 3.5.3. I would highly appreciate any help with this. Thanks in advance.

Given the URL path of an EDGAR 10-K filing in .txt format for a company (CIK) in a given year, this code performs a word count.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib.request as urllib2
import time
import csv
import sys

CIK = '0001018724'
Year = '2013'
string_match1 = 'edgar/data/1018724/0001193125-13-028520.txt'
url3 = 'https://www.sec.gov/Archives/' + string_match1
response3 = urllib2.urlopen(url3)
words = [
    'anticipate',
    'believe',
    'depend',
    'fluctuate',
    'indefinite',
    'likelihood',
    'possible',
    'predict',
    'risk',
    'uncertain',
    ]
count = {}  # is a dictionary data structure in Python
for elem in words:
    count[elem] = 0
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word] + elements.count(word)
print(CIK)
print(Year)
print(url3)
print(count)

Here is the script output:

0001018724

2013

https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt

{
    'believe': 0,
    'likelihood': 0,
    'anticipate': 0,
    'fluctuate': 0,
    'predict': 0,
    'risk': 0,
    'possible': 0,
    'indefinite': 0,
    'depend': 0,
    'uncertain': 0,
}

A simplified version of your code seems to work in Python 3.7 with the requests library:

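import requests

url = 'https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt'
response = requests.get(url)

# Same word list as in the question
words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
         'likelihood', 'possible', 'predict', 'risk', 'uncertain']

# Decode the response body to str once, outside the loop
info = response.text

count = {}  # maps each word to its number of occurrences
for elem in words:
    # str.count() counts substrings, so 'risk' also matches inside 'risks'
    count[elem] = info.count(elem)

print(count)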

Output:

    {'anticipate': 9, 'believe': 32, 'depend': 39, 'fluctuate': 4, 'indefinite': 15, 'likelihood': 15, 'possible': 25, 'predict': 6, 'risk': 55, 'uncertain': 38}
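
A likely explanation for the zeros in the question, which the answer does not spell out: in Python 3, iterating over the object returned by urlopen() yields bytes lines, so line.split() produces bytes tokens, and list.count() with a str argument never matches them. The sketch below decodes each line before comparing and counts whole words only, so its totals would be expected to differ from the substring counts shown above. The function name count_words, the UTF-8 encoding, and the sample lines are assumptions made for illustration; they are not part of the original post.

```python
import re
from collections import Counter

WORDS = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
         'likelihood', 'possible', 'predict', 'risk', 'uncertain']


def count_words(byte_lines, words, encoding='utf-8'):
    """Count whole-word, case-insensitive matches in an iterable of bytes lines.

    `words` is assumed to contain lowercase strings.
    """
    wanted = set(words)
    counts = Counter({word: 0 for word in words})
    for raw in byte_lines:
        # Decode bytes to str first; a bytes token never equals a str word
        line = raw.decode(encoding, errors='replace').lower()
        for token in re.findall(r'[a-z]+', line):
            if token in wanted:
                counts[token] += 1
    return dict(counts)


# Hypothetical sample standing in for the lines of an HTTP response
sample = [b'We believe the risk is possible.\n',
          b'Risks are uncertain; risk remains.\n']

# The comparison from the question: a str word against bytes tokens
print(sample[0].split().count('risk'))  # 0, even though b'risk' is present

result = count_words(sample, WORDS)
print(result['risk'], result['believe'], result['uncertain'])  # 2 1 1
```

With a real filing, the same function could be passed the urlopen() response directly in place of sample, since it only needs an iterable of bytes lines.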
