Text scraping (from EDGAR 10K Amazon) code not working

Question

I have the code below to count occurrences of a specific word list in a financial statement text file (US SEC EDGAR 10-K). I would highly appreciate it if anyone could help me with this. I have manually cross-checked and found the words in the document, but my code is not finding any of them. I am using Python 3.5.3. Thanks in advance.

Given a URL path to an EDGAR 10-K file in .txt format for a company (CIK) in a given year, this code performs a word count.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib.request as urllib2
import time
import csv
import sys

CIK = '0001018724'
Year = '2013'
string_match1 = 'edgar/data/1018724/0001193125-13-028520.txt'
url3 = 'https://www.sec.gov/Archives/' + string_match1
response3 = urllib2.urlopen(url3)
words = [
    'anticipate',
    'believe',
    'depend',
    'fluctuate',
    'indefinite',
    'likelihood',
    'possible',
    'predict',
    'risk',
    'uncertain',
    ]
count = {}  # is a dictionary data structure in Python
for elem in words:
    count[elem] = 0
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word] + elements.count(word)
print(CIK)
print(Year)
print(url3)
print(count)

Here is the script output:

0001018724

2013

https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt

{
    'believe': 0,
    'likelihood': 0,
    'anticipate': 0,
    'fluctuate': 0,
    'predict': 0,
    'risk': 0,
    'possible': 0,
    'indefinite': 0,
    'depend': 0,
    'uncertain': 0,
}
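The all-zero counts above are consistent with a bytes/str mismatch: in Python 3, iterating over a urlopen() response yields bytes lines, so line.split() produces bytes tokens, and list.count() with a str argument never matches them. The following is a minimal offline sketch of that effect, not the original script; the sample line and the helper count_words are made up for illustration.

```python
# Hypothetical sample standing in for one line read from the HTTP response.
line = b"We believe the risk is possible and the risk may grow\n"


def count_words(tokens, words):
    """Count exact-token matches for each target word (illustrative helper)."""
    return {w: tokens.count(w) for w in words}


words = ["believe", "risk", "possible"]

# Splitting bytes gives bytes tokens; a str target never equals a bytes token.
bytes_counts = count_words(line.split(), words)

# Decoding first gives str tokens that can equal the str targets.
str_counts = count_words(line.decode("utf-8").split(), words)

print(bytes_counts)
print(str_counts)
```

If this is indeed the cause, decoding each line (for example with line.decode('utf-8', errors='ignore')) before calling split() would be one way to get non-zero counts from the loop in the question.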

Answer 1

A simplified version of your code seems to work in Python 3.7 with the requests library:

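import requests

url = 'https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt'
response = requests.get(url)

# Same word list as in the question
words = [
    'anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
    'likelihood', 'possible', 'predict', 'risk', 'uncertain',
]

# Convert the downloaded bytes to a string once, outside the loop
info = str(response.content)

count = {}  # dictionary mapping each word to its number of occurrences
for elem in words:
    # str.count counts substring matches, so 'risk' also matches 'risks'
    count[elem] = info.count(elem)

print(count)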

Output:

    {'anticipate': 9, 'believe': 32, 'depend': 39, 'fluctuate': 4, 'indefinite': 15, 'likelihood': 15, 'possible': 25, 'predict': 6, 'risk': 55, 'uncertain': 38}
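One thing to keep in mind about these numbers: str.count counts substring occurrences, so 'risk' is also counted inside 'risks' and 'predict' inside 'unpredictable'. If whole-word counts are what you want, a rough sketch could look like the one below. The sample sentence, both helper names, and the choice to lowercase the text and tokenize on runs of letters are my own assumptions, not part of the code above.

```python
import re
from collections import Counter


def count_substrings(text, words):
    """Substring counting, similar in spirit to calling str.count per word."""
    lowered = text.lower()
    return {w: lowered.count(w) for w in words}


def count_whole_words(text, words):
    """Count only tokens that equal a target word exactly (assumed tokenizer)."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {w: tokens[w] for w in words}


# Made-up sample text, not taken from the filing.
sample = "Risks are uncertain. We believe the risk depends on unpredictable events."
words = ["risk", "depend", "predict", "believe", "uncertain"]

substring_counts = count_substrings(sample, words)
whole_word_counts = count_whole_words(sample, words)

print(substring_counts)
print(whole_word_counts)
```

Which of the two is appropriate depends on whether inflected forms such as 'risks' or 'depends' should count toward the total; the substring version includes them, the whole-word version does not.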

Answer 1: 0, accepted, 2019-07-19 23:06:21
