Text Scraping (from EDGAR 10K Amazon) code not working
I have the code below to count occurrences of a specific list of words in a financial statement text file (US SEC EDGAR 10-K). I have manually cross-checked the document and found the words in it, but my code does not find any of them. I am using Python 3.5.3. I would highly appreciate any help with this. Thanks in advance.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib.request as urllib2
import time
import csv
import sys
CIK = '0001018724'
Year = '2013'
string_match1 = 'edgar/data/1018724/0001193125-13-028520.txt'
url3 = 'https://www.sec.gov/Archives/' + string_match1
response3 = urllib2.urlopen(url3)
words = [
'anticipate',
'believe',
'depend',
'fluctuate',
'indefinite',
'likelihood',
'possible',
'predict',
'risk',
'uncertain',
]
count = {}  # dictionary mapping each word to its number of occurrences
for elem in words:
    count[elem] = 0
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word] + elements.count(word)
print(CIK)
print(Year)
print(url3)
print(count)
Here is the script output:
0001018724
2013
https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt
{
'believe': 0,
'likelihood': 0,
'anticipate': 0,
'fluctuate': 0,
'predict': 0,
'risk': 0,
'possible': 0,
'indefinite': 0,
'depend': 0,
'uncertain': 0,
}
A simplified version of your code seems to work in Python 3.7 with the requests library:
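import requests
url = 'https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt'
response = requests.get(url)
# The same word list as in the question
words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
         'likelihood', 'possible', 'predict', 'risk', 'uncertain']
info = response.text  # response body decoded to str, so str.count() can match str words
count = {}  # dictionary mapping each word to its number of occurrences
for elem in words:
    # str.count() counts substrings, so 'risk' also matches 'risks'
    count[elem] = info.count(elem)
print(count)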
Output:
{'anticipate': 9, 'believe': 32, 'depend': 39, 'fluctuate': 4, 'indefinite': 15, 'likelihood': 15, 'possible': 25,
'predict': 6, 'risk': 55, 'uncertain': 38}
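A likely reason the code in the question reports only zeros: in Python 3, iterating over the object returned by `urlopen` yields `bytes` lines, so `line.split()` produces `bytes` tokens, and a `bytes` token never compares equal to a `str` word. The sketch below illustrates this offline and shows one possible whole-word, case-insensitive count. The names `count_words` and `sample` are hypothetical, the sample text is made up for illustration, and UTF-8 decoding is an assumption about the file.

```python
import re


def count_words(lines, words, encoding="utf-8"):
    """Count whole-word, case-insensitive matches in an iterable of bytes lines."""
    counts = {w: 0 for w in words}
    for raw in lines:
        # Decode bytes to str first; the encoding is an assumption.
        text = raw.decode(encoding, errors="replace").lower()
        # Split into alphabetic tokens so trailing punctuation is ignored.
        for token in re.findall(r"[a-z]+", text):
            if token in counts:
                counts[token] += 1
    return counts


# Hypothetical sample standing in for the bytes lines that urlopen() yields.
sample = [
    b"We believe that risk factors may fluctuate.\n",
    b"Risks are uncertain; we cannot predict every risk.\n",
]

# The comparison used in the question: bytes tokens vs. a str word.
print(sample[0].split().count("believe"))  # 0

print(count_words(sample, ["believe", "risk", "predict"]))
# {'believe': 1, 'risk': 2, 'predict': 1}
```

Note that this counts exact tokens only, so "Risks" is not counted as "risk", whereas the `count()` call on the full text in the answer above matches substrings. The two approaches can therefore give different totals; which one is appropriate depends on whether inflected forms should be included.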