pandas load dataframe from string
I want to scrape an SEC EDGAR 13F form (in txt format) and parse it into a pandas.DataFrame.
Raw data link: https://www.sec.gov/Archives/edgar/data/1067983/000119312512060928/0001193125-12-060928.txt
I tried using bs4 to extract the table, as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_page(url):
    # Download the raw filing and return its bytes
    url_client = urlopen(url)
    page = url_client.read()
    url_client.close()
    return page

history_url = 'https://www.sec.gov/Archives/edgar/data/1067983/000119312513060317/0001193125-13-060317.txt'
txt_soup = BeautifulSoup(get_page(history_url), 'xml')
Then I extract the table from the soup:
table = txt_soup.find_all('TABLE')[0]
table_header = table.contents[1].contents[0]
table_data = table.contents[1].contents[1]
table_data looks like this:
<S> <C> <C> <C> <C> <C> <C> <C> <C> <C>
AMERICAN
EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -
AMERICAN
EXPRESS CO COM 025816109 990,116 17,225,400 Shared-Defined 4, 5 17,225,400 - -
AMERICAN
EXPRESS CO COM 025816109 48,274 839,832 Shared-Defined 4, 7 839,832 - -
AMERICAN
EXPRESS CO COM 025816109 111,689 1,943,100 Shared-Defined 4, 8, 11 1,943,100 - -
AMERICAN
EXPRESS CO COM 025816109 459,532 7,994,634 Shared-Defined 4, 10 7,994,634 - -
AMERICAN
EXPRESS CO COM 025816109 6,912,308 120,255,879 Shared-Defined 4, 11 120,255,879 - -
AMERICAN
EXPRESS CO COM 025816109 80,456 1,399,713 Shared-Defined 4, 13 1,399,713 - -
ARCHER DANIELS
MIDLAND CO COM 039483102 163,151 5,956,600 Shared-Defined 4, 5 5,956,600 - -
Now I want to convert this str into a pandas.DataFrame, so I tried:
from io import StringIO
import pandas as pd
pd.read_csv(StringIO(table_data.text), header=None)
The code above fails with this error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 6
How can I parse this kind of txt table correctly? Is there a better approach?
I don't know pandas DataFrames very well, but from looking at the code I believe I know where the problem is.
On line 3:
EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -
the data appears to be split on commas (csv files normally use a comma as the delimiter). So instead of one field, the parser sees six:
EXPRESS CO COM 025816109 112
209 1
952
142 Shared-Defined 4 1
952
142 - -
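The split can be reproduced with Python's standard csv module, which also defaults to a comma delimiter. This is only a sketch: the line is copied from the sample output above, and the exact pieces depend on the whitespace in the real filing. The first two lines of the sample contain no commas, which is presumably why the parser expected a single field.

```python
import csv
from io import StringIO

# Line 3 of the sample output, copied from the question.
line = "EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -"

# csv.reader splits on commas by default, so the thousands separators
# inside the numbers are treated as field boundaries.
fields = next(csv.reader(StringIO(line)))

print(len(fields))
for field in fields:
    print(repr(field))
```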
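My suggested fix is to remove all commas from the table text:
table_text = table_data.text.replace(',', '')
Then try again, passing StringIO(table_text) to read_csv. Please let me know how it goes!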
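Removing the commas stops the tokenizer error, but each row would still arrive as a single whitespace-separated string, and the issuer names wrap across two lines. One possible way to get real columns is a regular expression anchored on the 9-character CUSIP. The sketch below is not a definitive parser. It assumes every row has the shape shown in the sample, that "-" marks an empty cell, and that the manager numbers are a comma-separated list. The column names and the helpers `to_int` and `parse_rows` are made up for illustration, since the header row is not shown in the question.

```python
import re

# A few rows copied from the sample output in the question.
sample = """\
<S> <C> <C> <C> <C> <C> <C> <C> <C> <C>
AMERICAN
EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -
AMERICAN
EXPRESS CO COM 025816109 990,116 17,225,400 Shared-Defined 4, 5 17,225,400 - -
ARCHER DANIELS
MIDLAND CO COM 039483102 163,151 5,956,600 Shared-Defined 4, 5 5,956,600 - -
"""

# Assumed layout: name (issuer plus class), CUSIP, value, shares,
# discretion, manager list, then three voting columns.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<cusip>[0-9A-Z]{9})\s+"
    r"(?P<value>[\d,]+)\s+(?P<shares>[\d,]+)\s+"
    r"(?P<discretion>\S+)\s+(?P<managers>\d+(?:,\s*\d+)*)\s+"
    r"(?P<sole>\S+)\s+(?P<shared>\S+)\s+(?P<none>\S+)\s*$"
)

def to_int(text):
    # Assumption: "-" means the cell is empty.
    return None if text == "-" else int(text.replace(",", ""))

def parse_rows(text):
    rows, pending = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("<S>"):
            continue  # skip blank lines and the column-marker line
        match = ROW.match(" ".join(pending + [line]))
        if match is None:
            pending.append(line)  # wrapped first part of an issuer name
            continue
        row = match.groupdict()
        for key in ("value", "shares", "sole", "shared", "none"):
            row[key] = to_int(row[key])
        rows.append(row)
        pending = []
    return rows

rows = parse_rows(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to pandas.DataFrame(rows). Because the whitespace in the sample is collapsed, the name column still contains the share class ("COM"). If the original filing keeps its fixed-width column alignment, pandas.read_fwf may be a better fit.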