Extracting tables from a web page
I need to extract the second column of every table on this page: https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表
I don't need the last three tables, though.
However, my code only extracts the second column from the first table.
import pickle

import bs4 as bs
import requests


def save_china_tickers():
    resp = requests.get('https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    # find() returns only the first matching table
    table = soup.find('table', {'class': 'wikitable'})
    tickers = []
    for row in table.findAll('tr')[1:]:  # skip the header row
        ticker = row.findAll('td')[1].text  # second column
        tickers.append(ticker)
    with open('chinatickers.pickle', 'wb') as f:
        pickle.dump(tickers, f)
    return tickers


save_china_tickers()
I have an easy method.
from urllib.request import urlopen
from dashtable import html2data  # to convert html table to list of lists
import re

url = "https://zh.wikipedia.org/wiki/%E4%B8%8A%E6%B5%B7%E8%AF%81%E5%88%B8%E4%BA%A4%E6%98%93%E6%89%80%E4%B8%8A%E5%B8%82%E5%85%AC%E5%8F%B8%E5%88%97%E8%A1%A8"

# Reading http content
data = urlopen(url).read().decode()

# now fetching all tables with the help of regex
tables = ["<table>{}</table>".format(table) for table in re.findall(r"<table .*?>(.*?)</table>", data, re.M|re.S|re.I)]

# parsing data
parsed_tables = [html2data(table)[0] for table in tables]  # html2data returns a tuple with 0th index as list of lists

# lets take first table ie 600000-600099
parsed = parsed_tables[0]

# column names of first table
print(parsed[0])

# rows of first table 2nd column
for index in range(1, len(parsed)):
    print(parsed[index][1])

"""
Output: All the rows of table 1, column 2 excluding the headers
"""
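The snippet above prints only the first table. To cover what the question asks for (second column of every table, minus the last three), here is a rough sketch that uses only the standard library's html.parser rather than dashtable or BeautifulSoup. The names SecondColumnParser, second_column, fetch_html and the skip_last parameter are made up for this example. It assumes the tables of interest carry the wikitable class, that data cells are td elements, and that no table is nested inside another; I have not checked these against the live page.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class SecondColumnParser(HTMLParser):
    """Collect the <td> text of every row, grouped per 'wikitable' table."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one list of rows per matching table
        self._in_table = False  # True while inside a wikitable
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if "wikitable" in classes:
                self._in_table = True
                self.tables.append([])
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self._in_table = False


def second_column(html, skip_last=3):
    """Return the second <td> of each row, ignoring the last `skip_last` tables."""
    parser = SecondColumnParser()
    parser.feed(html)
    parser.close()
    wanted = parser.tables[:-skip_last] if skip_last else parser.tables
    # Header rows contain only <th> cells, so they have no second <td>.
    return [row[1] for table in wanted for row in table if len(row) > 1]


def fetch_html(url):
    """Download a page as text (hypothetical helper; needs network access)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    return urlopen(request).read().decode("utf-8")


# Example use, with `url` as defined in the answer above:
# tickers = second_column(fetch_html(url))
```

The parsing is kept separate from the download so it can be tried on a saved copy of the page first. If the last three tables turn out not to be wikitables, pass skip_last=0.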