
Extracting tables from a web page

I need to extract all tables (only the second column) from this page: https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表

Well, I don't need the last three tables...

However, my code only extracts the second column from the first table.

 import pickle
 import requests
 import bs4 as bs  # the original code used `bs` without importing it

 def save_china_tickers():
     resp = requests.get('https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表')
     soup = bs.BeautifulSoup(resp.text, 'lxml')
     table = soup.find('table', {'class': 'wikitable'})  # find() returns only the FIRST table
     tickers = []
     for row in table.findAll('tr')[1:]:      # skip the header row
         ticker = row.findAll('td')[1].text   # second column
         tickers.append(ticker)
     with open('chinatickers.pickle', 'wb') as f:
         pickle.dump(tickers, f)
     return tickers

 save_china_tickers()
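The code above stops at the first table because `soup.find` returns only the first match. A minimal sketch of looping over every `wikitable` with `find_all` instead is shown below; the inline HTML is an assumed stand-in for the real page, so the exact tags and class names are assumptions (on the live page you would also slice off the unwanted last three tables with `[:-3]`):

```python
import bs4 as bs

# Stand-in HTML with two small tables; the real page's markup may differ.
html = """
<table class="wikitable"><tr><th>Code</th><th>Name</th></tr>
<tr><td>600000</td><td>SPD Bank</td></tr></table>
<table class="wikitable"><tr><th>Code</th><th>Name</th></tr>
<tr><td>600004</td><td>Baiyun Airport</td></tr></table>
"""

soup = bs.BeautifulSoup(html, "html.parser")

tickers = []
# find_all returns EVERY matching table, not just the first.
for table in soup.find_all("table", {"class": "wikitable"}):
    for row in table.find_all("tr")[1:]:       # skip the header row
        cells = row.find_all("td")
        if len(cells) > 1:                     # guard rows without a 2nd cell
            tickers.append(cells[1].text)      # second column

print(tickers)  # ['SPD Bank', 'Baiyun Airport']
```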

I have an easy method.

  1. Get the HTTP response
  2. Find all tables using a regex
  3. Parse each HTML table into a list of lists
  4. Iterate over each list in the list
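The regex step above can be sketched with the standard library alone; the snippet below runs on a small inline document (an assumed stand-in for the fetched page body):

```python
import re

# Inline stand-in for the downloaded page content.
data = """<p>intro</p>
<table class="wikitable"><tr><td>600000</td></tr></table>
<table class="wikitable"><tr><td>600004</td></tr></table>"""

# Non-greedy match on each <table ...>...</table> body; re.S lets '.'
# cross newlines, re.I ignores case, and the capture group keeps only
# the inner content, which is re-wrapped in plain <table> tags.
tables = ["<table>{}</table>".format(t)
          for t in re.findall(r"<table .*?>(.*?)</table>", data, re.M | re.S | re.I)]

print(len(tables))  # 2
```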
Requirements
  1. dashtable
Code
 from urllib.request import urlopen
 from dashtable import html2data  # to convert an HTML table to a list of lists
 import re

 url = "https://zh.wikipedia.org/wiki/%E4%B8%8A%E6%B5%B7%E8%AF%81%E5%88%B8%E4%BA%A4%E6%98%93%E6%89%80%E4%B8%8A%E5%B8%82%E5%85%AC%E5%8F%B8%E5%88%97%E8%A1%A8"

 # Read the HTTP content
 data = urlopen(url).read().decode()

 # Now fetch all tables with the help of a regex
 tables = ["<table>{}</table>".format(table)
           for table in re.findall(r"<table .*?>(.*?)</table>", data, re.M | re.S | re.I)]

 # Parse the data; html2data returns a tuple with 0th index as the list of lists
 parsed_tables = [html2data(table)[0] for table in tables]

 # Let's take the first table, i.e. 600000-600099
 parsed = parsed_tables[0]

 # Column names of the first table
 print(parsed[0])

 # Rows of the first table, 2nd column
 for index in range(1, len(parsed)):
     print(parsed[index][1])

 # Output: all the rows of table 1, column 2, excluding the headers

