Python - Web Scraping HTML table and printing to CSV

I'm pretty much brand new to Python, but I'm looking to build a web scraping tool that will pull data from an HTML table online and write it to a CSV in the same format.

Here's a sample of the HTML table (it's enormous, so I'm only going to provide a few rows).

<div class="col-xs-12 tab-content">
        <div id="historical-data" class="tab-pane active">
          <div class="tab-header">
            <h2 class="pull-left bottom-margin-2x">Historical data for Bitcoin</h2>

            <div class="clear"></div>

            <div class="row">
              <div class="col-md-12">
                <div class="pull-left">
                  <small>Currency in USD</small>
                </div>
                <div id="reportrange" class="pull-right">
                    <i class="glyphicon glyphicon-calendar fa fa-calendar"></i>&nbsp;
                    <span>Aug 16, 2017 - Sep 15, 2017</span> <b class="caret"></b>
                </div>
              </div>
            </div>

            <table class="table">
              <thead>
              <tr>
                <th class="text-left">Date</th>
                <th class="text-right">Open</th>
                <th class="text-right">High</th>
                <th class="text-right">Low</th>
                <th class="text-right">Close</th>
                <th class="text-right">Volume</th>
                <th class="text-right">Market Cap</th>
              </tr>
              </thead>
              <tbody>

                <tr class="text-right">
                  <td class="text-left">Sep 14, 2017</td>
                  <td>3875.37</td>     
                  <td>3920.60</td>
                  <td>3153.86</td>
                  <td>3154.95</td>
                  <td>2,716,310,000</td>
                  <td>64,191,600,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 13, 2017</td>
                  <td>4131.98</td>     
                  <td>4131.98</td>
                  <td>3789.92</td>
                  <td>3882.59</td>
                  <td>2,219,410,000</td>
                  <td>68,432,200,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 12, 2017</td>
                  <td>4168.88</td>     
                  <td>4344.65</td>
                  <td>4085.22</td>
                  <td>4130.81</td>
                  <td>1,864,530,000</td>
                  <td>69,033,400,000</td>
                </tr>                
              </tbody>
            </table>
          </div>

        </div>
    </div>

I'm particularly interested in recreating the table with the same column headers provided: "Date," "Open," "High," "Low," "Close," "Volume," "Market Cap." Currently, I've been able to write a simple script that goes to the URL, downloads the HTML, parses it with BeautifulSoup, and then uses 'for' statements to get the td elements. Below is a sample of my code (URL omitted) and the result:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "enterURLhere"
page = requests.get(url)
pagetext = page.text

pricetable = {
    "Date" : [],
    "Open" : [],
    "High" : [],
    "Low" : [],
    "Close" : [],
    "Volume" : [],
    "Market Cap" : []
}

soup = BeautifulSoup(pagetext, 'html.parser')

file = open("test.csv", 'w')

for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)

[image: sample output]

Does anyone have any pointers on how to at least reformat the pulled data into the table? Thanks.

Run the code below and you will get the desired data from that table. To try it on this very element, all you need to do is wrap the whole HTML snippet you pasted above in html=''' ''' before running it.

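import csv
from bs4 import BeautifulSoup

# `html` is the table markup from the question, wrapped in html = ''' '''
tree = BeautifulSoup(html, "lxml")
table_tag = tree.select("table")[0]

# One list per <tr>, holding the text of each header or data cell
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in table_tag.select("tr")]

# newline='' keeps the csv module from adding blank lines on Windows;
# the with block closes the file once every row has been written
with open("table_data.csv", "w", newline='') as outfile:
    writer = csv.writer(outfile)
    for data in tab_data:
        writer.writerow(data)
        print(' '.join(data))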
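
For a version that runs without pasting the page first, here is a minimal sketch that embeds a shortened copy of the table from the question (three columns, two rows). It assumes `beautifulsoup4` is installed and swaps in the built-in `html.parser` so that `lxml` is not required. The helper names `table_to_rows` and `rows_to_csv` are made up for illustration and are not part of the answer above.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the table from the question (illustration only).
html = '''
<table class="table">
  <thead>
    <tr>
      <th class="text-left">Date</th>
      <th class="text-right">Open</th>
      <th class="text-right">Volume</th>
    </tr>
  </thead>
  <tbody>
    <tr class="text-right">
      <td class="text-left">Sep 14, 2017</td>
      <td>3875.37</td>
      <td>2,716,310,000</td>
    </tr>
    <tr class="text-right">
      <td class="text-left">Sep 13, 2017</td>
      <td>4131.98</td>
      <td>2,219,410,000</td>
    </tr>
  </tbody>
</table>
'''


def table_to_rows(markup):
    """Return the first table in `markup` as a list of rows of cell strings."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table")
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text using the csv module's default dialect."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = table_to_rows(html)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Because the dates and the volume figures contain commas, `csv.writer` wraps those fields in double quotes, so the columns still line up when the file is read back or opened in a spreadsheet.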

I've tried to break the code into pieces to make it easier to follow. The list comprehension above is really a nested for loop. Here is the same logic written out step by step:

from bs4 import BeautifulSoup

# `html` is the table markup from the question, wrapped in html = ''' '''
soup = BeautifulSoup(html, "lxml")
table = soup.find('table')

list_of_rows = []
for row in table.find_all('tr'):
    # Collect the text of every header (th) or data (td) cell in this row
    list_of_cells = []
    for cell in row.find_all(["th", "td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))

Result:

Date Open High Low Close Volume Market Cap
Sep 14, 2017 3875.37 3920.60 3153.86 3154.95 2,716,310,000 64,191,600,000
Sep 13, 2017 4131.98 4131.98 3789.92 3882.59 2,219,410,000 68,432,200,000
Sep 12, 2017 4168.88 4344.65 4085.22 4130.81 1,864,530,000 69,033,400,000
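
The question started from a `pricetable` dict keyed by column header. As a rough sketch of how rows like the ones printed above could be folded into that shape, the snippet below hard-codes two parsed rows so that it runs on its own. `to_number` and `rows_to_columns` are hypothetical helpers, and converting the figures to `float` is an assumption about what is wanted downstream.

```python
# Rows in the shape produced by the parsing loops (header row first).
list_of_rows = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
    ["Sep 13, 2017", "4131.98", "4131.98", "3789.92", "3882.59",
     "2,219,410,000", "68,432,200,000"],
]


def to_number(text):
    """Turn '2,716,310,000' into a float; return non-numeric text unchanged."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return text


def rows_to_columns(rows):
    """Build {header: [values, ...]} from a header row plus data rows."""
    header, *body = rows
    columns = {name: [] for name in header}
    for row in body:
        for name, value in zip(header, row):
            columns[name].append(to_number(value))
    return columns


pricetable = rows_to_columns(list_of_rows)
print(pricetable["Date"])
print(pricetable["Close"])
```

If pandas is installed (the question already imports it), a dict of equal-length lists like this can be handed to `pd.DataFrame(pricetable)` and written out with `to_csv("test.csv", index=False)`.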
