简体   繁体   中英

convert html table to csv in python

I'm trying to scrape a table from a dynamic page. After the following code (requires selenium), I manage to get the contents of the <table> elements.

I'd like to convert this table into a csv and I have tried 2 things, but both fail:

  • pandas.read_html returns an error saying I don't have html5lib installed, but I do and in fact I can import it without problems.
  • soup.find_all('tr') returns an error 'NoneType' object is not callable after I run soup = BeautifulSoup(tablehtml)

Here is my code:

import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.common.keys import Keys
import pandas as pd

main_url = "http://data.stats.gov.cn/english/easyquery.htm?cn=E0101"
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(7)
driver.find_element_by_partial_link_text("Industry").click()
time.sleep(7)
driver.find_element_by_partial_link_text("Main Economic Indicat").click()
time.sleep(6)
driver.find_element_by_id("mySelect_sj").click()
time.sleep(2)
driver.find_element_by_class_name("dtText").send_keys("last72")
time.sleep(3)
driver.find_element_by_class_name("dtTextBtn").click()
time.sleep(2)
table=driver.find_element_by_id("table_main")
tablehtml= table.get_attribute('innerHTML')

Without access to the table you're actually trying to scrape, I used this example:

<table>
<thead>
<tr>
    <td>Header1</td>
    <td>Header2</td>
    <td>Header3</td>
</tr>
</thead>  
<tr>
    <td>Row 11</td>
    <td>Row 12</td>
    <td>Row 13</td>
</tr>
<tr>
    <td>Row 21</td>
    <td>Row 22</td>
    <td>Row 23</td>
</tr>
<tr>
    <td>Row 31</td>
    <td>Row 32</td>
    <td>Row 33</td>
</tr>
</table>

and scraped it using:

from bs4 import BEautifulSoup as BS
content = #contents of that table
soup = BS(content, 'html5lib')
rows = [tr.findAll('td') for tr in soup.findAll('tr')]

This rows object is a list of lists:

[
    [<td>Header1</td>, <td>Header2</td>, <td>Header3</td>],
    [<td>Row 11</td>, <td>Row 12</td>, <td>Row 13</td>],
    [<td>Row 21</td>, <td>Row 22</td>, <td>Row 23</td>],
    [<td>Row 31</td>, <td>Row 32</td>, <td>Row 33</td>]
]

...and you can write it to a file:

for it in rows:
with open('result.csv', 'a') as f:
    f.write(", ".join(str(e).replace('<td>','').replace('</td>','') for e in it) + '\n')

which looks like this:

Header1, Header2, Header3
Row 11, Row 12, Row 13
Row 21, Row 22, Row 23
Row 31, Row 32, Row 33

Using the csv module and selenium selectors would probably be more convenient here:

import csv
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://example.com/")
table = driver.find_element_by_css_selector("#tableid")
with open('eggs.csv', 'w', newline='') as csvfile:
    wr = csv.writer(csvfile)
    for row in table.find_elements_by_css_selector('tr'):
        wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM