
Extracting particular element in HTML file and inserting into CSV

I have an HTML table stored in a file. I want to take the value of each td in the table that has an attribute like this:

<td describedby="grid_1-1" ... >Value for CSV</td>
<td describedby="grid_1-1" ... >Value for CSV2</td>
<td describedby="grid_1-1" ... >Value for CSV3</td>
<td describedby="grid_1-2" ... >Value for CSV4</td>

and I want to put it into a CSV file, with each new value taking up a new line in the CSV.

So for the file above, the CSV produced would be :

Value for CSV
Value for CSV2
Value for CSV3

Value for CSV4 would be ignored as describedby="grid_1-2", not "grid_1-1".

I have tried the code below, but no matter what I try there is (a) a blank line between each printed line and (b) a comma separating each character.

So the output looks more like this:

V,a,l,u,e,f,o,r,C,S,V,

V,a,l,u,e,f,o,r,C,S,V,2

What silly thing have I done now?

Thanks :)

import csv
import os
from bs4 import BeautifulSoup

with open("C:\\Users\\ADMIN\\Desktop\\test.html", 'r') as orig_f:
    soup = BeautifulSoup(orig_f.read())
    results = soup.findAll("td", {"describedby":"grid_1-1"})
    with open('C:\\Users\\ADMIN\\Desktop\\Deploy.csv', 'wb') as fp:
        a = csv.writer(fp, delimiter=',')
        for result in results :
            a.writerows(result)

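writerows expects an iterable of rows, where each row is itself an iterable of fields. Each result here behaves like a list containing one string, so writerows(result) treats that string as a row and writes every character as a separate field. To write it as a single row, wrap result in a list:

a.writerows([result])  # wrap in a list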
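To see the difference without any HTML involved, here is a minimal sketch using only the standard library. The helper csv_lines and the stand-in value ["CSV2"] are made up for illustration; the list plays the role of a td tag whose only child is its text.

```python
import csv
import io

def csv_lines(write):
    """Apply write(writer) to an in-memory file and return the lines written."""
    buf = io.StringIO()
    write(csv.writer(buf))
    return buf.getvalue().splitlines()

# Stand-in for one td tag: iterating it yields its single text child.
result = ["CSV2"]

# The string itself becomes the row, so each character is a field.
unwrapped = csv_lines(lambda w: w.writerows(result))
# The list becomes the row, so the string is one field.
wrapped = csv_lines(lambda w: w.writerows([result]))

print(unwrapped)  # ['C,S,V,2']
print(wrapped)    # ['CSV2']
```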

In your case you should use writerow and extract the text from each td tag in results:

  a.writerow([result.text]) # write the text from td element

results already holds all the matching td tags, so you just need to extract the text of each one with .text.
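Putting that together, a complete version might look like the sketch below. It assumes Python 3 and a current bs4, and it uses an inline HTML string and an in-memory io.StringIO in place of the files on your desktop so that it runs anywhere. To write a real file, swap the StringIO for open(path, 'w', newline=''); in Python 3 the csv module wants text mode, and newline='' avoids the extra blank lines on Windows.

```python
import csv
import io
from bs4 import BeautifulSoup

# Inline sample standing in for the contents of test.html
html = """
<table><tr>
<td describedby="grid_1-1">Value for CSV</td>
<td describedby="grid_1-1">Value for CSV2</td>
<td describedby="grid_1-1">Value for CSV3</td>
<td describedby="grid_1-2">Value for CSV4</td>
</tr></table>
"""

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td", attrs={"describedby": "grid_1-1"})

out = io.StringIO()  # replace with open(path, 'w', newline='') for a real file
writer = csv.writer(out, delimiter=",")
for cell in cells:
    # One single-column row per td, using the tag's text rather than the tag
    writer.writerow([cell.get_text(strip=True)])

csv_rows = out.getvalue().splitlines()
print(csv_rows)  # ['Value for CSV', 'Value for CSV2', 'Value for CSV3']
```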

Use lxml and the csv module.

  1. Get the text of every td whose describedby attribute equals grid_1-1, using lxml's xpath() method.
  2. Open the CSV file in write mode.
  3. Write each value to the CSV file as a row with the writerow() method.

Code:

content = """
<body>
<td describedby="grid_1-1">Value for CSV</td>
<td describedby="grid_1-1">Value for CSV2</td>
<td describedby="grid_1-1">Value for CSV3</td>
<td describedby="grid_1-2">Value for CSV4</td>
</body>
"""
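from lxml import etree
import csv

# Parse the markup and select the text of every td whose describedby is grid_1-1
root = etree.fromstring(content)
values = root.xpath("//td[@describedby='grid_1-1']/text()")

# Python 3: open in text mode with newline=''; on Python 2 use mode 'wb' instead
with open('/home/vivek/Desktop/output.csv', 'w', newline='') as fp:
    writer = csv.writer(fp, delimiter=',')
    for value in values:
        writer.writerow([value])  # one single-column row per value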

Output:

Value for CSV
Value for CSV2
Value for CSV3
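One caveat: etree.fromstring is an XML parser, so it will raise an error on HTML that is not well-formed XML, for example an unclosed br tag. If the real file is like that, lxml.html.fromstring is more forgiving. The sketch below assumes lxml is installed; the sample markup, including the extra td with the br in it, is made up for illustration.

```python
import lxml.html

# Sample markup with an unclosed <br>, which an XML parser would reject
html = """
<table>
<tr><td describedby="grid_1-1">Value for CSV</td><td>note<br>more</td></tr>
<tr><td describedby="grid_1-1">Value for CSV2</td></tr>
<tr><td describedby="grid_1-2">Value for CSV4</td></tr>
</table>
"""

root = lxml.html.fromstring(html)
# Same XPath as before: only the grid_1-1 cells are selected
values = root.xpath("//td[@describedby='grid_1-1']/text()")
print(values)  # ['Value for CSV', 'Value for CSV2']
```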
