Extracting particular element in HTML file and inserting into CSV

I have an HTML table stored in a file. I want to take the value of each td in the table that has an attribute like this:

<td describedby="grid_1-1" ... >Value for CSV</td>
<td describedby="grid_1-1" ... >Value for CSV2</td>
<td describedby="grid_1-1" ... >Value for CSV3</td>
<td describedby="grid_1-2" ... >Value for CSV4</td>

and I want to put it into a CSV file, with each new value taking up a new line in the CSV.

So for the file above, the CSV produced would be:

Value for CSV
Value for CSV2
Value for CSV3

Value for CSV4 would be ignored, as its attribute is describedby="grid_1-2", not "grid_1-1".

So I have tried this; however, no matter what I try there seems to be (a) a blank line between each printed line and (b) a comma separating each character.

So the output looks more like:

V,a,l,u,e,f,o,r,C,S,V,

V,a,l,u,e,f,o,r,C,S,V,2

What silly thing have I done now?

Thanks :)

import csv
import os
from bs4 import BeautifulSoup

with open("C:\\Users\\ADMIN\\Desktop\\test.html", 'r') as orig_f:
    soup = BeautifulSoup(orig_f.read())
    results = soup.findAll("td", {"describedby":"grid_1-1"})
    with open('C:\\Users\\ADMIN\\Desktop\\Deploy.csv', 'wb') as fp:
        a = csv.writer(fp, delimiter=',')
        for result in results :
            a.writerows(result)

If result is a list containing a string, you need to wrap it in another list, because writerows expects an iterable of rows (each row itself an iterable of fields) and would otherwise iterate over the characters of the string:

a.writerows([result])  # wrap in a list

In your case you should use writerow and extract the text from each td tag in results:

a.writerow([result.text])  # write the text from the td element

You already have all the td tags in your results list, so you just need to extract the text with .text.
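To make the difference concrete, here is a minimal sketch using only the standard library. It assumes Python 3; `io.StringIO` stands in for the output file, and the helper name `render` is made up for this illustration. It shows what `csv.writer` emits when it is handed a bare string versus a string wrapped in a list.

```python
import csv
import io


def render(write):
    """Run `write` against a csv.writer and return the text it produced."""
    buf = io.StringIO()
    write(csv.writer(buf, delimiter=','))
    return buf.getvalue()


value = "Value"

# writerows() treats its argument as a sequence of rows, so each character
# of a bare string becomes its own one-field row.
rows_from_string = render(lambda w: w.writerows(value))

# writerow() treats its argument as a single row of fields, so a bare
# string is split into one field per character.
row_from_string = render(lambda w: w.writerow(value))

# Wrapping the string in a list gives one row with exactly one field.
row_from_list = render(lambda w: w.writerow([value]))

print(repr(rows_from_string))
print(repr(row_from_string))
print(repr(row_from_list))
```

The `\r\n` endings visible in the printed output are the csv module's default line terminator. On Python 3 under Windows, opening the output file in text mode without `newline=''` is a common cause of the blank line between rows, so that may be worth checking as well.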

Use the lxml and csv modules.

  1. Get the text of every td whose describedby attribute has the value grid_1-1, using lxml's xpath() method.
  2. Open the CSV file in write mode.
  3. Write each value as a row of the CSV file with the writerow() method.

Code:

from lxml import etree
import csv

content = """
<body>
<td describedby="grid_1-1">Value for CSV</td>
<td describedby="grid_1-1">Value for CSV2</td>
<td describedby="grid_1-1">Value for CSV3</td>
<td describedby="grid_1-2">Value for CSV4</td>
</body>
"""

# Parse the markup and select the text of every td whose describedby is grid_1-1
root = etree.fromstring(content)
values = root.xpath("//td[@describedby='grid_1-1']/text()")

# Python 3: open in text mode with newline='' (on Python 2, use 'wb' instead)
with open('/home/vivek/Desktop/output.csv', 'w', newline='') as fp:
    writer = csv.writer(fp, delimiter=',')
    for value in values:
        writer.writerow([value])

Output:

Value for CSV
Value for CSV2
Value for CSV3
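If neither BeautifulSoup nor lxml is available, the same filtering can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not what either answer above uses. The class name `CellCollector` is invented for this example, Python 3 is assumed, and the sketch does not handle nested tables or td elements missing their closing tag.

```python
import csv
import io
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of <td> elements with a given describedby value."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.cells = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("describedby") == self.wanted:
            self._inside = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._inside = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching td
        if self._inside:
            self.cells[-1] += data


html = """
<body>
<td describedby="grid_1-1">Value for CSV</td>
<td describedby="grid_1-1">Value for CSV2</td>
<td describedby="grid_1-1">Value for CSV3</td>
<td describedby="grid_1-2">Value for CSV4</td>
</body>
"""

parser = CellCollector("grid_1-1")
parser.feed(html)
parser.close()

# io.StringIO stands in for a file opened with open(path, 'w', newline='')
buf = io.StringIO()
writer = csv.writer(buf)
for cell in parser.cells:
    writer.writerow([cell])  # one value per row, wrapped in a list
csv_text = buf.getvalue()
print(csv_text)
```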
