Extract specific elements from an HTML file and insert them into a CSV

Question

I have an HTML table stored in a file. I want to take the value of each td in the table that has an attribute like this:

<td describedby="grid_1-1" ... >Value for CSV</td>
<td describedby="grid_1-1" ... >Value for CSV2</td>
<td describedby="grid_1-1" ... >Value for CSV3</td>
<td describedby="grid_1-2" ... >Value for CSV4</td>

I want to put these values into a CSV file, with each value on its own line.

So for the file above, the CSV produced would be:

Value for CSV
Value for CSV2
Value for CSV3

Value for CSV4 would be ignored because its attribute is describedby="grid_1-2", not "grid_1-1".

I have tried the code below, but whatever I do, the output has (a) a blank line between each printed line and (b) a comma separating each character.

So the output looks more like this:

V,a,l,u,e,f,o,r,C,S,V,

V,a,l,u,e,f,o,r,C,S,V,2

What silly thing have I done now?

Thanks :)

import csv
import os
from bs4 import BeautifulSoup

with open("C:\\Users\\ADMIN\\Desktop\\test.html", 'r') as orig_f:
    soup = BeautifulSoup(orig_f.read())
    results = soup.findAll("td", {"describedby":"grid_1-1"})
    with open('C:\\Users\\ADMIN\\Desktop\\Deploy.csv', 'wb') as fp:
        a = csv.writer(fp, delimiter=',')
        for result in results :
            a.writerows(result)

Answer 1

If result is a string, you need to wrap it in a list. The csv writer expects each row to be an iterable of fields (and writerows expects an iterable of such rows), so a bare string is iterated character by character:

a.writerows([result])  # wrap in a list

In your case you should use writerow and extract the text from each td tag in results:

a.writerow([result.text])  # write the text from the td element

You already have all the td tags in your results list, so you just need to extract the text with .text.
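To see why the wrapping matters, here is a minimal sketch using only the standard library. It assumes Python 3 and writes to an in-memory buffer instead of a file; the helper name `written_lines` and the variable `text` (standing in for `result.text`) are illustrative and not part of the original code.

```python
import csv
import io


def written_lines(write):
    """Run `write` against an in-memory csv.writer and return the lines it produced."""
    buffer = io.StringIO()
    write(csv.writer(buffer, delimiter=","))
    return buffer.getvalue().splitlines()


text = "Value for CSV"  # hypothetical stand-in for result.text

# A bare string passed as a row is iterated character by character,
# so every character becomes its own comma-separated field.
split_row = written_lines(lambda w: w.writerow(text))

# A one-element list is a row with a single field: the whole string.
single_field = written_lines(lambda w: w.writerow([text]))

print(split_row)
print(single_field)
```

The first call reproduces the comma-between-every-character symptom from the question; the second writes the value as one field on one line.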

Answer 2

Use the lxml and csv modules.

Get the text of every td whose describedby attribute equals grid_1-1 with lxml's xpath() method.
Open the CSV file in write mode.
Write each value to the CSV file as a row with the writerow() method.

Code:

import csv
from lxml import etree

content = """
<body>
<td describedby="grid_1-1">Value for CSV</td>
<td describedby="grid_1-1">Value for CSV2</td>
<td describedby="grid_1-1">Value for CSV3</td>
<td describedby="grid_1-2">Value for CSV4</td>
</body>
"""

# Parse the markup and select the text of each td whose describedby is grid_1-1.
root = etree.fromstring(content)
values = root.xpath("//td[@describedby='grid_1-1']/text()")

# Python 2: open the file in binary mode.
# On Python 3, use open(path, 'w', newline='') instead.
with open('/home/vivek/Desktop/output.csv', 'wb') as fp:
    writer = csv.writer(fp, delimiter=',')
    for value in values:
        writer.writerow([value])  # one value per row

Output:

Value for CSV
Value for CSV2
Value for CSV3
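For readers without lxml or BeautifulSoup installed, the same filtering can be sketched with Python 3's standard library. Note that this swaps in html.parser in place of the libraries used in the answers; the class name `CellCollector` and the in-memory buffer are assumptions made for illustration. With a real file, opening it as open(path, 'w', newline='') on Python 3 avoids the extra blank lines between rows.

```python
import csv
import io
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose describedby attribute matches `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.in_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("describedby") == self.wanted:
            self.in_cell = True
            self.values.append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_cell:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False


html = """
<table><tr>
<td describedby="grid_1-1">Value for CSV</td>
<td describedby="grid_1-1">Value for CSV2</td>
<td describedby="grid_1-1">Value for CSV3</td>
<td describedby="grid_1-2">Value for CSV4</td>
</tr></table>
"""

collector = CellCollector("grid_1-1")
collector.feed(html)
collector.close()

# Write one value per row; a StringIO stands in for the output file here.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")
for value in collector.values:
    writer.writerow([value])

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Only the three grid_1-1 cells are collected; the grid_1-2 cell is skipped because its attribute value does not match.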

Answer 1
3 (accepted), 2015-01-28 13:43:09

Answer 2
1, 2015-01-28 13:51:05
