How to replace HTML tags with commas in Python (for CSV)?

Question

I have an extremely long HTML file that I cannot modify but would like to parse into CSV output. Imagine the following code repeated hundreds of times, all on a single line. I realize this would be much simpler if there were line breaks, but I have no control over how the file is created. The real file is fully minified; I have added line breaks here only to make it easier to read, so any actual solution cannot rely on line breaks or spaces.

<tr id="link">
<td><a href="https://www.somewebsite.com" target="_target">Title</a></td>
<td>Value 1</td><td style="width:20ch">Value 2</td>
<td></td><td></td><td>Value 3</td>
<td>Value 4</td><td>Value 5</td><td>Value 6</td>
<td>Value 7</td><td>Value 8</td><td>Value 9</td></tr>

My desired output from this is https://www.somewebsite.com, Title, Value 1, Value 2, , , Value 3, ... (etc.)

Basically, I want to replace all the tags with commas but retain the URL. I cannot find any way in Python to parse something like this, since scan(), find(), etc. do not seem to keep track of a position in the file globally the way I'm used to in languages like C. So, no matter what I do, I keep looking at the beginning of the line.

Answer 1

from bs4 import BeautifulSoup

html_doc = """
<tr id="link">
<td><a href="https://www.somewebsite.com" target="_target">Title</a></td>
<td>Value 1</td><td style="width:20ch">Value 2</td>
<td></td><td></td><td>Value 3</td>
<td>Value 4</td><td>Value 5</td><td>Value 6</td>
<td>Value 7</td><td>Value 8</td><td>Value 9</td></tr>"""

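# Build one comma-separated line per table row.
for tr in BeautifulSoup(html_doc, 'html.parser').find_all('tr'):
    row = []
    for td in tr.find_all('td'):
        anchor = td.find('a')
        # A cell with a link contributes its URL and its text;
        # any other cell contributes its text ('' if the cell is empty).
        row.extend([anchor['href'], anchor.text] if anchor else [td.text])
    print(', '.join(row))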
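
If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with only the standard library. This is not the approach used above: it swaps in `html.parser.HTMLParser` for the parsing and `csv.writer` for the output. The names `RowExtractor` and `html_rows_to_csv` are hypothetical helpers made up for this sketch, and it assumes each link sits inside a `<td>` and that tables are not nested.

```python
import csv
import io
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect one list of cell values per <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the <tr> currently open
        self._cell = None   # text fragments of the <td> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "a" and self._cell is not None:
            # Emit the link target as its own field, ahead of the link text.
            href = dict(attrs).get("href")
            if href is not None:
                self._row.append(href)

    def handle_data(self, data):
        # Text outside a <td> (e.g. whitespace between cells) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_rows_to_csv(html_text):
    """Return CSV text with one line per <tr> found in html_text."""
    parser = RowExtractor()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Minified input: no line breaks or spaces between tags.
sample = (
    '<tr id="link"><td><a href="https://www.somewebsite.com" '
    'target="_target">Title</a></td><td>Value 1</td>'
    '<td style="width:20ch">Value 2</td><td></td><td></td>'
    '<td>Value 3</td></tr>'
)
print(html_rows_to_csv(sample), end="")
# prints: https://www.somewebsite.com,Title,Value 1,Value 2,,,Value 3
```

Note that `csv.writer` separates fields with a bare comma (no trailing space) and quotes any value that itself contains a comma, which is usually what a CSV consumer expects. Because the parser works on tags rather than lines, the input can be fully minified.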

Answer 1
0 Accepted 2016-08-10 21:32:39
