How do you replace HTML tags with commas (for CSV) in Python?

I have an extremely long HTML file that I cannot modify but would like to parse into CSV output. Imagine the following code repeated hundreds of times, all on one line. I realize this would be much simpler if there were line breaks, but I have no control over how the file is created. The real file is fully minified; I have added line breaks below only to make it easier to read. Any actual solution cannot rely on line breaks or spaces, because they do not exist in the real input.

<tr id="link">
<td><a href="https://www.somewebsite.com" target="_target">Title</a></td>
<td>Value 1</td><td style="width:20ch">Value 2</td>
<td></td><td></td><td>Value 3</td>
<td>Value 4</td><td>Value 5</td><td>Value 6</td>
<td>Value 7</td><td>Value 8</td><td>Value 9</td></tr>

My desired output from this is https://www.somewebsite.com, Title, Value 1, Value 2, , , Value 3, ... (etc.)

Basically, I want to replace all of the tags with commas but retain the URL. I cannot find any way in Python to parse something like this, since the scan(), find(), etc. functions in Python do not seem to keep track of the file pointer globally as I'm used to in languages like C. So no matter what I do, I'm continually just looking at the beginning of the line.
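On the find() point: str.find does not remember where the previous search ended, but it takes an optional start index, so the position can be carried forward by hand. Here is a minimal sketch, assuming the whole file has been read into one string. The helper name cell_texts and the shortened sample are made up for illustration. It only demonstrates the offset idea and is not a substitute for a real HTML parser (a cell containing a link would come back with the <a> markup still in it).

```python
# One-line input with no breaks or spaces between tags, like the real file.
html = '<tr><td>Value 1</td><td style="width:20ch">Value 2</td><td></td></tr>'


def cell_texts(s):
    """Return the raw content of every <td>...</td>, scanning left to right."""
    cells = []
    pos = 0  # plays the role of the "file pointer"
    while True:
        start = s.find('<td', pos)        # search resumes at pos, not at index 0
        if start == -1:
            break
        open_end = s.find('>', start)     # end of the opening tag (may carry attributes)
        close = s.find('</td>', open_end)
        if open_end == -1 or close == -1:
            break                         # malformed tail: stop rather than guess
        cells.append(s[open_end + 1:close])
        pos = close + len('</td>')        # advance past this cell
    return cells


print(cell_texts(html))
```

re.finditer would give the same left-to-right walk without manual bookkeeping, but for real HTML a parser is usually the safer choice.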

from bs4 import BeautifulSoup

html_doc = """
<tr id="link">
<td><a href="https://www.somewebsite.com" target="_target">Title</a></td>
<td>Value 1</td><td style="width:20ch">Value 2</td>
<td></td><td></td><td>Value 3</td>
<td>Value 4</td><td>Value 5</td><td>Value 6</td>
<td>Value 7</td><td>Value 8</td><td>Value 9</td></tr>"""

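# The parser works on the markup itself, so it does not matter whether
# the input contains line breaks or is fully minified.
for tr in BeautifulSoup(html_doc, 'html.parser').find_all('tr'):
    row = []
    for td in tr.find_all('td'):
        anchor = td.find('a')
        # A link cell contributes its URL and its text; any other cell just its
        # text (an empty string for <td></td>, which keeps the empty columns).
        row.extend([anchor['href'], anchor.text] if anchor else [td.text])
    print(', '.join(row))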
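One caveat with joining on ', ': if a cell's text ever contains a comma or a quote, the line is no longer valid CSV. A small sketch of handing the rows to the standard csv module instead, which quotes such fields. The function name rows_to_csv and the sample rows are assumptions for illustration; in practice each row would be a list built the same way as above.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (lists of strings) as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)   # default dialect: comma delimiter, minimal quoting
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical rows, shaped like the lists collected per <tr>.
rows = [
    ['https://www.somewebsite.com', 'Title', 'Value 1', 'Value 2', '', '', 'Value 3'],
    ['https://www.somewebsite.com', 'Title, with comma', 'Value 1'],
]
print(rows_to_csv(rows))
```

To write straight to disk, open the file with newline='' and pass that file object to csv.writer in place of the StringIO buffer.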
