Remove text from the first cell of an HTML table using Python

Question

I have this file:

    <table>
    <tr>
    <td WIDTH="49%">
    <p><a href="...1.htm"> cell to remove</a></p></td>
    <td WIDTH="51%"> some text </td>
    </tr>

I need this as the result:

    <table>
    <tr>
    <td> 
    </td>
    <td WIDTH="51%"> some text </td>
    </tr>

I am trying to read the file containing this HTML and replace the first td tag with an empty one:

    ret = open('rec1.txt').read()
    re.sub('<td[^/td>]+>','<td> </td>',ret, 1)
    final= open('rec2.txt', 'w')
    final.write(ret)
    final.close()

As you can see, I am new to Python. When I read rec2.txt, it contains exactly the same text as the original file.

Thanks.

Answer 1

Using regex to parse HTML is a very bad practice (see @Lutz Horn's link in the comment).

Use an HTML parser instead. For example, here's how you can set the value of the first td tag to empty using BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

from bs4 import BeautifulSoup


data = """
<table>
    <tr>
        <td WIDTH="49%">
            <p><a href="...1.htm"> cell to remove</a></p>
        </td>
        <td WIDTH="51%">
            some text
        </td>
    </tr>
</table>"""

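soup = BeautifulSoup(data, 'html.parser')
cell = soup.table.tr.td  # first <td> of the first row
cell.string = ''         # replace the cell's contents with an empty string
cell.attrs = {}          # drop the WIDTH attribute

print(soup.prettify(formatter='html'))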

prints:

<table>
 <tr>
  <td>
  </td>
  <td width="51%">
   some text
  </td>
 </tr>
</table>
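
The same idea can be wrapped in a small function that works on the unclosed snippet from the question. This is only a sketch: the helper name `clear_first_cell` is made up for illustration, and it uses `find('td')` and `Tag.clear()` instead of the `soup.table.tr.td` navigation and `cell.string = ''` shown above.

```python
from bs4 import BeautifulSoup


def clear_first_cell(html):
    """Return html with the first <td> emptied and its attributes removed.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    cell = soup.find('td')  # first <td> in document order, or None
    if cell is not None:
        cell.clear()        # remove all children of the cell
        cell.attrs = {}     # drop WIDTH and any other attributes
    return str(soup)


# The snippet from the question, including the missing </table>.
sample = '''<table>
<tr>
<td WIDTH="49%">
<p><a href="...1.htm"> cell to remove</a></p></td>
<td WIDTH="51%"> some text </td>
</tr>
'''

result = clear_first_cell(sample)
print(result)
```

To work on the files from the question, read rec1.txt into a string, pass it to the function, and write the returned string to rec2.txt. The parser re-serializes the markup, so details such as attribute-name case may differ from the input.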

See also:

Parsing HTML in Python
Parsing HTML using Python

Hope that helps.

Answer 2

Using regex to parse HTML is a very bad practice. If you are actually trying to modify HTML, use an HTML parser.

If the question is academic, or you are only trying to make the limited transformation you describe in the question, here is a regex program that will do it:

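#!/usr/bin/python
import re

# Read the whole file into one string.
with open('rec1.txt') as src:
    ret = src.read()

# Replace only the first <td>...</td> cell; re.DOTALL lets . span newlines.
ret = re.sub(r'<td.*?/td>', '<td> </td>', ret, count=1, flags=re.DOTALL)

with open('rec2.txt', 'w') as final:
    final.write(ret)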

Notes:

The expression [/td] means match any one of /, t, or d, in any order. With the leading ^, your [^/td>]+ matches a run of characters that are none of /, t, d, or >. Note instead how I used .* to match an arbitrary string followed by /td.
The final, optional, argument to re.sub() is a flags argument. re.DOTALL allows . to match newlines.
The ? means to perform a non-greedy search, so it will only consume one cell.
re.sub() returns the resulting string; it does not modify the string in place.
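
The notes above can be checked with a short experiment on the snippet from the question. This is an illustrative sketch only; the variable names (`html`, `original`, `fixed`, `greedy`) are made up for the demonstration.

```python
import re

# The snippet from the question.
html = '''<table>
<tr>
<td WIDTH="49%">
<p><a href="...1.htm"> cell to remove</a></p></td>
<td WIDTH="51%"> some text </td>
</tr>
'''

# The question's pattern: [^/td>]+ is a negated character class, so it
# stops at the first '>' and matches only the opening tag of the cell.
original = re.sub('<td[^/td>]+>', '<td> </td>', html, count=1)

# Non-greedy .*? with re.DOTALL stops at the first "/td>", so exactly
# one cell (opening tag, contents, closing tag) is replaced.
fixed = re.sub(r'<td.*?/td>', '<td> </td>', html, count=1, flags=re.DOTALL)

# Greedy .* with re.DOTALL runs on to the last "/td>" and swallows
# the second cell as well.
greedy = re.sub(r'<td.*/td>', '<td> </td>', html, count=1, flags=re.DOTALL)

# re.sub() returned new strings each time; html itself is unchanged.
print('cell to remove' in html)      # the input still has the link text
print('cell to remove' in original)  # contents survive the question's pattern
print('cell to remove' in fixed)     # contents removed
print('some text' in greedy)         # second cell lost with a greedy match
```

The four checks show why the result must be assigned back (`ret = re.sub(...)`) and why the non-greedy form is the one to use here.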

Answer 1: score 4, accepted, 2014-03-10 14:56:13
Answer 2: score 1, 2014-03-10 15:00:57
