How to delete a specific part of an HTML file in Python
I am working on an HTML file that contains Item 1, Item 2, and Item 3. I want to delete all the text that comes after Item 2. I can find Item 2 in the file like this:
import re

Item2 = re.compile(r'(Item 2)', re.I | re.S)
Item2match = Item2.findall(file)  # file is assumed to hold the HTML text as a string
but I don't know how to delete the text that comes after it.
Simply use string methods to split the HTML text and take the first part; str.partition() is the simplest way to do that:
file.partition('Item 2')[0]
If you want to keep the Item 2 text too, use:
''.join(file.partition('Item 2')[:2])
There is no need to use a regular expression here; you are matching literal text. Regular expressions are a wonderfully expressive and powerful tool, but don't use them if there are simpler alternatives.
Demo:
>>> 'Some text with Item 2 in it'.partition('Item 2')[0]
'Some text with '
>>> ''.join('Some text with Item 2 in it'.partition('Item 2')[:2])
'Some text with Item 2'
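The two partition() variants above can be wrapped in one small helper. This is only a sketch: the name truncate_after and its keep_marker parameter are made up for illustration and are not part of the original answer. It relies on the fact that str.partition() returns (text, '', '') when the separator is not found, so the input comes back unchanged in that case.

```python
def truncate_after(text, marker, keep_marker=True):
    """Return text cut off at the first occurrence of marker (hypothetical helper)."""
    head, sep, _tail = text.partition(marker)
    # If marker is absent, partition() yields (text, '', ''), so head is the whole text
    return head + sep if keep_marker else head


html = '<p>Item 1</p><p>Item 2</p><p>Item 3</p>'
print(truncate_after(html, 'Item 2'))                     # <p>Item 1</p><p>Item 2
print(truncate_after(html, 'Item 2', keep_marker=False))  # <p>Item 1</p><p>
```

Note that cutting raw HTML this way can leave tags unclosed, as the output above shows; whether that matters depends on what you do with the result.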
>>> re.sub(r'(?s)(?<=Item 2)(.*)', '', file)
Example:
>>> s
'Item 2...feiugeogherger\nfjweifjwef\nsfjioweiefjwe'
>>> re.sub(r'(?s)(?<=Item 2)(.*)', '', s)
'Item 2'
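The question compiled its pattern with re.I, whereas both str.partition() and the lookbehind above match case-sensitively. If case-insensitive matching is actually needed, something along these lines might work. It is a sketch, not part of either answer; truncate_after_ci is a hypothetical name, and re.escape() is used so the marker is treated as literal text.

```python
import re


def truncate_after_ci(text, marker):
    """Cut text right after the first case-insensitive match of marker (hypothetical helper)."""
    match = re.search(re.escape(marker), text, flags=re.IGNORECASE)
    # Leave the text unchanged when the marker does not occur at all
    return text[:match.end()] if match else text


s = 'ITEM 2...feiugeogherger\nfjweifjwef\nsfjioweiefjwe'
print(truncate_after_ci(s, 'Item 2'))  # ITEM 2
```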