
How to sanitise a block of text in Python 3 with no external modules?

I was recently set a HackerRank challenge, and I couldn't sanitise a block of text by removing tags without breaking the rest of the text in Python 3.

Two example inputs were provided (below), and the challenge was to clean them into safe, normal text blocks. The time to complete the challenge is over, but I'm confused about how I got something so simple so wrong. Any help on how I should have gone about it would be appreciated.

Test input one

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
var y=window.prompt("Hello")
window.alert(y)
</script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Test input two

In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work.  The full details of your in-text references, <script language="JavaScript">
document.write("Page. Last update:" + document.lastModified); </script>When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. 
The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Test proposed output 1

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Test proposed output 2

  In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work. The full details of your in-text references, When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Thanks in advance!

EDIT (using @YakovDan's sanitisation): The code:

def sanitize(inp_str):
    ignore_flag = False
    close_tag_count = 0

    out_str = ""
    for c in inp_str:
        if not ignore_flag:
            if c == '<':
                close_tag_count = 2
                ignore_flag = True
            else:
                out_str += c
        else:
            if c == '>':
                close_tag_count -= 1

            if close_tag_count == 0:
                ignore_flag = False

    return out_str

inp = input()
print(sanitize(inp))

The input:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
 var y=window.prompt("Hello")
 window.alert(y)
 </script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

The output:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy.

What the output should be:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

In general, regular expressions are the wrong tool for parsing HTML tags (see here), but a regex will work for this job since the tags are simple. If the input is not regular (for example, tags that have no closing tag), it will fail.

That being said, for these two examples, you can use this regex:

<.*?>.*?<\s*?\/.*?>

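Implemented in Python:

import re

# s stands for one of your long input strings; a short sample is used here
s = 'Some text. <script>\nwindow.alert("hi")\n</script>More text.'
# Raw string avoids invalid-escape warnings; DOTALL lets . match newlines
r = re.sub(r'<.*?>.*?<\s*?\/.*?>', '', s, flags=re.DOTALL)
print(r)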

which gives the expected results (too long-winded to copy them in!).
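For inputs that go beyond these two examples, a parser-based approach may hold up better than a regex. The following is only a minimal sketch using `html.parser` from the standard library (so still no external modules); the class name `ScriptStripper` and the helper `strip_scripts` are made up for illustration, and the choice to skip only `script` and `style` content is an assumption about what "safe" means here.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Hypothetical helper: collects text, skipping <script>/<style> content."""

    SKIPPED = {"script", "style"}  # assumption: only these elements are unsafe

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text fragments to keep
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_scripts(text):
    parser = ScriptStripper()
    parser.feed(text)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.parts)
```

Because the parser treats the body of a `script` element as raw text, a `>` inside the JavaScript does not end the tag early. Note that with `convert_charrefs=True` any character references in the kept text (such as `&amp;`) are decoded, which may or may not be what the challenge expects.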

Here's a way to do this without regex.

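def sanitize(inp_str):
    ignore_flag = False    # True while we are inside a tag pair
    close_tag_count = 0    # number of '>' still expected before resuming

    out_str = ""
    for c in inp_str:
        if not ignore_flag:
            if c == '<':
                # Expect two '>': one ends the opening tag, one the closing tag
                close_tag_count = 2
                ignore_flag = True
            else:
                out_str += c
        else:
            if c == '>':
                close_tag_count -= 1

            if close_tag_count == 0:
                ignore_flag = False

    return out_str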

This should do it (up to assumptions about tags: every opening tag has a matching closing tag, and no other '>' appears between them).
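Two things may still trip this up. First, `input()` reads a single line, which would explain why the edited attempt in the question stops at "infancy.": the rest of the multi-line input is never read. Second, a `>` inside the JavaScript would throw off the count of two. The sketch below addresses both under stated assumptions: tags are written in lowercase exactly as `<script` and `</script>` (as in the two examples), and the names `remove_script_blocks`, `sanitize_lines`, and `main` are hypothetical.

```python
import sys


def remove_script_blocks(text):
    """Drop every <script ...>...</script> span, keeping the surrounding text."""
    out = []
    pos = 0
    while True:
        start = text.find("<script", pos)
        if start == -1:
            out.append(text[pos:])  # no more script blocks: keep the rest
            break
        out.append(text[pos:start])  # keep the text before the block
        end = text.find("</script>", start)
        if end == -1:
            # Unclosed tag: assume the remainder is script and drop it.
            break
        pos = end + len("</script>")
    return "".join(out)


def sanitize_lines(lines):
    """Join all lines first, so a tag spanning several lines is seen whole."""
    return remove_script_blocks("".join(lines))


def main():
    # sys.stdin yields every input line, unlike a single input() call.
    sys.stdout.write(sanitize_lines(sys.stdin))
```

Calling `main()` reads the whole of standard input before sanitising it. Searching for the literal closing tag rather than counting `>` characters means a comparison such as `1 > 0` inside the script no longer ends the skipped region early.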
