How to change the case of html text to sentence case in python

I have a string containing HTML text; let's call it S.

S = "<b>this is a sentence. and this is one more sentence</b>"

I want to convert S above into the following text:

S = <b>This is a sentence. And this is one more sentence</b>

The problem is that my function can convert plain text to sentence case, but when the text contains HTML it has no way to tell which part is text and which part is markup that should be left alone. So when I give S as input to my function, it yields this incorrect result:

S = <b>this is a sentence. And this is one more sentence</b>

This happens because it treats '<' as the first character of the sentence and tries to convert it to uppercase, which leaves it as '<'.
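The failure is easy to reproduce. The function from the question is not shown, so the sketch below uses a hypothetical stand-in, `naive_sentence_case`, which is assumed to uppercase the first character of the string and the first character after each ". ".

```python
def naive_sentence_case(text):
    # Hypothetical stand-in for the asker's function (an assumption):
    # uppercase the first character of every ". "-separated chunk.
    parts = text.split('. ')
    return '. '.join(p[:1].upper() + p[1:] for p in parts)


S = "<b>this is a sentence. and this is one more sentence</b>"
result = naive_sentence_case(S)
print(result)
# prints: <b>this is a sentence. And this is one more sentence</b>
```

The first chunk starts with '<', so `upper()` has nothing to change there and the first word stays lowercase.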

My question is: how do I convert text to sentence case in Python when the text is already marked up as HTML? I don't want to lose the HTML formatting.

An overly simplistic approach would be:

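import xml.etree.ElementTree as ET

S = "<b> This is sentence. and this is one more. </b>"

delim = '. '


def convert(sentence):
    # Uppercase the first non-whitespace character; leave the rest untouched.
    stripped = sentence.lstrip()
    if not stripped:
        return sentence
    lead = sentence[:len(sentence) - len(stripped)]
    return lead + stripped[0].upper() + stripped[1:]


def convert_text(text):
    # Split on the delimiter, fix each sentence, and rejoin
    # without appending an extra trailing delimiter.
    if not text:
        return text
    return delim.join(convert(sentence) for sentence in text.split(delim))


def convert_node(node):
    # Only .text and .tail are rewritten, so the markup itself is preserved.
    for child in node.iter():
        child.text = convert_text(child.text)
        child.tail = convert_text(child.tail)
    return node


node = ET.fromstring(S)
# encoding='unicode' makes tostring return str instead of bytes on Python 3.
S = ET.tostring(convert_node(node), encoding='unicode')

# gives '<b> This is sentence. And this is one more. </b>'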

Obviously, this will not cover every situation, but it will work if the task is constrained well enough. This approach should be adaptable to the function you already have. Essentially, I believe you need to use a parser to parse the HTML and then manipulate the text values of each HTML node.
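One caveat is that xml.etree.ElementTree is an XML parser, so it only accepts well-formed input. For looser HTML, the same idea can be sketched with the standard library's html.parser. The class and function names below (`SentenceCaser`, `sentence_case_html`) are made up for illustration, and the sentence rule (a letter is a sentence start at the beginning of the text or after '.', '!' or '?' followed by whitespace) is an assumption, not something from the question.

```python
from html.parser import HTMLParser

RAW_TEXT_TAGS = ('script', 'style')  # contents are passed through unchanged


class SentenceCaser(HTMLParser):
    """Re-emit the markup as-is and uppercase the first letter of each sentence."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; intact.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.at_start = True     # next alphanumeric character begins a sentence
        self.after_stop = False  # previous character was '.', '!' or '?'
        self.raw = False         # currently inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag in RAW_TEXT_TAGS:
            self.raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Note: the end tag is re-emitted in normalized, lowercase form.
        self.out.append('</%s>' % tag)
        if tag in RAW_TEXT_TAGS:
            self.raw = False

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)

    def handle_data(self, data):
        if self.raw:
            self.out.append(data)
            return
        chars = []
        for ch in data:
            if ch.isspace():
                if self.after_stop:
                    self.at_start = True
            elif self.at_start and ch.isalnum():
                ch = ch.upper()
                self.at_start = False
            self.after_stop = ch in '.!?'
            chars.append(ch)
        self.out.append(''.join(chars))


def sentence_case_html(html):
    parser = SentenceCaser()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


print(sentence_case_html("<b>this is a sentence. and this is one more sentence</b>"))
# prints: <b>This is a sentence. And this is one more sentence</b>
```

Because the sentence state lives on the parser object rather than on each text node, text that follows an inline tag in the middle of a sentence is not capitalized, and a sentence may start inside one tag and continue after it. Abbreviations such as "e.g. " will still be treated as sentence ends.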

If you are reluctant to use a parser, use a regex. This is likely to be much more fragile, so you must constrain your inputs much more. Something like this as a start:

import re
S = "<b>this is a sentence. and this is one more sentence</b>"
split_str = re.split(r'(</?\w+>|\.)', S)
# split_str is ['', '<b>', 'this is a sentence', '.', ' and this is one more sentence', '</b>', '']

You can then check whether each piece of the split string starts with < and ends with >, and convert only the pieces that do not:

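for i, word in enumerate(split_str):
    # A piece is a tag only if it both starts with '<' and ends with '>'.
    is_tag = word.startswith('<') and word.endswith('>')
    if word.strip() and not is_tag:
        # Uppercase the first non-space character; '.' pieces are unchanged by upper().
        lead = len(word) - len(word.lstrip())
        split_str[i] = word[:lead] + word[lead].upper() + word[lead + 1:]

# Join with an empty string so no extra spaces are introduced.
S = ''.join(split_str)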
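The loop above treats every text piece as the start of a sentence, so text that follows an inline tag in the middle of a sentence would be capitalized too. A possible refinement, sketched here with a hypothetical helper `sentence_case_regex`, keeps a flag that is set only at the beginning of the input and after a '.' separator. It assumes the same constrained input as the split pattern above: simple tags without attributes, and '.' as the only sentence terminator.

```python
import re

# Same pattern as above: simple tags without attributes, or a literal '.'.
TOKEN = re.compile(r'(</?\w+>|\.)')


def sentence_case_regex(html):
    # With one capturing group, re.split alternates text (even indices)
    # and separators (odd indices).
    pieces = TOKEN.split(html)
    at_start = True  # the next non-empty text piece begins a sentence
    for i, piece in enumerate(pieces):
        if piece == '.':
            at_start = True
        elif i % 2 == 0 and piece.strip() and at_start:
            lead = len(piece) - len(piece.lstrip())
            pieces[i] = piece[:lead] + piece[lead].upper() + piece[lead + 1:]
            at_start = False
    return ''.join(pieces)


print(sentence_case_regex("<p>this is <b>bold</b> text. next</p>"))
# prints: <p>This is <b>bold</b> text. Next</p>
```

Tags with attributes, or a '.' inside an attribute value, are outside what this pattern handles, which is the fragility mentioned above.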
