
How can I remove specific value pairs from an inline style tag using Python?

I am trying to parse some HTML that has some nasty inline styling. It looks something like this:

<span class="text_line" data-complex="0" data-endposition="4:2:86:5:0" data-position="4:2:74:2:0" style="font-family: scala-sans-offc-pro--; width: 100%; word-spacing: -2.66667px; font-size: 24px !important; line-height: 40px; font-variant-ligatures: common-ligatures; display: block; height: 40px; margin-left: 75px; margin-right: 155px;">

I am trying to remove just the property-value pair word-spacing: -2.66667px; from the style attribute. Here is the catch: there are several hundred of these lines and no two are the same. Sometimes the spacing is word-spacing: -4px, sometimes word-spacing: -3.78632px; or some other random number.

I tried Beautiful Soup and figured out how to remove the whole tag, which is not what I want. I don't know how to do it with regular expressions, and I have read that it's better to avoid editing HTML with regex.

My idea right now is to save all the span tags to a variable using Beautiful Soup, then use string.find() to get the index of every "w" in word-spacing, and then find the next semicolon. Once I have that list, I would find a way to cut the string at those indexes and join the remnants back together. Maybe splitting at the ";" is better... I don't know any more at this point. My brain is fried and tired. :P

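class CutPointIndex:
    """Start and end positions of one word-spacing declaration."""
    def __init__(self, first_index, last_index):
        self.first = first_index
        self.last = last_index

def getIndices(text, start_index, end_index=None):
    # Default to searching up to the end of the text.
    if end_index is None:
        end_index = len(text)
    index = CutPointIndex(None, None)
    index.first = text.find("word-spacing", start_index, end_index)
    if index.first != -1:
        index.last = text.find(";", index.first, end_index)
    return index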
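
The find-and-cut idea described above could be sketched roughly as follows. This is only an illustration, not a tested solution: remove_declaration is a hypothetical helper name, and it assumes it is handed the value of the style attribute alone (not the whole tag). Because str.find matches plain text, it would also cut at "word-spacing" if that text ever appeared inside another property's name or value.

```python
def remove_declaration(style, prop="word-spacing"):
    """Cut every `prop: value;` declaration out of a style string."""
    start = style.find(prop)
    while start != -1:
        end = style.find(";", start)
        # No semicolon after the property means it is the last declaration.
        end = len(style) if end == -1 else end + 1
        # Join the text before and after the cut with a single space.
        style = style[:start].rstrip() + " " + style[end:].lstrip()
        start = style.find(prop)
    return style.strip()


style = "font-family: scala-sans-offc-pro--; width: 100%; word-spacing: -3.71429px;"
print(remove_declaration(style))
```

The loop searches again from the start after each cut, so a style string that repeats the property is handled as well.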

Given something like style="font-family: scala-sans-offc-pro--; width: 100%; word-spacing: -3.71429px;"

or style="font-family: scala-sans-offc-pro--; width: 100%; word-spacing: -5px;"

or any other variation of values, the expected outcome should be style="font-family: scala-sans-offc-pro--; width: 100%;"

I'm guessing that you might want to use re.sub to strip the variable word-spacing declaration:

import re

regex = r"\s*word-spacing\s*:\s*[^;]*;"

test_str = '''
style="font-family: scala-sans-offc-pro--; width: 100%; word-spacing: -3.71429px;"
style="font-family: scala-sans-offc-pro--; width: 100%; word-spacing: -5px;"
style="font-family: scala-sans-offc-pro--; width: 100%;"

'''

print(re.sub(regex, "", test_str))

Output

style="font-family: scala-sans-offc-pro--; width: 100%;"
style="font-family: scala-sans-offc-pro--; width: 100%;"
style="font-family: scala-sans-offc-pro--; width: 100%;"

If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch there how it matches against some sample inputs.
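
One caveat with the pattern above: it needs a semicolon to end the match. If word-spacing is the last declaration and has no trailing semicolon, [^;]* keeps going past the closing quote until the next semicolon anywhere in the text. A possible variant, sketched here under the assumption that it is applied to the style value only, also accepts the end of the string as a terminator. WORD_SPACING and strip_word_spacing are illustrative names, not part of the answer above.

```python
import re

# One word-spacing declaration, ended by ';' or by the end of the string.
# (?<![\w-]) skips longer property names that merely end in "word-spacing".
WORD_SPACING = re.compile(r"(?<![\w-])word-spacing\s*:[^;]*(?:;\s*|$)")


def strip_word_spacing(style):
    """Return the style value without any word-spacing declaration."""
    return WORD_SPACING.sub("", style).strip()


print(strip_word_spacing("width: 100%; word-spacing: -2.66667px; font-size: 24px;"))
print(strip_word_spacing("width: 100%; word-spacing: -4px"))
```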


You could match on elements with that attribute and remove that part.

I split the style attribute (for the relevant tags only) on ; and then recombine, excluding the unwanted pair:

';'.join([i for i in t['style'].split(';') if 'word-spacing' not in i])

but you could just as easily update the value for word-spacing.

from bs4 import BeautifulSoup as bs

html = '''
<span class="text_line" data-complex="0" data-endposition="4:2:86:5:0" data-position="4:2:74:2:0" style="font-family: scala-sans-offc-pro--; width: 100%; word-spacing: -2.66667px; font-size: 24px !important; line-height: 40px; font-variant-ligatures: common-ligatures; display: block; height: 40px; margin-left: 75px; margin-right: 155px;">
'''
soup = bs(html, 'lxml')

for t in soup.select('[style*= word-spacing]'):
    t['style'] = ';'.join([i for i in t['style'].split(';') if 'word-spacing' not in i])
print(soup)
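
Updating the value rather than removing the pair might look something like the sketch below. set_declaration is a hypothetical helper, not part of Beautiful Soup, and it works on the style string only. Like the one-liner above, it assumes no value contains a semicolon of its own.

```python
def set_declaration(style, prop, value):
    """Return `style` with every `prop` declaration set to `value`."""
    updated = []
    for declaration in style.split(";"):
        name, sep, _ = declaration.partition(":")
        # Compare the exact property name, not just a substring.
        if sep and name.strip() == prop:
            declaration = f" {prop}: {value}"
        updated.append(declaration)
    return ";".join(updated).strip()


style = "width: 100%; word-spacing: -2.66667px; font-size: 24px;"
print(set_declaration(style, "word-spacing", "normal"))
```

Inside the loop above, this could be used as t['style'] = set_declaration(t['style'], 'word-spacing', 'normal').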

Reading:

  1. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
  2. https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors
