Python Pandas DataFrame: one column contains HTML special characters such as &amp; and &lt;. Is there a way to remove them?

Question

I am only showing an example here. Is there a way to remove all of the special characters (e.g., not just the "&amp;" and "&lt;" shown)?

Answer 1

I think the following would work with only one pass through the text:

re.sub("&[a-zA-Z]+?;","",corpus_of_text)

In a DataFrame, I think it's just:

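# regex=True is needed on recent pandas, where str.replace treats the pattern literally by default
cleaned_values = df['column2'].str.replace(r"&[a-zA-Z]+?;", "", regex=True)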
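To make the one-pass idea concrete, here is a minimal sketch using only the standard library. The sample strings are made up, and the widened pattern (which also matches numeric entities such as `&#39;`, something the pattern above does not cover) is my own assumption about what the data might contain, not part of the original answer.

```python
import re

# Named entities (&amp;), decimal (&#39;) and hexadecimal (&#x27;) numeric entities.
# This pattern is an illustrative assumption; adjust it to your own data.
ENTITY_RE = re.compile(r"&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);")


def strip_entities(text):
    """Delete every HTML entity from text in a single pass."""
    return ENTITY_RE.sub("", text)


# Hypothetical sample values standing in for the contents of df['column2'].
samples = ["Tom &amp; Jerry", "a &lt; b", "it&#39;s &nbsp;fine"]
cleaned = [strip_entities(s) for s in samples]
print(cleaned)

# With pandas, the same pattern should work column-wise, for example:
#   df["column2"].str.replace(ENTITY_RE.pattern, "", regex=True)
```

Note that deleting an entity leaves the surrounding whitespace untouched, so "Tom &amp; Jerry" ends up with two spaces in the middle; collapse whitespace afterwards if that matters.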

Answer 2

Found this gist, https://gist.github.com/codeboy/5487eeb1c551d59e2366, which does slightly more than you're asking, so I modified it to this:

import re

def parse_text(text, patterns=None):
    """
    modified from above github gist
    delete all HTML entities
    :param text (str): given text
    :param patterns (dict): patterns for re.sub
    :return str: final text
    """
    base_patterns = {"&[rl]dquo;": "",
                     "&[rl]squo;": "",
                     "&nbsp;": "",
                     "&amp;": ""}
    patterns = patterns or base_patterns

    final_text = text
    for pattern, repl in patterns.items():
        final_text = re.sub(pattern, repl, final_text)
    return final_text

You can call it like this, assigning to a new column so you can compare the result to the original string:

df["column3"] = df["column2"].apply(parse_text)

Please note that the default patterns are probably not complete, and you may have to augment them based on what you have in your escaped HTML.
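As a rough sketch of what augmenting the patterns could look like: passing a `patterns` dict replaces the defaults rather than extending them, so a custom dict has to repeat the defaults it still needs. The function below is a compact copy of `parse_text` so the snippet runs on its own; `custom_patterns` and the sample string are hypothetical.

```python
import re


def parse_text(text, patterns=None):
    """Compact copy of the function above, repeated so this snippet is self-contained."""
    base_patterns = {"&[rl]dquo;": "", "&[rl]squo;": "", "&nbsp;": "", "&amp;": ""}
    patterns = patterns or base_patterns
    for pattern, repl in patterns.items():
        text = re.sub(pattern, repl, text)
    return text


# Hypothetical augmented dict: the four defaults plus a few extra entities.
custom_patterns = {
    "&[rl]dquo;": "",
    "&[rl]squo;": "",
    "&nbsp;": "",
    "&amp;": "",
    "&lt;": "",
    "&gt;": "",
    "&quot;": "",
    "&#[0-9]+;": "",  # decimal numeric entities such as &#39;
}

sample = "&ldquo;a &lt; b&rdquo; &amp; c&#39;s"
default_result = parse_text(sample)                  # only the default entities removed
custom_result = parse_text(sample, custom_patterns)  # the extra entities removed too
print(default_result)
print(custom_result)

# With pandas, extra keyword arguments to Series.apply are forwarded to the function:
#   df["column3"] = df["column2"].apply(parse_text, patterns=custom_patterns)
```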

2 solutions

Solution 1
1, accepted, 2022-01-27 02:59:35

Solution 2
0, 2022-01-27 02:52:29
