
Python Pandas Data Frame: one column contains HTML special characters such as & and <. Is there a way to remove them?

Example data frame: [image not available]

I am only showing an example here. Is there a way to remove all of the special characters (i.e. not just the "&" and "<" shown)?

I think the following would work with only one pass through the text:

re.sub("&[a-zA-Z]+?;","",corpus_of_text)
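A minimal sketch of how that pattern behaves, assuming the text may also contain numeric references such as `&#39;` or `&#x27;`, which `&[a-zA-Z]+?;` does not match. The names `ENTITY_RE` and `strip_entities` and the sample string are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer above: matches named entities such as &amp; or &lt; only.
NAMED_ONLY_RE = re.compile(r"&[a-zA-Z]+?;")

# Broader pattern (an assumption about what the data may contain):
# named entities, decimal references (&#39;) and hexadecimal references (&#x27;).
ENTITY_RE = re.compile(r"&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);")


def strip_entities(text):
    """Delete every HTML entity matched by ENTITY_RE from text."""
    return ENTITY_RE.sub("", text)


sample = "Tom &amp; Jerry &lt;3 it&#39;s &#x27;ok&#x27;"

# The named-only pattern leaves the numeric references in place.
named_only = NAMED_ONLY_RE.sub("", sample)
print(named_only)

# The broader pattern removes them as well.
print(strip_entities(sample))
```

Note that deleting an entity outright can leave a double space behind (as between "Tom" and "Jerry" here), so a follow-up whitespace cleanup may be worth considering.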

In a DataFrame, I think it is just:

cleaned_values = df['column2'].str.replace(r"&[a-zA-Z]+?;", "", regex=True)
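As a small end-to-end sketch of the DataFrame version: the column name `column2` comes from the answer, while the sample rows and the `cleaned` column are invented for illustration. Passing `regex=True` explicitly is the safer choice, since recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's data frame.
df = pd.DataFrame({"column2": ["Tom &amp; Jerry", "a &lt; b", "plain text"]})

# Remove named HTML entities; regex=True makes pandas interpret the pattern as a regex.
df["cleaned"] = df["column2"].str.replace(r"&[a-zA-Z]+?;", "", regex=True)

print(df["cleaned"].tolist())
```

Rows without any entity pass through unchanged, so the call should be safe to run on a mixed column of strings.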

I found this gist, https://gist.github.com/codeboy/5487eeb1c551d59e2366, which does slightly more than you're asking, so I modified it to this:

import re

def parse_text(text, patterns=None): 
    """ 
    modified from above github gist
    delete all HTML entities 
    :param text (str): given text 
    :param patterns (dict): patterns for re.sub 
    :return str: final text 
    """ 
    base_patterns = {"&[rl]dquo;": "",
                     "&[rl]squo;": "",
                     "&nbsp;": "",
                     "&amp;": ""}
    patterns = patterns or base_patterns 
     
    final_text = text 
    for pattern, repl in patterns.items(): 
        final_text = re.sub(pattern, repl, final_text) 
    return final_text

You can call it like this, assigning to a new column so you can compare the result to the original string:

df["column3"] = df["column2"].apply(parse_text)

Please note that the patterns variable is probably not complete, and you may have to extend it based on which entities appear in your escaped HTML.
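If keeping the pattern list complete becomes a burden, one alternative (not what the answer above does) is to decode the entities with the standard library's `html.unescape`, which knows the full set of named and numeric references. This is a sketch under the assumption that turning `&amp;` back into `&` is acceptable, rather than deleting it. The helper name `decode_entities` is hypothetical.

```python
import html


def decode_entities(text):
    """Convert HTML entities back to the characters they stand for.

    Non-string values (for example NaN in a pandas column) are returned unchanged.
    """
    if not isinstance(text, str):
        return text
    return html.unescape(text)


print(decode_entities("a &lt; b &amp; c"))
print(decode_entities("&ldquo;quoted&rdquo;"))
```

It can be applied to a column the same way as parse_text, for example df["column3"] = df["column2"].apply(decode_entities).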
