Python pandas DataFrame: one column contains HTML special characters such as &amp; and &lt;. Is there a way to remove them?
I think the following would work with only one pass through the text:
re.sub("&[a-zA-Z]+?;","",corpus_of_text)
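To see what that pattern does and does not catch, here is a minimal sketch on a made-up string. The sample text and the wider ENTITY pattern are assumptions for illustration, not part of the original answer. Note that deleting entities (rather than decoding them) can leave doubled spaces behind.

```python
import re

# Sample text is made up for illustration.
corpus_of_text = "Fish &amp; chips &lt;b&gt;cost&nbsp;&#163;5"

# Same pattern as above: "&", one or more ASCII letters, then ";".
# Numeric entities such as &#163; contain "#" and digits, so they survive.
named_only = re.sub(r"&[a-zA-Z]+?;", "", corpus_of_text)
print(named_only)  # Fish  chips bcost&#163;5

# Hypothetical wider pattern that also covers decimal (&#163;) and
# hexadecimal (&#xA3;) numeric entities.
ENTITY = re.compile(r"&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);")
all_entities = ENTITY.sub("", corpus_of_text)
print(all_entities)  # Fish  chips bcost5
```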
In a DataFrame, I think it's just:
cleaned_values = df['column2'].str.replace(re.compile("&[a-zA-Z]+?;"), "", regex=True)
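A small sketch of the column-wise version on a made-up DataFrame follows. The sample rows and the "cleaned" column name are assumptions. It passes regex=True explicitly, since recent pandas versions otherwise treat the pattern as a literal string.

```python
import re

import pandas as pd

# Made-up sample data; "column2" matches the column name used above.
df = pd.DataFrame({"column2": ["Tom &amp; Jerry", "a &lt; b", "plain text"]})

# regex=True is passed explicitly so the pattern is applied as a regular expression.
entity = re.compile(r"&[a-zA-Z]+?;")
df["cleaned"] = df["column2"].str.replace(entity, "", regex=True)

print(df["cleaned"].tolist())  # ['Tom  Jerry', 'a  b', 'plain text']
```

Rows without any entity pass through unchanged, so the call is safe to run over a mixed column.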
I found this gist, https://gist.github.com/codeboy/5487eeb1c551d59e2366, which does slightly more than you're asking, so I modified it to this:
import re

def parse_text(text, patterns=None):
    """
    modified from above github gist
    delete the HTML entities matched by the given patterns
    :param text (str): given text
    :param patterns (dict): patterns for re.sub
    :return str: final text
    """
    # Maps each regex to its replacement; an empty string deletes the match.
    base_patterns = {"&[rl]dquo;": "",
                     "&[rl]squo;": "",
                     "&nbsp;": "",
                     "&amp;": ""}
    # A caller-supplied dict replaces the defaults entirely.
    patterns = patterns or base_patterns
    final_text = text
    for pattern, repl in patterns.items():
        final_text = re.sub(pattern, repl, final_text)
    return final_text
You can call it like this, assigning to a new column so you can compare the result to the original string:
df["column3"] = df["column2"].apply(parse_text)
Please note that the patterns variable is probably not complete, and you may have to augment it based on what you have in your escaped HTML.
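As a rough illustration of augmenting the patterns, the sketch below passes a larger dict to a compact stand-in for parse_text. The default entity names, the extra entries in my_patterns, and the sample string are all assumptions. Because of the "patterns or base_patterns" line, a supplied dict replaces the defaults rather than extending them, so the full set has to be listed.

```python
import re


def parse_text(text, patterns=None):
    """Compact stand-in for the function above: apply each regex/replacement pair."""
    # Assumed defaults; the entity names here are a guess at the intended ones.
    base_patterns = {"&[rl]dquo;": "", "&[rl]squo;": "", "&nbsp;": "", "&amp;": ""}
    patterns = patterns or base_patterns
    for pattern, repl in patterns.items():
        text = re.sub(pattern, repl, text)
    return text


# Hypothetical extended set: the four defaults plus a few more entities.
my_patterns = {
    "&[rl]dquo;": "",
    "&[rl]squo;": "",
    "&nbsp;": "",
    "&amp;": "",
    "&[lg]t;": "",    # &lt; and &gt;
    "&quot;": "",
    "&#[0-9]+;": "",  # decimal numeric entities such as &#39;
}

sample = "&ldquo;a &lt; b&rdquo; &amp; it&#39;s fine"
default_result = parse_text(sample)
extended_result = parse_text(sample, patterns=my_patterns)
print(default_result)   # a &lt; b  it&#39;s fine
print(extended_result)  # a  b  its fine
```

With a DataFrame, the extra argument can be forwarded through apply as a keyword, for example df["column2"].apply(parse_text, patterns=my_patterns).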