
Pandas replace not replacing the whole string

I'm going through a text and need to replace a bunch of CIDs (characters that were not readable when I scraped them). Every "cid:###" has to be replaced with the correct character. The issue I'm running into is that some CIDs are wrapped in <s></s>, and there is no space between <s>(cid:131)</s> and the next word.

When I use replace to turn <s>(cid:131)</s> into ▪, nothing happens. When I replace cid:131 with ▪, I get <s>▪</s>. I'm trying to get rid of the <s></s> for this specific case only (<s></s> is found in other places in the document, and I don't want to replace those).

Doesn't change anything:

csv_of_table = csv_of_table.replace('<s>(cid:131)</s>', '▪', regex=True)

Only changes the part with cid:131:

csv_of_table = csv_of_table.replace('cid:131', '▪', regex=True)

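Your first pattern fails because, with regex=True, unescaped parentheses are treated as a regex group rather than as literal characters, so they need to be escaped. You can then use the ? quantifier to signify that a group is optional (it can appear 0 or 1 times). Note that \d+ matches any CID number; write 131 instead if only that one should become ▪.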

csv_of_table = csv_of_table.replace(r"(<s>\()?cid:\d+(\)</s>)?", "▪", regex=True)
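
To see the difference between the attempts, here is a minimal sketch on made-up data. The DataFrame `df`, its `text` column, and the sample strings are assumptions standing in for `csv_of_table`, and the pattern is narrowed to cid:131 so that other CIDs and other <s></s> tags are left alone.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for csv_of_table.
df = pd.DataFrame(
    {"text": ["<s>(cid:131)</s>First item", "(cid:131) second", "<s>keep me</s>"]}
)

# Unescaped parentheses form a regex group, so this pattern looks for
# "<s>cid:131</s>" (without literal parentheses) and matches nothing.
unchanged = df.replace("<s>(cid:131)</s>", "▪", regex=True)

# re.escape makes every special character match itself literally.
escaped = df.replace(re.escape("<s>(cid:131)</s>"), "▪", regex=True)

# Optional non-capturing groups strip the <s></s> wrapper when it is present
# and still replace a bare "(cid:131)" when it is not.
pattern = r"(?:<s>)?\(cid:131\)(?:</s>)?"
cleaned = df.replace(pattern, "▪", regex=True)

print(cleaned["text"].tolist())
```

If several CIDs map to different characters, the same pattern could be built once per CID number in a loop; that part is left out here because the mapping is not given in the question.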
