I'm going through a text and need to replace a bunch of CIDs (characters that were not readable when I scraped them). I need to replace every "cid:###" with the correct character. The issue I'm running into is that some CIDs are wrapped in <s></s>, and there is no space between <s>(cid:131)</s> and the next word.
When I try to replace <s>(cid:131)</s> with ▪, nothing changes. When I replace cid:131 with ▪, I get <s>▪</s>. I want to get rid of the <s></s> only in this specific case (<s></s> appears elsewhere in the document and I don't want to replace those).
Doesn't change anything:
csv_of_table = csv_of_table.replace('<s>(cid:131)</s>', '▪', regex=True)
Only changes the part with cid:131:
csv_of_table = csv_of_table.replace('cid:131', '▪', regex=True)
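You can use the ? quantifier to mark a group as optional, meaning it can appear zero or one time. Your first attempt matched nothing because, with regex=True, the unescaped parentheses are treated as a capture group rather than literal characters, so they have to be escaped as \( and \). Note that \d+ matches any CID number, so this replaces every CID with ▪; put 131 in place of \d+ to target only that one.
csv_of_table = csv_of_table.replace(r"(<s>\()?cid:\d+(\)</s>)?", "▪", regex=True)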
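Since the question asks for each CID to get its own character, here is a minimal sketch of a mapping-based variant. It uses the standard re module on a plain string rather than pandas, and the names CID_TO_CHAR, CID_PATTERN and replace_cids are hypothetical, as is the sample text. Only the 131 to ▪ pair comes from the question; any other entries would have to be filled in by hand.

```python
import re

# Hypothetical mapping from CID number to the intended character.
# Only 131 -> "▪" is taken from the question.
CID_TO_CHAR = {"131": "▪"}

# Match "(cid:N)" either wrapped in <s>...</s> or on its own.
# The wrapped alternative comes first so the tags are consumed with it.
CID_PATTERN = re.compile(r"<s>\(cid:(\d+)\)</s>|\(cid:(\d+)\)")


def replace_cids(text: str) -> str:
    """Replace known CIDs with their character; leave everything else alone."""

    def substitute(match: re.Match) -> str:
        number = match.group(1) or match.group(2)
        # Unknown CIDs are returned unchanged so nothing is lost silently.
        return CID_TO_CHAR.get(number, match.group(0))

    return CID_PATTERN.sub(substitute, text)


sample = "<s>(cid:131)</s>Item one <s>struck</s> (cid:131)two (cid:999)"
print(replace_cids(sample))
# ▪Item one <s>struck</s> ▪two (cid:999)
```

Other <s></s> pairs are left alone because the pattern only consumes the tags when they directly enclose a CID. If csv_of_table is a pandas Series, the same pattern and callable can be passed to Series.str.replace with regex=True.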