Pandas replace not replacing the whole string

Question

I'm going through a text and need to replace a bunch of CIDs (characters that were not readable when I scraped them). Every "cid:###" has to be replaced with the correct character. The issue I'm running into is that some CIDs are wrapped in <s></s>, and there is no space between <s>(cid:131)</s> and the next word.

When I use replace to turn <s>(cid:131)</s> into ▪, nothing happens. When I replace cid:131 with ▪, I get <s>▪</s>. I want to get rid of the <s></s> for this specific case only (<s></s> appears in other places in the document and I don't want to touch those).

Doesn't change anything:

csv_of_table = csv_of_table.replace('<s>(cid:131)</s>', '▪', regex=True)

Only changes the part with cid:131:

csv_of_table = csv_of_table.replace('cid:131', '▪', regex=True)

Answer 1

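You can use the ? quantifier to mark a group as optional, meaning it may appear zero or one time. With regex=True, unescaped parentheses are grouping characters, so the literal ( and ) around the CID have to be escaped as \( and \); that is why your first attempt matched nothing. Note that \d+ below matches any CID number, so put 131 in its place if only that one should become ▪.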

csv_of_table = csv_of_table.replace(r"(<s>\()?cid:\d+(\)</s>)?", "▪", regex=True)
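
If only cid:131 should be touched, a narrower variant of the same idea might look like the sketch below. The sample DataFrame and its "text" column are made up for illustration, and it assumes the CID always appears with its surrounding parentheses, as in the question.

```python
import pandas as pd

# Hypothetical sample data; the real csv_of_table comes from the scraped document.
csv_of_table = pd.DataFrame({
    "text": [
        "<s>(cid:131)</s>First item",     # CID wrapped in <s></s>, no space after it
        "(cid:131) Second item",          # bare CID
        "<s>struck out</s> stays as is",  # unrelated <s></s> that must survive
    ]
})

# (?:<s>)? and (?:</s>)? make the wrapper tags optional (zero or one occurrence).
# \( and \) match the literal parentheses instead of opening a regex group.
pattern = r"(?:<s>)?\(cid:131\)(?:</s>)?"

cleaned = csv_of_table.replace(pattern, "▪", regex=True)
result = cleaned["text"].tolist()
print(result)
```

Because the <s> and </s> parts are only consumed when they sit directly around (cid:131), the other <s></s> tags in the document are left alone.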

1 answer

solution1
1 ACCEPTED 2020-02-27 18:37:24
