Need help creating a regex or script to run on an HTML file

I have an index that I am working on, and it is a real hassle to go in by hand and cross-link everything. I know a little about regular expressions and a little Perl. Here is what the HTML looks like:

cf. <i>Penitencia y Reconciliaci&oacute;n</i>

But sometimes there is an instance like this:

cf. <i>Advenimiento, Consumaci&oacute;n, Expectaci&oacute;n</i>;

I ran this regex on it:

cf\. <i>([^,]+,)</i>

My goal is a regex that wraps around one or more words, copies the inner HTML of the "phrase", and pastes it inside an anchor tag, something like this:

cf. <i><a href="#Penitencia y Reconciliaci&oacute;n">Penitencia y Reconciliaci&oacute;n</a></i>

which I was able to accomplish with the regex above. The problem is that my regex does not take into account that there might be two or more "phrases" it needs to wrap. So my goal is to end up with this:

cf. <i><a href="#Advenimiento">Advenimiento</a>, <a href="#Consumaci&oacute;n">Consumaci&oacute;n</a>, <a href="#Expectaci&oacute;n">Expectaci&oacute;n</a></i>;

Any help would be really appreciated.

If you were writing a program to automate this, the better, harder, faster, stronger solution would be (I agree with the comment on the question) to use the DOM to look up the tags, get their values, then modify and rewrite them. I'm assuming from your specific example that this is a one-off find-and-replace, or something you don't mind running manually every once in a while.

A Perl s///-expression (written here with ! as the delimiter, so s!!!), which was only tested in an emulator:

s!(?<=,)(\s?)([^<,]+)(?=,|</i>)|(?<=<i>)([^<,]+)(?=,|</i>)!$1<a href="#$2$3">$2$3</a>!gi

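Bear in mind that, as written, this expects the items to sit inside <i> tags and is not tolerant of other tags in between them. The first alternative is also anchored only on commas, so it can link any text that sits between two commas, even outside <i> tags. These are just a few of the reasons you should not put this into program code.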

The expression turns this HTML:

Parte del texto inicial. <i>Penitencia y Reconciliaci&oacute;n</i> 
<i>Advenimiento, Consumaci&oacute;n, Expectaci&oacute;n</i>; Otro texto que <em>no es especial</em> ... <i>Otra etiqueta que debe estar vinculada</i>
Otra l&iacute;nea <i>con un enlace</i> y un texto m&aacute;s.

into this text:

Parte del texto inicial. <i><a href="#Penitencia y Reconciliaci&oacute;n">Penitencia y Reconciliaci&oacute;n</a></i> 
<i><a href="#Advenimiento">Advenimiento</a>, <a href="#Consumaci&oacute;n">Consumaci&oacute;n</a>, <a href="#Expectaci&oacute;n">Expectaci&oacute;n</a></i>; Otro texto que <em>no es especial</em> ... <i><a href="#Otra etiqueta que debe estar vinculada">Otra etiqueta que debe estar vinculada</a></i>
Otra l&iacute;nea <i><a href="#con un enlace">con un enlace</a></i> y un texto m&aacute;s.
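
If the links should only be added to the cf. references from the question, a two-step substitution may be easier to reason about than a single expression: first match each cf. <i>...</i> span, then wrap every comma-separated phrase inside it. The following is a minimal sketch rather than a tested production script; the sub names link_phrases and link_cf_refs are made up for illustration, and it assumes the <i> element contains only text and entities, with no nested tags.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each comma-separated phrase in an anchor.
# A phrase starts at a non-space, non-comma character and runs up to the
# next comma or the end of the string, excluding trailing whitespace.
sub link_phrases {
    my ($inner) = @_;
    $inner =~ s{([^,\s][^,]*?)(?=\s*(?:,|\z))}{<a href="#$1">$1</a>}g;
    return $inner;
}

# Hypothetical helper: rewrite only "cf. <i>...</i>" spans.
# Assumes the <i> element holds plain text and entities, no nested tags.
sub link_cf_refs {
    my ($html) = @_;
    $html =~ s{(cf\.\s*<i>)([^<]+)(</i>)}{
        my ($open, $inner, $close) = ($1, $2, $3);
        $open . link_phrases($inner) . $close;
    }ge;
    return $html;
}

# Demo on the two examples from the question.
print link_cf_refs('cf. <i>Penitencia y Reconciliaci&oacute;n</i>'), "\n";
print link_cf_refs('cf. <i>Advenimiento, Consumaci&oacute;n, Expectaci&oacute;n</i>;'), "\n";
```

Because the outer pattern requires the cf. prefix, other <i> elements and ordinary comma-separated text are left alone. To process a whole file, the same sub could be applied to its slurped contents, ideally after making a backup copy.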

As a side note, your question is rather hard to read, and it probably should have been tagged [perl] as well; this probably contributed significantly to it not being answered for a while. But better late than never!
