Here is my HTML file:
<head>
<title>Reading from text files</title>
</head>
<body>
<h3>Starting space</h3>
<ul>
<li></li>
</ul>
<h3>ending space</h3>
<ul>
</body>
</html>
I want to edit this HTML file using Tcl and regular expressions, but only at a specific location: between Starting space and ending space. Between these two points, I want to add various list items, for example:
<li> First </li>
and so on. I wrote a Tcl script to open this file and tried to print the data between these two locations so that I can edit it later, but I am not able to do that. Can you please point out where I am going wrong?
Tcl script
proc edit_html {release} {
    set fp [open $release r]
    set para [read -nonewline $fp]
    close $fp
    set line_read [regexp -nocase -lineanchor -inline -all -- {^\s*?Starting space\s*?.*?ending space} $para]
    foreach line_read $line_read {
        regexp -nocase -- {^\s*?Starting space\s*?.*?ending space} $line_read - tag value
        puts $value
    }
}
edit_html [lindex $argv 0]
I am not sure where I am going wrong in this regexp. And once I find the location, how should I edit it? Any pointers? For example, should I move the file pointer there?
The first problem with your current code is that you are not modifying anything. regexp is used to read or fetch data, not to make changes. You might want to use regsub instead. If you want to change the original file, and it contains many Starting space and ending space blocks, you might want to combine regsub with a helper proc.
Second, your regex doesn't match. The file does not contain text matching ^\s*?Starting space; the line actually begins with <h3>, so you need ^<h3>Starting space, and there are other parts of that regex you need to edit too.
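As a minimal sketch of a pattern that does match, the following captures everything between the two headings. The sample string in para is an assumption standing in for the file contents, and the variable name between is illustrative only. In Tcl, . matches newlines by default, so no extra switch is needed to span several lines.

```tcl
# Sample data standing in for the contents of the HTML file (assumption)
set para "<h3>Starting space</h3>\n<ul>\n<li></li>\n</ul>\n<h3>ending space</h3>"

# Capture the text between the two headings; -> receives the whole match
if {[regexp -nocase -- {<h3>Starting space</h3>(.*?)<h3>ending space</h3>} $para -> between]} {
    puts [string trim $between]
}
```

This only reads the block; changing it still requires regsub or an index-based replacement.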
I've written up the below proc:
proc edit_html {release} {
    proc re_sub {block} {
        # Get the items to be inserted
        global items
        # Get the indentation and put it in $spaces
        regexp -lineanchor -- {^(\s*)<li>} $block - spaces
        set html_items [list]
        foreach item $items {
            lappend html_items "<li>$item</li>"
        }
        # Create the list of items in HTML form with indentation
        set html_items [join $html_items "\n$spaces"]
        regsub -lineanchor -- {<li>\s*</li>} $block $html_items result
        return $result
    }
    set fp [open $release r]
    set para [read -nonewline $fp]
    close $fp
    # The command to be executed
    set cmd {[re_sub "\0"]}
    # The substitution; -lineanchor lets ^ match at the start of any line,
    # not only at the start of the file
    set result [subst [regsub -all -lineanchor -- {^<h3>Starting space</h3>\s*?.*?\s*?<h3>ending space} $para $cmd]]
    return $result
}
# The items to insert
set items [list First Second Third]
# Print the edited HTML
puts [edit_html [lindex $argv 0]]
With a list named items containing First Second Third, you get this as output:
<head>
<title>Reading from text files</title>
</head>
<body>
<h3>Starting space</h3>
<ul>
<li>First</li>
<li>Second</li>
<li>Third</li>
</ul>
<h3>ending space</h3>
<ul>
</body>
</html>
To edit a text file, you need to load it into memory and then write it out again afterwards; you can't stream the changes back into the same file while you are reading it. Where there is an easy way to select the text to be replaced directly, you can use regsub for the core of it, but that is awkward here because you are matching on the text on either side of the area to replace. Thus, for the sort of edit you are looking at, what you need is the index into the string (i.e., the content of the file) of the first character to be replaced, and the index of the last character to be replaced.
Fortunately, getting the indices is easy. Either you use regexp -indices or you use string first / string last.
# Read the file; standard stanza
set f [open $theFilename]
set data [read $f]
close $f
# Find the markers
regexp -indices {<h3>Starting space</h3>\n<ul>\n} $data start
regexp -indices {\n</ul>\n<h3>ending space</h3>} $data end
# We now need to offset the ends by one in each direction (we want stuff between)
set start [expr {[lindex $start 1] + 1}]
set end [expr {[lindex $end 0] - 1}]
# Now we can generate the replacement...
set replacement ""
foreach item ... {
    append replacement "<li>...</li>\n"
}
# ... and insert it
set data [string replace $data $start $end $replacement]
# ... and write it out (without the extra newline; we've enough already)
set f [open $theFilename "w"]
puts -nonewline $f $data
close $f
Alternatively, you could instead do the replacement as you write things back to the file.
# Read the file; standard stanza
set f [open $theFilename]
set data [read $f]
close $f
# Find the markers
regexp -indices {<h3>Starting space</h3>\n<ul>\n} $data start
regexp -indices {\n</ul>\n<h3>ending space</h3>} $data end
# Generate the replacement text
set replacement ""
foreach item ... {
    append replacement "<li>...</li>\n"
}
# Write everything out
set f [open $theFilename "w"]
puts -nonewline $f [string range $data 0 [lindex $start 1]]
puts -nonewline $f $replacement
puts -nonewline $f [string range $data [lindex $end 0] end]
close $f
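The string first variant mentioned above can be sketched as follows. This is only an illustration working on an in-memory string: the proc name insert_items and its parameters are hypothetical, and the marker strings assume the exact line layout of the file in the question.

```tcl
# Hypothetical helper: replace whatever sits between the two markers
# with one <li> line per item. Returns the data unchanged if a marker
# is missing.
proc insert_items {data items} {
    set startMarker "<h3>Starting space</h3>\n<ul>\n"
    set endMarker "</ul>\n<h3>ending space</h3>"
    # Index of the first character after the start marker
    set s [string first $startMarker $data]
    if {$s < 0} { return $data }
    set from [expr {$s + [string length $startMarker]}]
    # Index of the first character of the end marker, searching from there
    set to [string first $endMarker $data $from]
    if {$to < 0} { return $data }
    set replacement ""
    foreach item $items {
        append replacement "<li>$item</li>\n"
    }
    return [string range $data 0 [expr {$from - 1}]]$replacement[string range $data $to end]
}

set html "<h3>Starting space</h3>\n<ul>\n<li></li>\n</ul>\n<h3>ending space</h3>\n"
puts -nonewline [insert_items $html {First Second}]
```

Because the result is built from two string range calls rather than string replace, this sketch also handles a list that is empty to begin with.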
You've received many good answers already; I'd just like to point out that parsing HTML with regular expressions can be tricky and error-prone. The tDOM package makes this a breeze, however.
You do need well-formed HTML (it doesn't have to be XHTML-level well-formed, though), so I'll add a starting tag for the html element. I'll also remove the empty li element inside the relevant ul, not because tDOM needs that but because it makes my solution a bit simpler:
<html>
<head>
<title>Reading from text files</title>
</head>
<body>
<h3>Starting space</h3>
<ul>
</ul>
<h3>ending space</h3>
<ul>
</ul>
</body>
</html>
Put this in a variable any way you prefer, e.g., by reading it from a file:
set f [open foo.html] ; set html [read -nonewline $f] ; close $f
Create a document object and find the root node:
set doc [dom parse -html $html]
set root [$doc documentElement]
Find the node that you want to insert into: it's the first ul element that follows an h3 element that has a text node with the value "Starting space".
set xpath {//h3[contains(text(), 'Starting space')]/following-sibling::ul[1]}
lassign [$root selectNodes $xpath] node
Insert the items into this node. It's probably best to have a command for that:
proc addItem {doc node txt} {
    set li [$doc createElement li]
    $li appendChild [$doc createTextNode $txt]
    $node appendChild $li
}
Now do it:
addItem $doc $node "First item"
Write back the changed document to the original file or to another file:
set f [open bar.html w] ; $root asHTML -channel $f ; close $f
(Note that asHTML does not prettify or preserve the formatting of the original HTML.)
Finally clean up data structures and commands created by deleting the document object:
$doc delete
Aside:
If you are allowed to change the structure of the original HTML, you can make this a bit easier and safer by adding an id attribute to the element you want to insert into. If your HTML has this:
<ul id="insertitemshere">
the xpath becomes something like
set xpath {//ul[@id='insertitemshere']}
The tDOM package is documented here: http://tdom.github.io/ . It's included in the ActiveState Tcl distribution and is documented there as well.
Here is my solution:
proc edit_html {release} {
    set f [open $release]
    while {[gets $f line] != -1} {
        if {[string match "*Starting space*" $line]} {
            puts "FANCY LIST"; # Replace with your fancy list
            # Skip to the ending space; stop at end of file if it is missing
            while {![string match "*ending space*" $line]} {
                if {[gets $f line] == -1} break
            }
        } else {
            puts $line
        }
    }
    close $f
}
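I am writing the output to the console, but you can choose to write it out to a file. Note that the two marker lines themselves are skipped along with everything between them, so the fancy list needs to include the headings.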
If you want to add a list of items to a Tcl template, a more idiomatic way to do it is to create the template as a string and use Tcl's substitution mechanism to populate it.
set template {
<head>
<title>Reading from text files</title>
</head>
<body>
<h3>Starting space</h3>
<ul>
[get_listitems]
</ul>
<h3>ending space</h3>
<ul>
</body>
</html>
}
set items {First Second Third}
proc get_listitems {} {
    global items
    set s ""
    foreach i $items {
        append s "<li>$i</li>"
    }
    return $s
}
# subst only returns the populated template, so print it
puts [subst $template]
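One thing to watch with this approach is that plain subst also expands $ and backslash sequences anywhere in the template. A minimal sketch of one way to limit that, assuming the template only needs command substitution, is to pass the -novariables and -nobackslashes switches. The shortened template and the "Price" line below are made up for illustration.

```tcl
set items {First Second Third}

# Build one <li> line per item
proc get_listitems {} {
    global items
    set lines {}
    foreach i $items {
        lappend lines "<li>$i</li>"
    }
    return [join $lines "\n"]
}

# Illustrative template containing a literal dollar sign (assumption)
set template {<ul>
[get_listitems]
</ul>
<p>Price: $5</p>}

# Only [command] substitutions are performed; $5 is left untouched
set page [subst -novariables -nobackslashes $template]
puts $page
```

Without -novariables, subst would try to read a variable named 5 and fail. Literal square brackets in the template would still be treated as commands, so they need to be escaped or avoided.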