Unable to force the removal of a directory

I'm using the Info-ZIP utilities in a Ruby script on Windows 10 to unzip an archive, edit the contents, and rezip it. The script is meant to iterate over a batch of archives, and delete the temporary folder that is created when extracting the contents. The folder is not being deleted, though. For example:

archives.each { |archive|
    system("unzip.exe -o archive -d temp")
    [...]
    system("zip.exe -X0q archive .")
    FileUtils.rm_rf "temp"
}

This has always worked just fine on a Mac (using the same script, in conjunction with the zip/unzip commands), but on Windows I cannot get the temporary folder to be deleted. The unzipping and zipping process works fine, but the "temp" folder will not be deleted, and the unzip utility then throws the same error for every file that exists in the folder: error: cannot delete old temp/[file].

I've tried using system("del /Q temp"), which throws a Could Not Find: C:\[...]\temp error, even though the directory does exist. I tried system("rmdir /s /q temp"), which throws another error: The process cannot access the file because it is being used by another process. The only "process" using this file is the script itself, though.

Once the script has finished running, if I run FileUtils.rm_rf "temp" afterwards, it works and successfully deletes the directory. However, I need this to happen after each iteration, within the same original script, so that the directory is correctly overwritten and then deleted at the end of the run, without any error or warning in Command Prompt.

Is there any other way to forcibly delete this folder?

Update: After doing a lot more testing of different parts of the script, I was able to locate the exact source of the problem. All of the archives contain XHTML files. In some cases the script needs to duplicate an archive and modify the contents of the duplicate; whether or not a duplicate needs to be made depends on the existence of certain markup within an XHTML file. The script uses Nokogiri to parse the content, and it seems that parsing through Nokogiri is what is triggering the issue. To simplify the code:

FileUtils.cp(original_archive, new_archive)
unzip_archive(new_archive) # a function to contain the unzipping steps
Dir.glob("temp/**/*.{html,xhtml}").each { |page|
  contents = Nokogiri::XML(open(page))
}
zip_archive(new_archive)

In this example nothing is actually done with the parsed contents, but just the presence of Nokogiri::XML(open(page)) is enough to trigger the errors. This happens for every page that is opened through Nokogiri. So if I change it to only one page:

contents = Nokogiri::XML(open(Dir.glob("temp/**/one_page.xhtml").first))

then FileUtils.rm_rf 'temp' successfully deletes the files in the temp folder except for one_page.xhtml, which throws the "cannot delete" error.

Is there a way to bypass this issue, such that I can still use Nokogiri in my Ruby script, but not have the script think the Nokogiri "process" is still running? This is specific to Windows, since no such problems were encountered on Macs.

Looking at the code:

Dir.glob("temp/**/*.{html,xhtml}").each { |page|
  contents = Nokogiri::XML(open(page))
}

the problem really looks like you're consuming all the available file handles. This isn't a Nokogiri problem at all; it just happened to be in town when the problem occurred.

OSes have a pool of file handles available; they're not an infinite resource. If a huge number of files are being found, and you iterate over them while leaving every one open, you're consuming the whole pool, which is poor programming.
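
As a rough, untested sketch of what's happening (reusing the temp layout from your script), every handle left open by a bare open() keeps its file locked on Windows until the object is closed or garbage collected:

require 'fileutils'

# A bare open() returns a File object whose OS handle stays open until it is
# explicitly closed or garbage collected. On Windows an open handle also locks
# the file, which is exactly why FileUtils.rm_rf("temp") keeps failing.
leaked = Dir.glob("temp/**/*.{html,xhtml}").map { |page| open(page) }

leaked.each(&:close)      # close the handles...
FileUtils.rm_rf("temp")   # ...and the directory can now be removed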

Using the block form for File.open will work around the problem, but File.read without the block is cleaner, shorter and, in my opinion, a much better way to go.

Dir.glob("temp/**/*.{html,xhtml}").each { |page|
  contents = Nokogiri::XML(File.read(page))
  # do something with contents
}
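
For completeness, the block form mentioned above would look roughly like this; the handle is guaranteed to be closed when the block returns, even if Nokogiri raises:

Dir.glob("temp/**/*.{html,xhtml}").each { |page|
  contents = File.open(page) { |f| Nokogiri::XML(f) }
  # do something with contents
}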

But using Dir.glob is also contributing to this, and another, problem. You're asking the system to search the disk for all matching files, then return them as an array in memory, which is then iterated over. Instead, I highly recommend using Find, which is in Ruby's Standard Library; it behaves much better in that sort of situation.

The Find module supports the top-down traversal of a set of file paths.

For example, to total the size of all files under your home directory, ignoring anything in a “dot” directory (eg $HOME/.ssh):

require 'find'

total_size = 0

Find.find(ENV["HOME"]) do |path|
  if FileTest.directory?(path)
    if File.basename(path).start_with?('.')
      Find.prune       # Don't look any further into this directory.
    else
      next
    end
  else
    total_size += FileTest.size(path)
  end
end

Using Find, you can run the code against a huge drive containing millions of matches and it'll perform better than Dir.glob.

Tweaking their example, this untested code should get you started:

require 'find'
require 'nokogiri'

Find.find('temp') do |path|
  if FileTest.file?(path) && path[/\.x?html$/i]
    contents = Nokogiri::XML(File.read(path))
    # do something with contents
  end
end

A second problem you'll often see when using Dir.glob to do a top-down search (**) is that it immediately asks the OS to find all the matching files, then waits for the OS to gather them. If, instead, you use Find, your code pauses briefly each time it searches for the next match in the hierarchy, but those pauses are much shorter, resulting in a more responsive application that doesn't eat as much memory or beat the disk gathering files. On a remotely mounted drive or a file server you could end up irritating your sysadmin when they notice huge network and disk IO spikes instead of a minor increase in activity.
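
As a small, untested illustration of that difference (this assumes a Ruby version where Find.find returns an Enumerator when called without a block), you can consume the traversal lazily and stop as soon as you have enough matches:

require 'find'

# Dir.glob has to finish scanning the whole tree before it returns its array;
# Find yields each path as the walk reaches it, so you can stop early.
first_pages = Find.find('temp')
                  .lazy
                  .select { |path| FileTest.file?(path) && path[/\.x?html$/i] }
                  .first(5)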
