
Java: Watching a directory to move large files

I have been writing a program that watches a directory and when files are created in it, it changes the name and moves them to a new directory. In my first implementation I used Java's Watch Service API which worked fine when I was testing 1kb files. The problem that came up is that in reality the files getting created are anywhere from 50-300mb. When this happened the watcher API would find the file right away but could not move it because it was still being written. I tried putting the watcher in a loop (which generated exceptions until the file could be moved) but this seemed pretty inefficient.

Since that didn't work, I tried using a timer that checks the folder every 10 s and moves the files when it can. This is the method I ended up going with.

Question: Is there any way to signal when a file is done being written without doing an exception check or continually comparing the size? I like the idea of using the Watcher API just once for each file instead of continually checking with a timer (and running into exceptions).

All responses are greatly appreciated!

nt

I ran into the same problem today. In my use case a small delay before the file is actually imported was not a big problem, and I still wanted to use the NIO2 API. The solution I chose was to wait until a file has not been modified for 10 seconds before performing any operations on it.

The important part of the implementation is as follows. The program waits until the wait time expires or a new event occurs. The expiration time is reset every time a file is modified. If a file is deleted before the wait time expires, it is removed from the list. I use the poll method with a timeout of the expected expiration time, that is (lastModified + waitTime) - currentTime.

private final Map<Path, Long> expirationTimes = newHashMap();
private Long newFileWait = 10000L;

public void run() {
    try {
        for (;;) {
            // Retrieves and removes the next watch key, waiting if none are present.
            WatchKey k = watchService.take();

            for (;;) {
                long currentTime = System.currentTimeMillis();

                if (k != null)
                    handleWatchEvents(k);

                handleExpiredWaitTimes(currentTime);

                // If there are no files left, stop polling and block on take()
                if (expirationTimes.isEmpty())
                    break;

                long minExpiration = Collections.min(expirationTimes.values());
                long timeout = minExpiration - currentTime;
                logger.debug("timeout: " + timeout);
                k = watchService.poll(timeout, TimeUnit.MILLISECONDS);
            }
        }
    } catch (InterruptedException e) {
        // take()/poll() were interrupted; restore the flag and exit the loop
        Thread.currentThread().interrupt();
    }
}

private void handleExpiredWaitTimes(Long currentTime) {
    // Start import for files whose expiration time has passed.
    // Use an explicit iterator so entries can be removed while iterating
    // without a ConcurrentModificationException.
    Iterator<Entry<Path, Long>> it = expirationTimes.entrySet().iterator();
    while (it.hasNext()) {
        Entry<Path, Long> entry = it.next();
        if (entry.getValue() <= currentTime) {
            logger.debug("expired " + entry);
            // do something with the file
            it.remove();
        }
    }
}

private void handleWatchEvents(WatchKey k) {
    for (WatchEvent<?> event : k.pollEvents()) {
        try {
            handleWatchEvent(event, keys.get(k));
        } catch (IOException e) {
            logger.error("Failed to handle watch event", e);
        }
    }
    // reset the watch key to allow it to be reported again by the watch service
    k.reset();
}

private void handleWatchEvent(WatchEvent<?> event, Path dir) throws IOException {
    Kind<?> kind = event.kind();

    WatchEvent<Path> ev = cast(event);
    Path name = ev.context();
    Path child = dir.resolve(name);

    if (kind == ENTRY_MODIFY || kind == ENTRY_CREATE) {
        // Update the expiration time from the file's last-modified stamp.
        // Key on the resolved path so the ENTRY_DELETE branch below removes
        // the same entry.
        FileTime lastModified = Files.readAttributes(child, BasicFileAttributes.class, NOFOLLOW_LINKS)
                .lastModifiedTime();
        expirationTimes.put(child, lastModified.toMillis() + newFileWait);
    }

    if (kind == ENTRY_DELETE) {
        expirationTimes.remove(child);
    }
}

Write another file as an indication that the original file is complete. If 'fileorg.dat' is growing, then when it is done create a file 'fileorg.done' and check only for 'fileorg.done'.

With clever naming conventions you should not have problems.
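The writer side of this convention can be sketched as follows. This is a minimal sketch: the file names follow the 'fileorg.dat' / 'fileorg.done' example above, and the temporary directory exists only so the demo is self-contained.

```java
import java.io.IOException;
import java.nio.file.*;

public class DoneFlag {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("out");
        Path data = dir.resolve("fileorg.dat");
        Path done = dir.resolve("fileorg.done");

        // Writer side: finish the payload first, then create the marker.
        Files.write(data, "payload".getBytes());
        Files.createFile(done);

        // Watcher side: only react once the .done marker exists.
        if (Files.exists(done)) {
            System.out.println("ready: " + data.getFileName());
        }
    }
}
```

The key point is the ordering: the marker is only created after the last byte of the data file has been written, so seeing the marker guarantees the data file is complete.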

Two solutions:

The first is a slight variation of the answer by stacker:

Use a unique prefix for incomplete files. Something like myhugefile.zip.inc instead of myhugefile.zip. Rename the file when the upload / creation is finished. Exclude .inc files from the watch.

The second is to use a different folder on the same drive to create / upload / write the files and move them to the watched folder once they are ready. Moving should be an atomic action if they are on the same drive (file system dependent, I guess).

Either way, the clients that create the files will have to do some extra work.
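The second approach can be sketched with java.nio.file.Files.move. The directory layout here is an assumption for the demo (both directories are created under one temp root so they share a file system), and the code falls back to a plain move when the file system cannot rename atomically:

```java
import java.io.IOException;
import java.nio.file.*;

public class AtomicMoveExample {
    public static void main(String[] args) throws IOException {
        // Stage the file in a directory on the same drive as the watched
        // folder (both live under the same temp root here).
        Path root = Files.createTempDirectory("demo");
        Path staging = Files.createDirectory(root.resolve("staging"));
        Path watched = Files.createDirectory(root.resolve("watched"));

        Path tmp = staging.resolve("report.csv");
        Files.write(tmp, "large payload...".getBytes());

        Path target = watched.resolve("report.csv");
        try {
            // A same-file-system move is a single rename, so the watcher
            // never observes a half-written file.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fall back when the file system cannot rename atomically.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println(Files.exists(target) && !Files.exists(tmp));
    }
}
```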

I know it's an old question but maybe it can help somebody.

I had the same issue, so what I did was the following:

if (kind == ENTRY_CREATE) {
    System.out.println("Creating file: " + child);

    boolean isGrowing;
    long initialWeight;
    long finalWeight;

    do {
        initialWeight = child.toFile().length();
        Thread.sleep(1000);
        finalWeight = child.toFile().length();
        isGrowing = initialWeight < finalWeight;
    } while (isGrowing);

    System.out.println("Finished creating file!");
}

While the file is being created, it keeps getting bigger. So what I did was compare two sizes taken one second apart. The app stays in the loop until both sizes are the same.

Looks like Apache Camel handles the file-not-done-uploading problem by trying to rename the file (java.io.File.renameTo). If the rename fails, no read lock, but keep trying. When the rename succeeds, they rename it back, then proceed with intended processing.

See operations.renameFile below. Here are the links to the Apache Camel source: GenericFileRenameExclusiveReadLockStrategy.java and FileUtil.java

public boolean acquireExclusiveReadLock( ... ) throws Exception {
   LOG.trace("Waiting for exclusive read lock to file: {}", file);

   // the trick is to try to rename the file, if we can rename then we have exclusive read
   // since its a Generic file we cannot use java.nio to get a RW lock
   String newName = file.getFileName() + ".camelExclusiveReadLock";

   // make a copy as result and change its file name
   GenericFile<T> newFile = file.copyFrom(file);
   newFile.changeFileName(newName);
   StopWatch watch = new StopWatch();

   boolean exclusive = false;
   while (!exclusive) {
        // timeout check
        if (timeout > 0) {
            long delta = watch.taken();
            if (delta > timeout) {
                CamelLogger.log(LOG, readLockLoggingLevel,
                        "Cannot acquire read lock within " + timeout + " millis. Will skip the file: " + file);
                // we could not get the lock within the timeout period, so return false
                return false;
            }
        }

        exclusive = operations.renameFile(file.getAbsoluteFilePath(), newFile.getAbsoluteFilePath());
        if (exclusive) {
            LOG.trace("Acquired exclusive read lock to file: {}", file);
            // rename it back so we can read it
            operations.renameFile(newFile.getAbsoluteFilePath(), file.getAbsoluteFilePath());
        } else {
            boolean interrupted = sleep();
            if (interrupted) {
                // we were interrupted while sleeping, we are likely being shutdown so return false
                return false;
            }
        }
   }

   return true;
}

While it's not possible to be notified by the Watch Service API when the OS finishes copying, all the options seem to be workarounds (including this one!).

As commented above,

1) Moving or copying is not an option on UNIX;

2) File.canWrite always returns true if you have permission to write, even if the file is still being copied;

3) Waiting until a timeout or a new event occurs would be an option, but what if the system is overloaded and the copy has not finished? If the timeout is a big value, the program would wait too long.

4) Writing another file to 'flag' that the copy finished is not an option if you are just consuming the file, not creating.

An alternative is to use the code below:

boolean locked = true;

while (locked) {
    RandomAccessFile raf = null;
    try {
        // Throws FileNotFoundException while the file is still locked.
        // "r" is enough: "rw" would create an empty file if the original
        // were deleted while copying.
        raf = new RandomAccessFile(file, "r");
        raf.seek(file.length()); // go to the last byte to make sure everything was copied
        locked = false;
    } catch (IOException e) {
        locked = file.exists();
        if (locked) {
            System.out.println("File locked: '" + file.getAbsolutePath() + "'");
            Thread.sleep(1000); // wait some time before retrying
        } else {
            System.out.println("File was deleted while copying: '" + file.getAbsolutePath() + "'");
        }
    } finally {
        if (raf != null) {
            raf.close();
        }
    }
}

Depending on how urgently you need to move the file once it is done being written, you can also check for a stable last-modified timestamp and only move the file once it has quiesced. The amount of time you need it to be stable can be implementation dependent, but I would presume that something whose last-modified timestamp hasn't changed for 15 seconds should be stable enough to be moved.
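A minimal sketch of that stable-timestamp check. The helper name, polling interval, and threshold are illustrative (the demo uses short values; in practice you would use something like the 15-second window suggested above):

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.FileTime;

public class QuiesceCheck {
    // Returns once the file's last-modified time has been stable for
    // stableMillis, resetting the clock whenever the timestamp changes.
    static void waitUntilStable(Path file, long stableMillis, long pollMillis)
            throws IOException, InterruptedException {
        FileTime last = Files.getLastModifiedTime(file);
        long stableSince = System.currentTimeMillis();
        while (System.currentTimeMillis() - stableSince < stableMillis) {
            Thread.sleep(pollMillis);
            FileTime now = Files.getLastModifiedTime(file);
            if (!now.equals(last)) {
                last = now;
                stableSince = System.currentTimeMillis();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("demo", ".dat");
        Files.write(f, "done".getBytes());
        waitUntilStable(f, 200, 50); // short demo values; use ~15 s in practice
        System.out.println("stable");
    }
}
```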

For large files on Linux, the file gets copied with an extension of .filepart. You just need to check the extension using the Commons IO API and register for the ENTRY_CREATE event. I tested this with my .csv files (1 GB) and it worked.

public void run()
{
    try
    {
        WatchKey key = myWatcher.take();
        while (key != null)
        {
            for (WatchEvent event : key.pollEvents())
            {
                if (FilenameUtils.isExtension(event.context().toString(), "filepart"))
                {
                    System.out.println("Inside the PartFile " + event.context().toString());
                } else
                {
                    System.out.println("Full file Copied " + event.context().toString());
                    //Do what ever you want to do with this files.
                }
            }
            key.reset();
            key = myWatcher.take();
        }
    } catch (InterruptedException e)
    {
        e.printStackTrace();
    }
}

If you don't have control over the write process, log all ENTRY_CREATED events and observe whether there are patterns.

In my case, the files are created via WebDav (Apache) and a lot of temporary files are created but also two ENTRY_CREATED events are triggered for the same file. The second ENTRY_CREATED event indicates that the copy process is complete.

Here are my example ENTRY_CREATED events. The absolute file path is printed (your log may differ, depending on the application that writes the file):

[info] application - /var/www/webdav/.davfs.tmp39dee1 was created
[info] application - /var/www/webdav/document.docx was created
[info] application - /var/www/webdav/.davfs.tmp054fe9 was created
[info] application - /var/www/webdav/document.docx was created
[info] application - /var/www/webdav/.DAV/__db.document.docx was created 

As you see, I get two ENTRY_CREATED events for document.docx. After the second event I know the file is complete. Temporary files are obviously ignored in my case.
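A small sketch of how a watcher could exploit this pattern, assuming (as in this WebDav setup) that the second ENTRY_CREATE for the same path marks completion. The class and method names are hypothetical:

```java
import java.nio.file.*;
import java.util.HashMap;
import java.util.Map;

public class SecondCreateEvent {
    private final Map<Path, Integer> createCounts = new HashMap<>();

    // Returns true when the given path has been reported by a second
    // ENTRY_CREATE event, which in this setup signals completion.
    boolean isSecondCreate(Path file) {
        int n = createCounts.merge(file, 1, Integer::sum);
        if (n >= 2) {
            createCounts.remove(file); // done; forget the counter
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SecondCreateEvent tracker = new SecondCreateEvent();
        Path doc = Paths.get("/var/www/webdav/document.docx");
        System.out.println(tracker.isSecondCreate(doc)); // first event
        System.out.println(tracker.isSecondCreate(doc)); // second event
    }
}
```

Note that this only works when the writing application reliably produces exactly two ENTRY_CREATE events per file, so verify the pattern in your own logs first.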

This is a very interesting discussion, as certainly this is a bread and butter use case: wait for a new file to be created and then react to the file in some fashion. The race condition here is interesting, as certainly the high-level requirement here is to get an event and then actually obtain (at least) a read lock on the file. With large files or just simply lots of file creations, this could require a whole pool of worker threads that just periodically try to get locks on newly created files and, when they're successful, actually do the work. But as I am sure NT realizes, one would have to do this carefully to make it scale as it is ultimately a polling approach, and scalability and polling aren't two words that go together well.

So, I had the same problem and the following solution worked for me. An earlier unsuccessful attempt: monitoring the "lastModifiedTime" stat of each file, but I noticed that a large file's size growth may pause for some time (the size does not change continuously).

Basic Idea - For every event, create a trigger file(in a temporary directory) whose name is of the following format -

OriginalFileName_lastModifiedTime_numberOfTries

This file is empty and all the play is only in the name. The original file will only be considered after intervals of a specific duration have passed without a change in its "last modified time" stat. (Note: since it's a file stat, there's no overhead -> O(1))

NOTE - This trigger file is handled by a different service(say ' FileTrigger ').

Advantage -

  1. No sleep or wait to hold the system.
  2. Relieves the file watcher to monitor other events

CODE for FileWatcher -

val triggerFileName: String = triggerFileTempDir + originalFileName + "_" + Files.getLastModifiedTime(Paths.get(event.getFile.getName.getPath)).toMillis + "_0"

// creates trigger file in temporary directory
val triggerFile: File = new File(triggerFileName)
val isCreated: Boolean = triggerFile.createNewFile()

if (isCreated)
    println("Trigger created: " + triggerFileName)
else
    println("Error in creating trigger file: " + triggerFileName)

CODE for FileTrigger (cron job of interval say 5 mins) -

 val actualPath : String = "Original file directory here"
 val tempPath : String = "Trigger file directory here"
 val folder : File = new File(tempPath)    
 val listOfFiles = folder.listFiles()

for (i <- listOfFiles)
{

    // ActualFileName_LastModifiedTime_NumberOfTries
    val triggerFileName: String = i.getName
    val triggerFilePath: String = i.toString

    // extracting file info from trigger file name
    val fileInfo: Array[String] = triggerFileName.split("_", 3)
    // 0 -> Original file name, 1 -> last modified time, 2 -> number of tries

    val actualFileName: String = fileInfo(0)
    val actualFilePath: String = actualPath + actualFileName
    val modifiedTime: Long = fileInfo(1).toLong
    val numberOfTries: Int = fileInfo(2).toInt

    val currentModifiedTime: Long = Files.getLastModifiedTime(Paths.get(actualFilePath)).toMillis
    val differenceInModifiedTimes: Long = currentModifiedTime - modifiedTime
    // checks if file has been copied completely(4 intervals of 5 mins each with no modification)
    if (differenceInModifiedTimes == 0 && numberOfTries == 3)
    {
        FileUtils.deleteQuietly(new File(triggerFilePath))
        println("Trigger file deleted. Original file completed : " + actualFilePath)
    }
    else
    {
        var newTriggerFileName: String = null
        if (differenceInModifiedTimes == 0)
        {
            // updates numberOfTries by 1
            newTriggerFileName = actualFileName + "_" + modifiedTime + "_" + (numberOfTries + 1)
        }
        else
        {
            // updates modified timestamp and resets numberOfTries to 0
            newTriggerFileName = actualFileName + "_" + currentModifiedTime + "_" + 0
        }

        // renames trigger file
        new File(triggerFilePath).renameTo(new File(tempPath + newTriggerFileName))
        println("Trigger file renamed: " + triggerFileName + " -> " + newTriggerFileName)
    }    
}

I had to deal with a similar situation when I implemented a file system watcher to transfer uploaded files. The solution I implemented to solve this problem consists of the following:

1- First of all, maintain a map of unprocessed files (as long as a file is still being copied, the file system generates MODIFY events, so you can ignore them while the file's flag is false).

2- In your fileProcessor, you pick up a file from the list and check whether it's locked by the file system. If it is, you will get an exception; just catch it, put your thread in a wait state (e.g. 10 seconds), and then retry until the lock is released. After processing the file, you can either change the flag to true or remove it from the map.

This solution will not be efficient if many versions of the same file are transferred during the wait window.
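Step 2 above can be sketched like this. Opening the file for reading and seeking to its end serves as the lock probe, and the retry count and delay are illustrative values:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class LockProbe {
    // Tries to open the file and seek to its end; if the OS still holds it
    // (common while a copy is in progress), the open throws and we retry
    // after a pause. Returns false if the file never becomes readable.
    static boolean waitForRelease(Path file, int maxTries, long delayMillis)
            throws InterruptedException {
        for (int i = 0; i < maxTries; i++) {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                ch.position(ch.size()); // reach the last byte as a sanity check
                return true;
            } catch (IOException locked) {
                Thread.sleep(delayMillis); // wait state before retrying
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("demo", ".bin");
        System.out.println(waitForRelease(f, 3, 100));
    }
}
```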

Cheers, Ramzi

I would guess that java.io.File.canWrite() will tell you when a file is done being written.
