
Java: Watching a directory to move large files

I have been writing a program that watches a directory and, when files are created in it, renames them and moves them to a new directory. In my first implementation I used Java's Watch Service API, which worked fine when I was testing with 1 KB files. The problem that came up is that in reality the files being created are anywhere from 50-300 MB. When this happened, the watcher API would find the file right away but could not move it because it was still being written. I tried putting the watcher in a loop (which generated exceptions until the file could be moved), but this seemed pretty inefficient.

Since that didn't work, I tried using a timer that checks the folder every 10 seconds and moves files when it can. This is the method I ended up going with.

Question: Is there any way to signal when a file is done being written, without doing an exception check or continually comparing the size? I like the idea of using the Watcher API just once for each file instead of continually checking with a timer (and running into exceptions).

All responses are greatly appreciated!

nt

I ran into the same problem today. In my use case, a small delay before the file is actually imported was not a big problem, and I still wanted to use the NIO2 API. The solution I chose was to wait until a file has not been modified for 10 seconds before performing any operations on it.

The important part of the implementation is as follows. The program waits until the wait time expires or a new event occurs. The expiration time is reset every time a file is modified, and if a file is deleted before the wait time expires, it is removed from the list. I use the poll method with a timeout of the expected expiration time, that is (lastModified + waitTime) - currentTime.

private final Map<Path, Long> expirationTimes = new HashMap<>();
private final long newFileWait = 10000L;

public void run() {
    try {
        for (;;) {
            // Retrieves and removes the next watch key, waiting if none are present.
            WatchKey k = watchService.take();

            for (;;) {
                long currentTime = System.currentTimeMillis();

                if (k != null)
                    handleWatchEvents(k);

                handleExpiredWaitTimes(currentTime);

                // If there are no files left, stop polling and block on take()
                if (expirationTimes.isEmpty())
                    break;

                long minExpiration = Collections.min(expirationTimes.values());
                long timeout = minExpiration - currentTime;
                logger.debug("timeout: " + timeout);
                k = watchService.poll(timeout, TimeUnit.MILLISECONDS);
            }
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // interrupted while waiting; shut down
    }
}

private void handleExpiredWaitTimes(long currentTime) {
    // Start the import for files whose expiration time has passed.
    // Remove through the iterator to avoid a ConcurrentModificationException.
    Iterator<Entry<Path, Long>> it = expirationTimes.entrySet().iterator();
    while (it.hasNext()) {
        Entry<Path, Long> entry = it.next();
        if (entry.getValue() <= currentTime) {
            logger.debug("expired " + entry);
            // do something with the file
            it.remove();
        }
    }
}

private void handleWatchEvents(WatchKey k) {
    List<WatchEvent<?>> events = k.pollEvents();
    for (WatchEvent<?> event : events) {
        handleWatchEvent(event, keys.get(k));
    }
    // reset watch key to allow the key to be reported again by the watch service
    k.reset();
}

private void handleWatchEvent(WatchEvent<?> event, Path dir) {
    Kind<?> kind = event.kind();

    WatchEvent<Path> ev = cast(event);
    Path name = ev.context();
    Path child = dir.resolve(name);

    if (kind == ENTRY_MODIFY || kind == ENTRY_CREATE) {
        try {
            // Update the expiration time from the file's last-modified stamp.
            // Key on the resolved path so the ENTRY_DELETE branch below matches.
            FileTime lastModified = Files.getLastModifiedTime(child, NOFOLLOW_LINKS);
            expirationTimes.put(child, lastModified.toMillis() + newFileWait);
        } catch (IOException e) {
            logger.warn("Could not read attributes of " + child, e);
        }
    }

    if (kind == ENTRY_DELETE) {
        expirationTimes.remove(child);
    }
}
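The snippet above relies on fields and helpers that are not shown (watchService, keys, logger, cast). A minimal setup for the first two might look like this; the class and field names are my assumptions, chosen to match the code above:

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.HashMap;
import java.util.Map;

import static java.nio.file.StandardWatchEventKinds.*;

public class QuiescenceWatcher {
    final WatchService watchService;
    // maps each WatchKey back to the directory it was registered for
    final Map<WatchKey, Path> keys = new HashMap<>();

    QuiescenceWatcher(Path dir) throws IOException {
        watchService = FileSystems.getDefault().newWatchService();
        WatchKey key = dir.register(watchService, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE);
        keys.put(key, dir);
    }
}
```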

Write another file as an indication that the original file is complete. E.g. while 'fileorg.dat' is still growing, create 'fileorg.done' only once it is finished, and check only for 'fileorg.done'.

With clever naming conventions you should not have problems.
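A sketch of that convention (the ".done"/".dat" names are just examples): the watcher reacts only to the marker file and derives the data file sitting next to it:

```java
import java.nio.file.Path;

public class DoneMarker {
    /** Maps a ".done" marker file to the data file it announces, or null for any other file. */
    static Path dataFileFor(Path created) {
        String name = created.getFileName().toString();
        if (!name.endsWith(".done")) {
            return null; // not the marker -> ignore this watch event
        }
        // by convention the finished data file sits next to its marker
        String base = name.substring(0, name.length() - ".done".length());
        return created.resolveSibling(base + ".dat");
    }
}
```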

Two solutions:

The first is a slight variation of the answer by stacker:

Use a unique suffix for incomplete files, something like myhugefile.zip.inc instead of myhugefile.zip. Rename the file when the upload / creation is finished. Exclude .inc files from the watch.

The second is to use a different folder on the same drive to create / upload / write the files, and move them to the watched folder once they are ready. Moving should be an atomic action if they are on the same drive (file system dependent, I guess).
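With java.nio that move can be requested explicitly: ATOMIC_MOVE fails with an exception when the file system cannot do it atomically, which makes the assumption checkable. The method and path names here are illustrative:

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StageAndMove {
    /** Moves a fully written file from the staging folder into the watched folder. */
    static void publish(Path staged, Path watchedDir) throws IOException {
        Path target = watchedDir.resolve(staged.getFileName());
        try {
            // atomic on most file systems when source and target share a drive
            Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // e.g. across file systems: fall back to a plain (non-atomic) move
            Files.move(staged, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```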

Either way, the clients that create the files will have to do some extra work.

I know it's an old question, but maybe it can help somebody.

I had the same issue, so what I did was the following:

if (kind == ENTRY_CREATE) {
    System.out.println("Creating file: " + child);

    boolean isGrowing;
    do {
        long initialSize = child.toFile().length();
        Thread.sleep(1000); // propagate or handle InterruptedException as needed
        long finalSize = child.toFile().length();
        isGrowing = initialSize < finalSize;
    } while (isGrowing);

    System.out.println("Finished creating file!");
}

While the file is being created it keeps getting bigger, so I compare its size at two moments one second apart. The app stays in the loop until both sizes are the same.

It looks like Apache Camel handles the file-not-done-uploading problem by trying to rename the file (java.io.File.renameTo). If the rename fails, there is no read lock, but it keeps trying. When the rename succeeds, it renames the file back and then proceeds with the intended processing.

See operations.renameFile below. Here are the links to the Apache Camel source: GenericFileRenameExclusiveReadLockStrategy.java and FileUtil.java.

public boolean acquireExclusiveReadLock( ... ) throws Exception {
   LOG.trace("Waiting for exclusive read lock to file: {}", file);

   // the trick is to try to rename the file, if we can rename then we have exclusive read
   // since its a Generic file we cannot use java.nio to get a RW lock
   String newName = file.getFileName() + ".camelExclusiveReadLock";

   // make a copy as result and change its file name
   GenericFile<T> newFile = file.copyFrom(file);
   newFile.changeFileName(newName);
   StopWatch watch = new StopWatch();

   boolean exclusive = false;
   while (!exclusive) {
        // timeout check
        if (timeout > 0) {
            long delta = watch.taken();
            if (delta > timeout) {
                CamelLogger.log(LOG, readLockLoggingLevel,
                        "Cannot acquire read lock within " + timeout + " millis. Will skip the file: " + file);
                // we could not get the lock within the timeout period, so return false
                return false;
            }
        }

        exclusive = operations.renameFile(file.getAbsoluteFilePath(), newFile.getAbsoluteFilePath());
        if (exclusive) {
            LOG.trace("Acquired exclusive read lock to file: {}", file);
            // rename it back so we can read it
            operations.renameFile(newFile.getAbsoluteFilePath(), file.getAbsoluteFilePath());
        } else {
            boolean interrupted = sleep();
            if (interrupted) {
                // we were interrupted while sleeping, we are likely being shutdown so return false
                return false;
            }
        }
   }

   return true;
}

While it's not possible to be notified by the Watch Service API when the OS finishes copying, all the options seem to be workarounds (including this one!).

As commented above,

1) Moving or copying is not an option on UNIX;

2) File.canWrite always returns true if you have permission to write, even if the file is still being copied;

3) Waiting until a timeout or a new event occurs would be an option, but what if the system is overloaded and the copy has not finished? With a big timeout value, the program would wait too long.

4) Writing another file to 'flag' that the copy has finished is not an option if you are just consuming the file, not creating it.

An alternative is to use the code below:

boolean locked = true;

while (locked) {
    RandomAccessFile raf = null;
    try {
        // Throws FileNotFoundException while the writer still holds the file.
        // "rw" is not needed: if the file were deleted mid-copy, the 'w' option
        // would create an empty file.
        raf = new RandomAccessFile(file, "r");
        raf.seek(file.length()); // go to the last byte to make sure everything was copied
        locked = false;
    } catch (IOException e) {
        locked = file.exists();
        if (locked) {
            System.out.println("File locked: '" + file.getAbsolutePath() + "'");
            Thread.sleep(1000); // wait some time before retrying
        } else {
            System.out.println("File was deleted while copying: '" + file.getAbsolutePath() + "'");
        }
    } finally {
        if (raf != null) {
            raf.close();
        }
    }
}

Depending on how urgently you need to move the file once writing is done, you can also check for a stable last-modified timestamp and only move the file once it has quiesced. The amount of time it needs to be stable can be implementation dependent, but I would presume that something whose last-modified timestamp hasn't changed for 15 seconds should be stable enough to be moved.
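A sketch of that check (the 15-second threshold is the one suggested above; the method name is mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class QuiesceCheck {
    /** True if the file's last-modified time is at least stableMillis in the past. */
    static boolean isQuiesced(Path file, long stableMillis) throws IOException {
        long lastModified = Files.getLastModifiedTime(file).toMillis();
        return System.currentTimeMillis() - lastModified >= stableMillis;
    }
}
```

A mover would call this periodically and only move files for which it returns true.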

For large files on Linux, the file gets copied with a .filepart extension. You just need to check the extension using the commons-io API and register for the ENTRY_CREATE event. I tested this with my .csv files (1 GB) and it worked.

public void run()
{
    try
    {
        WatchKey key = myWatcher.take();
        while (key != null)
        {
            for (WatchEvent event : key.pollEvents())
            {
                if (FilenameUtils.isExtension(event.context().toString(), "filepart"))
                {
                    System.out.println("Inside the PartFile " + event.context().toString());
                } else
                {
                    System.out.println("Full file Copied " + event.context().toString());
                    //Do what ever you want to do with this files.
                }
            }
            key.reset();
            key = myWatcher.take();
        }
    } catch (InterruptedException e)
    {
        e.printStackTrace();
    }
}

If you don't have control over the write process, log all ENTRY_CREATE events and observe whether there are patterns.

In my case, the files are created via WebDAV (Apache) and a lot of temporary files are created, but two ENTRY_CREATE events are also triggered for the same file. The second ENTRY_CREATE event indicates that the copy process is complete.

Here are my example ENTRY_CREATE events. The absolute file path is printed (your log may differ, depending on the application that writes the file):

[info] application - /var/www/webdav/.davfs.tmp39dee1 was created
[info] application - /var/www/webdav/document.docx was created
[info] application - /var/www/webdav/.davfs.tmp054fe9 was created
[info] application - /var/www/webdav/document.docx was created
[info] application - /var/www/webdav/.DAV/__db.document.docx was created 

As you can see, I get two ENTRY_CREATE events for document.docx. After the second event I know the file is complete. Temporary files are obviously ignored in my case.
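If your writer shows the same pattern, a small per-path counter captures it. Note the "two events means done" rule is specific to this WebDAV setup; verify it against your own logs first:

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class SecondCreateDetector {
    private final Map<Path, Integer> createCounts = new HashMap<>();

    /** Returns true on the second ENTRY_CREATE seen for this path. */
    boolean isComplete(Path file) {
        int count = createCounts.merge(file, 1, Integer::sum);
        if (count >= 2) {
            createCounts.remove(file); // reset for the next upload of the same name
            return true;
        }
        return false;
    }
}
```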

This is a very interesting discussion, as this is certainly a bread-and-butter use case: wait for a new file to be created and then react to it in some fashion. The race condition here is interesting, as the high-level requirement is to get an event and then actually obtain (at least) a read lock on the file. With large files or simply lots of file creations, this could require a whole pool of worker threads that periodically try to get locks on newly created files and, when they succeed, actually do the work. But as I am sure NT realizes, one would have to do this carefully to make it scale, as it is ultimately a polling approach, and scalability and polling aren't two words that go together well.

So, I had the same problem and the following solution worked for me. An earlier unsuccessful attempt: monitoring the "lastModifiedTime" stat of each file, but I noticed that a large file's size growth may pause for some time (the size does not change continuously).

Basic idea: for every event, create a trigger file (in a temporary directory) whose name has the following format:

OriginalFileName_lastModifiedTime_numberOfTries

This file is empty; all the play is only in the name. The original file is only considered done after intervals of a specific duration have passed without a change in its "last modified time" stat. (Note: since it's a file stat, there's no overhead -> O(1).)

NOTE - this trigger file is handled by a different service (say 'FileTrigger').

Advantages -

  1. No sleep or wait holds up the system.
  2. It relieves the file watcher to monitor other events.

CODE for FileWatcher -

val triggerFileName: String = triggerFileTempDir + originalFileName + "_" + Files.getLastModifiedTime(Paths.get(event.getFile.getName.getPath)).toMillis + "_0"

// creates trigger file in temporary directory
val triggerFile: File = new File(triggerFileName)
val isCreated: Boolean = triggerFile.createNewFile()

if (isCreated)
    println("Trigger created: " + triggerFileName)
else
    println("Error in creating trigger file: " + triggerFileName)

CODE for FileTrigger (a cron job with an interval of, say, 5 minutes) -

 val actualPath : String = "Original file directory here"
 val tempPath : String = "Trigger file directory here"
 val folder : File = new File(tempPath)    
 val listOfFiles = folder.listFiles()

for (i <- listOfFiles)
{

    // ActualFileName_LastModifiedTime_NumberOfTries
    val triggerFileName: String = i.getName
    val triggerFilePath: String = i.toString

    // extracting file info from trigger file name
    val fileInfo: Array[String] = triggerFileName.split("_", 3)
    // 0 -> Original file name, 1 -> last modified time, 2 -> number of tries

    val actualFileName: String = fileInfo(0)
    val actualFilePath: String = actualPath + actualFileName
    val modifiedTime: Long = fileInfo(1).toLong
    val numberOfTries: Int = fileInfo(2).toInt

    val currentModifiedTime: Long = Files.getLastModifiedTime(Paths.get(actualFilePath)).toMillis
    val differenceInModifiedTimes: Long = currentModifiedTime - modifiedTime
    // checks if file has been copied completely(4 intervals of 5 mins each with no modification)
    if (differenceInModifiedTimes == 0 && numberOfTries == 3)
    {
        FileUtils.deleteQuietly(new File(triggerFilePath))
        println("Trigger file deleted. Original file completed : " + actualFilePath)
    }
    else
    {
        var newTriggerFileName: String = null
        if (differenceInModifiedTimes == 0)
        {
            // updates numberOfTries by 1
            newTriggerFileName = actualFileName + "_" + modifiedTime + "_" + (numberOfTries + 1)
        }
        else
        {
            // updates modified timestamp and resets numberOfTries to 0
            newTriggerFileName = actualFileName + "_" + currentModifiedTime + "_" + 0
        }

        // renames trigger file
        new File(triggerFilePath).renameTo(new File(tempPath + newTriggerFileName))
        println("Trigger file renamed: " + triggerFileName + " -> " + newTriggerFileName)
    }    
}

I had to deal with a similar situation when I implemented a file system watcher to transfer uploaded files. The solution I implemented to solve this problem consists of the following:

1- First of all, maintain a map of unprocessed files (as long as the file is still being copied, the file system generates Modify_Event, so you can ignore those events while the file's flag is false).

2- In your fileProcessor, pick up a file from the list and check whether it's locked by the file system. If it is, you will get an exception; just catch it, put your thread into a wait state (e.g. 10 seconds), and then retry until the lock is released. After processing the file, either change its flag to true or remove it from the map.
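One way to probe for the lock is a sketch like the following. The class and method names are mine, and on POSIX systems file locks are advisory, so this only detects a writer that also takes a lock; on Windows the open itself fails while the file is held:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockProbe {
    /** Retries until a shared lock can be taken, sleeping between attempts. */
    static boolean waitUntilUnlocked(Path file, long waitMillis, int maxTries)
            throws InterruptedException {
        for (int i = 0; i < maxTries; i++) {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
                 FileLock lock = ch.tryLock(0, Long.MAX_VALUE, true)) {
                if (lock != null) {
                    return true; // nobody else holds the lock; safe to process
                }
            } catch (IOException e) {
                // file still locked or still being written -> fall through and retry
            }
            Thread.sleep(waitMillis);
        }
        return false; // gave up after maxTries attempts
    }
}
```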

This solution will not be efficient if many versions of the same file are transferred during the wait timeslot.

Cheers, Ramzi

I would speculate that java.io.File.canWrite() will tell you when the file is done being written.
