How to remove the path part from a list of files and copy it into another file?

I need to accomplish the following things with bash scripting in FreeBSD:

  • Create a directory.
  • Generate 1000 unique files whose names are taken from other random files in the system.
  • Each file must contain information about the original file whose name it has taken: its name and size, without the original contents of the file.
  • The script must show information about the speed of its execution in ms.

What I could accomplish was to take the names and paths of 1000 unique files with the commands find and grep and put them in a list. Then I just can't imagine how to remove the path part and create the files in the other directory with names taken from the list of random files. I tried a for loop with the basename command in it, but somehow I can't get it to work, and I don't know how to do the other tasks either...

[Update: I've wanted to come back to this question to try to make my response more useful and portable across platforms (OS X is a Unix!) and $SHELLs, even though the original question specified bash and zsh. Other responses assumed a temporary file listing of "random" file names, since the question did not show how the list was constructed or how the selection was made. I show one method for constructing the list in my response using a temporary file. I'm not sure how one could randomize the find operation "inline" and hope someone else can show how this might be done (portably). I also hope this attracts some comments and critique: you can never know too many $SHELL tricks. I removed the perl reference, but I hereby challenge myself to do this again in perl and, because perl is pretty portable, make it run on Windows. I will wait a while for comments and then shorten and clean up this answer. Thanks.]

Creating the file listing

You can do a lot with GNU find(1). The following creates a single file with three tab-separated columns of the data you want (file name, full path, and size on disk in 1 KB blocks).

find / -type f -fprintf tmp.txt '%f\t%h/%f\t%k \n'

I'm assuming that you want to be random across all file names (i.e. no links), so you'll grab the entries from the whole file system. I have 800000 files on my workstation but a lot of RAM, so this doesn't take too long to do. My laptop has about 300K files and not much memory, but creating the complete listing still only took a couple of minutes or so. You'll want to adjust by excluding or pruning certain directories from the search.
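As a rough sketch of the pruning mentioned above, -prune keeps find out of a whole directory tree. The function name list_files and its two arguments are illustrative assumptions, not part of the original answer; -path and -prune themselves are standard find primaries.

```sh
#!/bin/sh
# list_files ROOT EXCLUDED_DIR  (hypothetical helper)
# Print every regular file under ROOT, never descending into EXCLUDED_DIR.
list_files() {
  find "$1" -path "$2" -prune -o -type f -print
}

# Example use (paths are assumptions; adjust for your system):
#   list_files / /proc > tmp.txt
```

To exclude several trees, the single -path test can be replaced by a parenthesised group such as \( -path /proc -o -path /dev \).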

A nice thing about the -fprintf flag is that it seems to take care of spaces in file names. By examining the file with vim and sed (i.e. looking for lines with spaces) and comparing the output of wc -l and uniq, you can get a sense of your output and whether the resulting listing is sane or not. You could then pipe this through cut, grep or sed, awk and friends in order to create the files in the way you want. For example, from the shell prompt:

~/# cut -f1 tmp.txt | while IFS= read -r i; do touch -- "$i"; done
~/# cut -f1 tmp.txt | while IFS= read -r i; do grep -F -- "$i" tmp.txt > "$i.dat"; done

I'm giving the files we create a .dat extension here to distinguish them from the files to which they refer, and to make it easier to move them around or delete them. You don't have to do that: just leave off the extension and redirect to the bare name instead of the .dat name.
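The sanity check mentioned earlier (comparing wc -l with the number of distinct lines, and looking for names with spaces) could be sketched roughly as follows. The helper name check_listing is made up for illustration, and it assumes the tab-separated listing format with the file name in the first column.

```sh
#!/bin/sh
# check_listing FILE  (hypothetical helper)
# Report total lines, distinct lines, and how many names contain a space.
check_listing() {
  total=$(wc -l < "$1" | tr -d ' ')              # tr strips the padding BSD wc adds
  distinct=$(sort -u "$1" | wc -l | tr -d ' ')
  spaced=$(cut -f1 "$1" | grep -c ' ')           # first tab-separated column only
  echo "lines=$total distinct=$distinct names_with_spaces=$spaced"
}

# Example use: check_listing tmp.txt
```

If lines and distinct differ, the listing contains duplicate entries.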

The bad thing about the -fprintf flag is that it is only available with GNU find and is not a POSIX standard flag, so it won't be available with the find(1) on OS X or BSD (though GNU find may be installed on your Unix as gfind or gnufind). A more portable way to do this is to create a straight-up list of files with find / -type f > tmp.txt (this takes about 15 seconds on my system with 800k files and many slow drives in a ZFS pool; coming up with something more efficient should be easy for people to do in the comments!). From there you can create the data values you want using standard utilities to process the file listing, as Florin Stingaciu shows in his answer.

#!/bin/sh

# portably get a random number (OS X, BSD, Linux and $SHELLs w/o $RANDOM)
randnum=`od -An -N 4 -D < /dev/urandom` ; echo $randnum

# Read the listing line by line so that paths containing spaces stay intact
while IFS= read -r file
do
    name=`basename "$file"`
    size=`wc -c < "$file" | awk '{print $1}'`

# Uncomment the next line to see the values on STDOUT
#    printf 'Name: %s\nSize: %s\n' "$name" "$size"

# Uncomment the next line to put data into the respective .dat files
#    printf 'Location: %s\nSize: %s\n' "$file" "$size" > "$name.dat"

done < tmp.txt

# vim: ft=sh

If you've been following this far, you'll realize that this will create a lot of files: on my workstation it would create 800k .dat files, which is not what we want! So, how do we randomly select 1000 files from our listing of 800k for processing? There are several ways to go about it.

Randomly selecting from the file listing

We have a listing of all the files on the system (!). Now, in order to select 1000 files, we just need to randomly select 1000 lines from our listing file (tmp.txt). We can set an upper limit on the line number to select by generating a random number using the cool od technique you saw above (it's so cool and cross-platform that I have this aliased in my shell ;-) and then performing modulo division (%) on it, using the number of lines in the file as the divisor. Then we just take that number and select the line in the file to which it corresponds with awk or sed (e.g. sed -n <$RANDOMNUMBER>p filelist), iterate 1000 times and presto! We have a new list of 1000 random files. Or not ... it's really slow!

While looking for a way to speed up awk and sed I came across an excellent trick using dd, from Alex Lines, that seeks into the file by bytes (instead of lines) and translates the result into a line using sed or awk. See Alex's blog for the details. My only problems with his technique came with setting the count= switch to a high enough number. For mysterious reasons (which I hope someone will explain), perhaps because my locale is LC_ALL=en_US.UTF-8, dd would spit incomplete lines into randlist.txt unless I set count= to a much higher number than the actual maximum line length. I think I was probably mixing up characters and bytes. Any explanations?
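The slow per-line selection described above (random number, modulo the line count, then sed) can be sketched like this. The function name rand_line is hypothetical, the file is assumed to be non-empty, and -t u4 is the POSIX spelling of the -D option that BSD od documents as equivalent.

```sh
#!/bin/sh
# rand_line FILE  (hypothetical helper)
# Print one randomly chosen line of FILE; assumes FILE has at least one line.
rand_line() {
  lines=$(wc -l < "$1" | tr -d ' ')
  # 4 random bytes as one unsigned decimal number, stripped of padding
  rnd=$(od -An -N4 -tu4 < /dev/urandom | tr -dc '0-9')
  # modulo gives 0..lines-1, but sed counts lines from 1, hence the + 1
  sed -n "$(( rnd % lines + 1 ))p" "$1"
}

# Example use: rand_line tmp.txt
```

Each call rereads the listing from the start, which is why repeating it 1000 times over a listing of 800k lines is slow.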

So after the above caveats, and hoping it works on more than two platforms, here's my attempt at solving the problem:

#!/bin/sh
IFS='
'
# We create tmp.txt with
# find / -type f > tmp.txt  # tweak as needed.
#
files="tmp.txt"

# Get the size in bytes and the maximum line length for later
bytesize=`wc -c < "$files"`
# wc -L is not POSIX and we need to multiply so:
linelenx10=`awk '{if(length > x) {x=length} }END{print x*10}' "$files"`

# A function to generate a random number modulo the
# number of bytes in the file. We'll use this to find a
# random location in our file where we can grab a line
# using dd and awk.

genrand () {
  echo `od -An -N 4 -D < /dev/urandom` ' % ' $bytesize | bc
}

rm -f randlist.txt

i=1
while [ $i -le 1000 ]
do
 # This probably works but is way too slow: sed -n `genrand`p $files
 # Instead, use Alex Lines' dd seek method: start at a random byte, skip the
 # (probably partial) first line and keep the complete second one.
 dd if="$files" skip=`genrand` ibs=1 count=$linelenx10 2>/dev/null |awk 'NR==2 {print;exit}'>> randlist.txt

 i=$((i+1))    # POSIX arithmetic expansion in place of i++
done

# The same line can be drawn twice; run sort -u on randlist.txt if the
# names must be unique.
for file in `cat randlist.txt`
  do
   name=`basename "$file"`
   size=`wc -c <"$file"`
   # printf instead of echo -e, which not every /bin/sh supports
   printf 'Location: %s \n\n Size: %s\n' "$file" "$size" > "$name.dat"
  done

# vim: ft=sh
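A different technique from the dd seek used above, offered only as a sketch: one-pass reservoir sampling in awk picks N distinct line numbers in a single read of the listing, so no line is drawn twice. The function name sample_lines is an assumption for illustration, and the seed is taken from /dev/urandom because a bare srand() seeds from the clock.

```sh
#!/bin/sh
# sample_lines N FILE  (hypothetical helper)
# Print N lines of FILE, each line number chosen at most once.
sample_lines() {
  awk -v n="$1" -v seed="$(od -An -N4 -tu4 < /dev/urandom | tr -dc '0-9')" '
    BEGIN { srand(seed) }
    NR <= n { pick[NR] = $0; next }      # fill the reservoir with the first n lines
    {
      j = int(rand() * NR) + 1           # uniform integer in 1..NR
      if (j <= n) pick[j] = $0           # keep this line with probability n/NR
    }
    END {
      m = (NR < n) ? NR : n              # the file may hold fewer than n lines
      for (i = 1; i <= m; i++) print pick[i]
    }
  ' "$2"
}

# Example use: sample_lines 1000 tmp.txt > randlist.txt
```

If the listing itself contains the same path on two lines, both copies can still be selected, so the listing should be deduplicated first.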

What I could accomplish was to take the names and paths of 1000 unique files with the commands "find" and "grep" and put them in a list

I'm going to assume that there is a file that holds, on each line, the full path to one file (FULL_PATH_TO_LIST_FILE). Since there aren't many statistics associated with this process, I omitted that part; you can add your own, however.

cd WHEREVER_YOU_WANT_TO_CREATE_NEW_FILES
## Read one path per line so that spaces in paths survive
while IFS= read -r file_path
do
     ## This extracts only the file name from the path
     file_name=`basename "$file_path"`

     ## This grabs the file's size in bytes
     file_size=`wc -c < "$file_path"`

     ## Create the file and place info regarding original file within new file
     printf '%s \nThis file is %s bytes \n' "$file_name" "$file_size" > "$file_name"

done < FULL_PATH_TO_LIST_FILE
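The question also asks for the run time in milliseconds, which the loop above leaves out. One possible sketch, with the helper name now_ms being an assumption: GNU date understands %N (nanoseconds), and where date does not, the output is not purely numeric and the helper falls back to whole seconds, so the resolution then drops to 1000 ms.

```sh
#!/bin/sh
# now_ms  (hypothetical helper): current time in milliseconds since the epoch.
now_ms() {
  ns=$(date +%s%N 2>/dev/null)
  case $ns in
    ''|*[!0-9]*) echo $(( $(date +%s) * 1000 )) ;;   # no usable %N: seconds only
    *)           echo $(( ns / 1000000 )) ;;         # nanoseconds to milliseconds
  esac
}

start=$(now_ms)
# ... the file-creating loop would go here ...
end=$(now_ms)
echo "Elapsed: $(( end - start )) ms"
```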
