
Extract lines from a file only once using shell/python/perl

I have a big file with numbers, for example:

cat $file
3120987654
3106982658
3420787642
3210957659
3320987654
3520987654
    .
    .
    .

Daily I extract some numbers from the big file and save that day's numbers in a second file. Each day new numbers are added to the source data in my big file. I need a filter for the extraction job that ensures I do not extract numbers I have already extracted. How might I do this as a bash or Python script?

Note: I cannot remove the numbers from the source data ("big file"); I need it to remain intact, because when I finish extracting numbers from the file, I need the original + updated data for the next day's job. If I create a copy of the file and remove the numbers from the copy, the new numbers that are added are not taken into consideration.

Read all numbers from the big file into a set, then test new numbers against that:

with open('bigfile.txt') as bigfile:
    existing_numbers = {n.strip() for n in bigfile}

# Open in append mode ('a'), not 'w' — 'w' would truncate bigfile.txt
# and destroy the original data the question says must stay intact.
with open('newfile.txt') as newfile, open('bigfile.txt', 'a') as bigfile:
    for number in newfile:
        number = number.strip()
        if number not in existing_numbers:
            bigfile.write(number + '\n')

This appends numbers not already in bigfile to the end, in as efficient a way as possible.

If bigfile becomes too big for the above to run efficiently, you may need to use a database instead.
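A variant of the same set-based idea keeps the big file completely untouched, as the question requires, by tracking previously extracted numbers in a separate state file. This is a minimal sketch; the file names and the helper function are illustrative, not part of the original answer:

```python
def extract_new(source_path, extracted_path):
    """Return lines of source_path not yet recorded in extracted_path,
    and record them there for the next run. The source file is only read."""
    try:
        with open(extracted_path) as f:
            seen = {line.strip() for line in f}
    except FileNotFoundError:
        seen = set()  # first run: nothing has been extracted yet

    with open(source_path) as f:
        new = [line.strip() for line in f if line.strip() not in seen]

    # Record what we just extracted so tomorrow's run skips it.
    with open(extracted_path, 'a') as f:
        f.writelines(n + '\n' for n in new)
    return new
```

Running it a second time on an unchanged source returns an empty list, which is exactly the "extract only once" behaviour asked for.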

You can save sorted versions of your source file and extracted data to temporary files, and you can use a standard POSIX tool like comm to show the common lines/records. Those common lines would be the basis of the "filter" you'd use in your subsequent extract jobs. If you are extracting records from the source.txt file with $SHELL commands, then something like grep -v [list of common lines] would be part of your script, along with whatever other criteria you are using for extracting the records. For best results the source.txt and extracted.txt files should be sorted.

Here's a quick cut and paste of typical comm output. The sequence shows the "Big File", the extracted data, and then the final comm command, which shows the lines unique to the source.txt file (see man comm(1) for how comm works). Following that is an example of searching with an arbitrary grep pattern while using the common lines as a "filter" to exclude them.

% cat source.txt                           
3120987654
3106982658
3420787642
3210957659
3320987654
3520987654
3520987754
3520987954
3520988654
3520987444

% cat extracted.txt 
3120987654
3106982658
3420787642
3210957659
3320987654

% comm -2 -3 source.txt extracted.txt  # show lines only in source.txt
3520987754
3520987954
3520988654
3520987444

comm selects or rejects lines common to two files. The utility conforms to IEEE Std 1003.2-1992 ("POSIX.2"). We can save its output for use with grep:

% comm -1 -2 source.txt extracted.txt | sort > common.txt
% grep -v -f common.txt source.txt | grep -E ".*444$"

This greps the source.txt file, excluding lines common to source.txt and extracted.txt; the pipe ( | ) then greps these "filtered" results for a new record to extract (in this case, a line or lines ending in "444"). If the files are very large, or if you want to preserve the order of the numbers in the original file and the extracted data, then the problem is more complex and the answer would need to be more elaborate.
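The same filter-then-select pipeline can be expressed in Python with a set difference in place of common.txt, which also avoids comm's sorted-input requirement. A rough sketch, carrying over the file names and the "444" criterion from the shell example above:

```python
import re

def filter_and_select(source_path, extracted_path, pattern):
    """Python analogue of:
         comm -1 -2 source.txt extracted.txt | sort > common.txt
         grep -v -f common.txt source.txt | grep -E 'PATTERN'"""
    with open(source_path) as f:
        source = [line.strip() for line in f]
    with open(extracted_path) as f:
        extracted = {line.strip() for line in f}

    # Like grep -v -f common.txt: drop lines already extracted,
    # preserving the original order of source.txt.
    remaining = [n for n in source if n not in extracted]

    # Like the final grep -E: keep only lines matching the pattern.
    return [n for n in remaining if re.search(pattern, n)]
```

With the sample source.txt and extracted.txt shown above, filter_and_select('source.txt', 'extracted.txt', r'444$') would return ['3520987444'].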

See my other answer for the start of a simplistic alternative approach that uses perl.

I think you're not asking for unique values; rather, you want all the new values added since the last time you looked at the file?

Assume the BigFile gets new data all the time.

We want DailyFilemm_dd_yy to contain the new numbers received during the previous 24 hours.

This script will do what you want. Run it each day.

BigFile=bigfile
DailyFile=dailyfile
today=$(date +"%m_%d_%Y")
# Get the month, day, year for yesterday (BSD/macOS date syntax;
# on GNU/Linux use: yesterday=$(date -d yesterday +"%m_%d_%Y")).
yesterday=$(date -jf "%s" $(($(date +"%s") - 86400)) +"%m_%d_%Y")

cp "$BigFile" "$BigFile$today"
# Note: comm expects both input files to be sorted.
comm -23 "$BigFile" "$BigFile$yesterday" > "$DailyFile$today"
rm "$BigFile$yesterday"

comm -23 shows the lines that appear only in the first file.

Example of comm:

#values added to big file
echo '111
222
333' > big

cp big yesterday

# New values added to big file over the day
echo '444
555' >> big

# Find out what values were added.
comm -23 big yesterday > today
cat today

Output:

444
555

Lazyish perl approach.

Just write your own selection() subroutine to replace grep {/.*444$/} ;-)

#!/usr/bin/env perl  
use strict; use warnings; use autodie;                      
use 5.16.0 ; 

use Tie::File;        
use Array::Utils qw(:all); 

tie my @source, 'Tie::File', 'source.txt' ;               
tie my @extracted, 'Tie::File', 'extracted.txt' ;

# Find the intersection                                                   
my @common = intersect(@source, @extracted);                      

say "Numbers already extracted"; 
say for @common;

untie @source;
untie @extracted;

Once the source.txt file has been updated you could select from it:

#!/usr/bin/env perl  
use strict; use warnings; use autodie;              
use 5.16.0 ; 

use Tie::File;        
use Array::Utils qw(:all); 

tie my @source, 'Tie::File', 'source.txt' ;               
tie my @extracted, 'Tie::File', 'extracted.txt' ;

# Find the intersection                                                   
my @common = intersect(@source, @extracted);                      

# Select from source.txt excluding numbers already selected:
my @newselect = array_minus(@source, @common);
say "new selection:";
# grep returns a list; the parentheses around $selection put the
# assignment in list context (only the first match is kept here).
my ($selection) = grep {/.*444$/} @newselect; 
push @extracted, $selection ;
say "updated extracted.txt" ; 

untie @source;
untie @extracted;

This uses two modules ... succinct and idiomatic versions welcome!
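For comparison, the intersect()/array_minus() logic of the Perl version maps directly onto Python list comprehensions with a set for fast membership tests. A minimal sketch — the function names mirror Array::Utils for readability but are plain Python, and the sample data is taken from the question:

```python
def intersect(a, b):
    # Lines present in both lists, in the order of the first list.
    bset = set(b)
    return [x for x in a if x in bset]

def array_minus(a, b):
    # Lines of a that do not appear in b, order preserved.
    bset = set(b)
    return [x for x in a if x not in bset]

source = ['3120987654', '3106982658', '3520987754', '3520987444']
extracted = ['3120987654', '3106982658']

common = intersect(source, extracted)     # numbers already extracted
newselect = array_minus(source, common)   # numbers still available to select
```

Using a set for membership keeps each lookup O(1), so this stays fast even when source.txt grows large — the same concern the Perl version's Tie::File approach sidesteps by not loading the file into memory.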
