
Find all phrases that contain at least one word from a list file and save them to new files

I have 2 files:

- phrases.txt
- words_to_erase.txt

I need a way to find all the phrases in 'phrases.txt' that contain at least 1 word from the 'words_to_erase.txt' file, and then create the following:

new_phrases.txt: the new file, without any of the phrases found in the previous step.

erased_phrases: a file containing all the phrases that were erased to create the 'new_phrases.txt' file.

I can use either Python or Linux command-line tools for this.

Note:

phrases.txt is a file that contains 100k phrases, 1 phrase per line

words_to_erase.txt is a file that contains 80 different words, 1 word per line.

I tried using Linux:

grep -f words_to_erase.txt phrases.txt > newfile.txt

This way I only get a file with the new phrases, not the erased phrases. I don't think this is case-insensitive, though; I tried adding -i, but it didn't seem to work.

I tried Python with something like:

# Load the words to erase, one per line, skipping blank lines
with open("words_to_erase.txt", "rt") as in_file:
    contents = [word.strip() for word in in_file if word.strip()]
print(contents)

sourcefile = "phrases.txt"
filename2 = "newfile.txt"

def fixup(filename):
    print("fixup", filename)
    with open(filename) as fin, open(filename2, "w") as fout:
        for line in fin:
            # Keep the line only if none of the words occurs in it
            # (case-sensitive substring test)
            if not any(item in line for item in contents):
                fout.write(line)

fixup(sourcefile)
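
A side note on the "item in line" test used above: it is a case-sensitive substring check, so it does not behave like grep -i -w. The following is only a minimal sketch of the difference; the helper names contains_substring and contains_whole_word are made up for illustration and do not come from the post.

```python
import re


def contains_substring(line, words):
    """Substring test: true if any word occurs anywhere in the line."""
    return any(word in line for word in words)


def contains_whole_word(line, words):
    """Case-insensitive whole-word test, similar in spirit to grep -i -w."""
    return any(
        re.search(r"\b" + re.escape(word) + r"\b", line, re.IGNORECASE)
        for word in words
    )


words = ["cat"]
print(contains_substring("concatenate strings", words))   # True: "cat" is inside "concatenate"
print(contains_whole_word("concatenate strings", words))  # False: not a separate word
print(contains_substring("The Cat sat", words))           # False: the case differs
print(contains_whole_word("The Cat sat", words))          # True
```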

I used the script below to filter a file containing 400k phrases (phrases.txt), erasing every line that contained a word from a file of 1,000 words (words_to_erase.txt). The script took about 15 minutes to finish, but with 100% accuracy.

Note: when I used grep -f words_to_erase.txt phrases.txt, grep skipped many phrases that included words from the words_to_erase.txt file. This bash script searches word by word and outputs the correct number of phrases.

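  1. Create the script: copy the script below, paste it into a text editor, and save it under any name with the .sh extension.

     #!/bin/bash
     # Start from a copy of all phrases, then filter that copy one word at a time
     cp phrases.txt newfile.txt
     while IFS= read -r line
     do
         [ -z "$line" ] && continue   # skip blank lines in the word list
         echo "$line"
         # -i: ignore case, -w: whole words only, -v: keep the non-matching lines
         grep -iwv -- "$line" newfile.txt > newfile.tmp
         mv newfile.tmp newfile.txt
     done < words_to_erase.txt
     # Sort the surviving phrases and drop duplicates
     sort -u newfile.txt > final_file.txt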

  2. Make the script executable:

     chmod +x $name_of_script.sh

  3. Run the script:

     ./$name_of_script.sh
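
For the Python route mentioned in the question, here is a minimal sketch, not a tested replacement for the script above. It assumes both files are UTF-8 text and that the words consist of letters, digits, or underscores (so that \b word boundaries behave as expected); the function names build_pattern and split_phrases are hypothetical. It compiles all words into one case-insensitive regular expression and makes a single pass over the phrases, writing each line either to the kept file or to the erased file.

```python
import re
from pathlib import Path


def build_pattern(words):
    """Compile one case-insensitive regex matching any of the words as a whole word."""
    escaped = [re.escape(w) for w in words if w]
    if not escaped:
        # No words to erase: return a pattern that can never match
        return re.compile(r"(?!)")
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def split_phrases(phrases_path, words_path, kept_path, erased_path):
    """Copy lines without any listed word to kept_path and the rest to erased_path.

    Returns a (kept, erased) tuple with the number of lines written to each file.
    """
    raw = Path(words_path).read_text(encoding="utf-8")
    words = [w.strip() for w in raw.splitlines()]
    pattern = build_pattern(words)
    kept = erased = 0
    with open(phrases_path, encoding="utf-8") as fin, \
         open(kept_path, "w", encoding="utf-8") as fkept, \
         open(erased_path, "w", encoding="utf-8") as ferased:
        for line in fin:
            if pattern.search(line):
                ferased.write(line)
                erased += 1
            else:
                fkept.write(line)
                kept += 1
    return kept, erased
```

Calling split_phrases("phrases.txt", "words_to_erase.txt", "new_phrases.txt", "erased_phrases") would then produce the two files described in the question. Whether one combined pattern is faster than looping over the words one at a time depends on the data, so it is worth timing on the real files.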
