I have 2 files:
- phrases.txt
- words_to_erase.txt
I need a way to find all the phrases in 'phrases.txt' that contain at least 1 word from 'words_to_erase.txt' and create the following:
new_phrases.txt: the new file, without any of the phrases found in the previous step.
erased_phrases: a file containing all the phrases that were erased to create 'new_phrases.txt'.
I can use either Python or Linux command-line tools for this.
Note:
phrases.txt is a file that contains 100k phrases, 1 phrase per line
words_to_erase.txt is a file that contains 80 different words, 1 word per line.
I tried using Linux:
grep -f words_to_erase.txt phrases.txt > newfile.txt
This way I only get a file with the new phrases, without the erased phrases. I don't think this is case-insensitive, though: I tried adding -i and it doesn't seem to work.
I tried python with something like:
# Load the words to erase, one per line, lowercased for case-insensitive matching
with open("words_to_erase.txt") as in_file:
    words = {line.strip().lower() for line in in_file if line.strip()}

sourcefile = "phrases.txt"
filename2 = "newfile.txt"

def fixup(filename):
    print("fixup", filename)
    with open(filename) as fin, open(filename2, "w") as fout:
        for line in fin:
            # Keep the phrase only if none of its whitespace-separated words is in the erase list
            if not any(word in words for word in line.lower().split()):
                fout.write(line)

fixup(sourcefile)
I used this script on a file containing 400k phrases (phrases.txt) to erase every line that contained a word from a file of 1,000 words (words_to_erase.txt). The script took about 15 minutes to finish, but with 100% accuracy.
Note: when I used grep -f words_to_erase.txt phrases.txt, grep skipped many phrases that included words from words_to_erase.txt. This bash script searches word by word and outputs the correct number of phrases.
1.- Create the script: copy the script below, paste it into a text editor, and save it with any name and the extension .sh
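#!/bin/bash
# Start from the full list, then filter it once per word.
cp phrases.txt final_file.txt
while read -r line
do
    # Skip blank lines in the word list
    [ -z "$line" ] && continue
    echo "$line"
    # -i: ignore case, -w: whole words only, -v: keep non-matching lines, -F: literal string
    grep -iwvF -- "$line" final_file.txt > newfile.txt
    mv newfile.txt final_file.txt
done < words_to_erase.txt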
2.- Make the script executable:
chmod +x $name_of_script.sh
3.- Run the script:
./$name_of_script.sh
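For comparison, here is a minimal Python sketch of the same idea that also produces the second file the question asks for. It is not the script above, just one possible approach: it builds a single case-insensitive, whole-word regular expression from the word list and makes one pass over the phrases. The function names split_phrases and main are hypothetical, and erased_phrases.txt is an assumed file name (the question gives it without an extension).

```python
import re
from pathlib import Path


def split_phrases(phrases, words):
    """Return (kept, erased): phrases without / with any of the given words."""
    # Ignore blank entries so an empty line in the word list cannot match everything.
    cleaned = [w.strip() for w in words if w.strip()]
    if not cleaned:
        return list(phrases), []
    # One alternation of escaped words; \b restricts matches to whole words.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in cleaned) + r")\b",
        re.IGNORECASE,
    )
    kept, erased = [], []
    for phrase in phrases:
        (erased if pattern.search(phrase) else kept).append(phrase)
    return kept, erased


def main():
    # Input names follow the question; erased_phrases.txt is an assumed name.
    phrases = Path("phrases.txt").read_text().splitlines()
    words = Path("words_to_erase.txt").read_text().splitlines()
    kept, erased = split_phrases(phrases, words)
    Path("new_phrases.txt").write_text("".join(p + "\n" for p in kept))
    Path("erased_phrases.txt").write_text("".join(p + "\n" for p in erased))
```

Calling main() from the directory that holds the two input files writes both output files. Note that \b only works as expected for words that begin and end with a letter, digit, or underscore, so entries such as "c++" would need different handling.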