
Remove lines from a file that do not appear in another file, error

I have two files, similar to the ones below:

File 1 contains phenotype information; the first column is the individual ID. The original file has 400 rows:

215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745

File 2 contains SNP information. The original file has 400 lines and 42,000 characters per line:

215          20211111201200125201212202220111202005111102
222          20111011212200025002211001111120211015112111
216          20210005201100025210212102210212201005101001
223          20222120201200125202202102210121201005010101
217          20211010202200025201202102210121201005010101
218          02022000252012021022101212010050101012021101

I need to remove from file 2 the individuals that do not appear in file 1. For example, the expected result is:

215          20211111201200125201212202220111202005111102
222          20111011212200025002211001111120211015112111
216          20210005201100025210212102210212201005101001
223          20222120201200125202202102210121201005010101 

I could do this with this code:

 awk 'NR==FNR {a[$1]; next} $1 in a {print $0}' file1 file2 > file3

However, when I run my main analysis with the generated file, the following errors appear:

*** Error in `./airemlf90': free(): invalid size: 0x00007f5041cc2010 ***
*** Error in `./postGSf90': free(): invalid size: 0x00007fec4a04f010 ***

airemlf90 and postGSf90 are the programs I run. When I use the original file, this problem does not occur. Is the command I used to remove individuals adequate? One detail I did not mention is that some individuals have 4-character IDs. Could that be causing the error?

Thanks

I wrote a small Python script in a few minutes. I tested it with 42,000-character lines and it works fine.

import sys

# rudimentary argument parsing: key file, file to filter, output file

file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()

# first read file 1, discard all fields except the first one (the key)
with open(file1, "r") as f1:
    for l in f1:
        # split() with no argument splits on runs of whitespace and ignores
        # leading/trailing whitespace, same as awk fields
        toks = l.split()
        if toks:   # robustness against empty lines
            present.add(toks[0])

# now read the second one and write to the third one only if the id is in the set

with open(file2, "r") as f2:
    with open(file3, "w") as f3:
        for l in f2:
            toks = l.split()
            if toks and toks[0] in present:
                f3.write(l)

(Install Python first if it is not already present.)

Call my sample script mytool.py and run it like this:

python mytool.py file1.txt file2.txt file3.txt
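
Since the question asks whether the 4-character IDs could matter, a small check of the generated file may help narrow things down. The sketch below is only a diagnostic: it reports IDs from file 1 that are missing from the output, duplicated IDs, the distinct line lengths, and the columns at which the last field (the genotype string) starts. Whether airemlf90 or postGSf90 actually require aligned columns or matching ID lists is an assumption here, not something this thread confirms, and the function names are hypothetical.

```python
from collections import Counter


def first_field(line):
    """Return the first whitespace-separated field of a line, or None if blank."""
    toks = line.split()
    return toks[0] if toks else None


def genotype_start(line):
    """Return the 0-based column where the last field (the genotype string) starts."""
    stripped = line.rstrip()
    return len(stripped) - len(stripped.split()[-1])


def check_filtered(key_lines, filtered_lines):
    """Summarize properties of the filtered lines that might matter downstream."""
    keys = {k for k in map(first_field, key_lines) if k is not None}
    # Ignore blank lines and drop line endings before measuring lengths.
    rows = [l.rstrip("\r\n") for l in filtered_lines if l.strip()]
    ids = [first_field(l) for l in rows]
    return {
        # IDs present in file 1 but absent from the filtered output
        "missing_ids": sorted(keys - set(ids)),
        # IDs that occur on more than one output line
        "duplicate_ids": sorted(i for i, n in Counter(ids).items() if n > 1),
        # More than one value means the lines are not all the same width
        "line_lengths": sorted({len(l) for l in rows}),
        # More than one value means the genotype strings are not aligned
        "genotype_start_columns": sorted({genotype_start(l) for l in rows}),
        # Windows line endings left in the data
        "has_carriage_returns": any(
            l.rstrip("\n").endswith("\r") for l in filtered_lines
        ),
    }
```

To use it, read both files into lists of lines (for example with `readlines()`) and print the returned dictionary. If `genotype_start_columns` holds more than one value, the 3-character and 4-character IDs are followed by the same number of spaces, so the genotype strings begin in different columns.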

To process several files at once from a bash script (replacing the original awk command), wrap the call in a loop. This is easy, although not optimal, because the whole batch could be handled in a single Python run:

<whatever the for loop you need>; do
  python mytool.py "$1" "$2" "$3"
done

exactly like you would call awk with 3 files.
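
For the "all in one Python run" variant mentioned above, a minimal sketch could look like the following. It uses the same logic as the script, wrapped in functions so several file triples can be processed without restarting the interpreter. The names `read_keys`, `filter_file`, and `filter_many` are hypothetical, chosen only for this illustration.

```python
def read_keys(path):
    """Collect the first whitespace-separated field of every non-empty line."""
    keys = set()
    with open(path, "r") as f:
        for line in f:
            toks = line.split()
            if toks:
                keys.add(toks[0])
    return keys


def filter_file(key_path, data_path, out_path):
    """Copy the lines of data_path whose first field appears in key_path.

    Returns the number of lines written to out_path.
    """
    present = read_keys(key_path)
    kept = 0
    with open(data_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            toks = line.split()
            if toks and toks[0] in present:
                dst.write(line)  # the line is written unchanged
                kept += 1
    return kept


def filter_many(jobs):
    """Run filter_file for each (key_path, data_path, out_path) triple."""
    return [filter_file(*job) for job in jobs]
```

Calling `filter_many([("file1.txt", "file2.txt", "file3.txt")])` returns a list with the number of lines kept for each triple, which gives a quick way to confirm that the output has as many individuals as expected.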
