I have a tabular file that looks like this:
query_name KEGG_KOs
PROKKA_00013 NaN
PROKKA_00015 bactNOG[38]
PROKKA_00017 NA|NA|NA
PROKKA_00019 K00240
PROKKA_00020 K00246
PROKKA_00022 K02887
I'm trying to create a script to go through and delete the entire row if column 2 ('KEGG_KOs') does not begin with 'K0'. I'm trying to create an output of:
query_name KEGG_KOs
PROKKA_00019 K00240
PROKKA_00020 K00246
PROKKA_00022 K02887
Previous responses have referred people to pandas DataFrame but I've had no luck using those responses to help. Any would be greatly appreciated, cheers.
I had tried (but this only isolates a specific K0 line.
df = pd.read_csv("eggnog.txt", delimiter="\t", names=["#query_name", "KEGG_KOs"])
print(df.loc[df['KEGG_KOs'] == 'K00240'])
Use boolean indexing
with startswith
or contains
with regex
for start of string ^
and parameter na=False
, because missing values:
df1 = df[df['KEGG_KOs'].str.startswith('K0', na=False)]
print (df1)
query_name KEGG_KOs
3 PROKKA_00019 K00240
4 PROKKA_00020 K00246
5 PROKKA_00022 K02887
Or:
df1 = df[df['KEGG_KOs'].str.contains('^K0', na=False)]
you can use open as read and then write. assume the original file saved as old.txt and the updated one will be saved as new.txt
text = ''
with open("old.txt", 'r') as org:
next(org)
for line in org:
data = line.strip().split()
if data[1].startswith("K0"):
text = text + data[0] + " "+ data[1] + '\n'
w = open('new.txt', 'w')
w.write("query_name"+" "+ "KEGG_KOs\n")
w.write(text)
w.close()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.