I am parsing a tab-delimited text file with the script below, and it fails with "list index out of range". It works for some files but not for others. I need help debugging this script.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
import os
import re
from collections import OrderedDict
from numpy import unique
def main():
    if len(sys.argv) < 2:
        print("usage: python3 {} <bacmat_out_table> > output".format(sys.argv[0]))
        sys.exit(1)
    bacmat_out = os.path.abspath(sys.argv[1])
    class_sum = OrderedDict()
    with open(bacmat_out) as fh:
        for line in fh:
            if re.search(r"^\s*$|^Query", line):
                continue
            elif len(line) == 0:
                break
            else:
                fields = line.strip().split("\t")
                compounds = fields[6]
                if re.search(r'\[.*\]', compounds):
                    compounds_class = re.findall('\[class:\s?(.+?)\]', compounds)
                    compounds_class = list(unique(compounds_class))
                    if len(compounds_class) > 0:
                        for i in compounds_class:
                            class_sum.setdefault(i, 0)
                            class_sum[i] += 1
                else:
                    compounds = compounds.strip('"')
                    compounds = compounds.strip("'")
                    compounds = compounds.strip()
                    class_sum.setdefault(compounds, 0)
                    class_sum[compounds] += 1
    print("Class\tCount")
    for key in sorted(class_sum.keys()):
        print(key, class_sum[key], sep="\t")
if __name__ == '__main__':
    main()
File for which it works:
Query Subject Gene Description Organism Location Compounds Percent identity Match length E-value Score per length
BAC0001|abeM|tr|Q5FAM9|Q5FAM9_ACIBA gi|445995506|ref|WP_000073361.1| abeM "H-coupled multidrug efflux pump. Confers resistance to Antibiotics such as quinolones and aminoglycosides and antibacterial biocides such as dyes, QACs. " Acinetobacter baumannii Chromosome "4,6-diamidino-2-phenylindole (DAPI) [class: Diamidine], Triclosan [class: Phenolic compounds], Acriflavine [class: Acridine], Hoechst 33342 [class: Bisbenzimide], Rhodamine 6G [class: Xanthene], Ethidium Bromide [class: Phenanthridine], Tetraphenylphosphonium (TPP) [class: Quaternary Ammonium Compounds (QACs)]" 100.0 448 1.3e-243 1.87857142857143
BAC0002|abeS|tr|Q2FD83|Q2FD83_ACIBA gi|446043276|ref|WP_000121131.1| abeS "Disinfectant resistance protein abeS. It can confer resistance to antibiotics such as erythromycin, novomycin, amikacin, ciprofloxacin, norfloxacin, tetracycline, trimethoporin and dyes, QACs etc. " Acinetobacter calcoaceticus/baumannii complex Chromosome "Benzylkonium Chloride (BAC) [class: Quaternary Ammonium Compounds (QACs)], Ethidium Bromide [class: Phenanthridine], Acriflavine [class: Acridine], Chlorhexidine [class: Biguanides], Pyronin Y [class: Xanthene], Rhodamine 6G [class: Xanthene], Methyl Viologen [class: Paraquat], Tetraphenylphosphonium (TPP) [class: Quaternary Ammonium Compounds (QACs)], 4,6-diamidino-2-phenylindole (DAPI) [class: Diamindine], Acridine Orange [class: Acridine], Sodium Dodecyl Sulfate (SDS) [class: Organo-sulfate], Sodium Deoxycholate (SDC) [class: Acid], Crystal Violet [class: Triarylmethane], Cetrimide (CTM) [class: Quaternary Ammonium Compounds (QACs)], Cetylpyridinium Chloride (CPC) [class: Quaternary Ammonium Compounds (QACs)], Dequalinium [class: Quaternary Ammonium Compounds (QACs)]" 100.0 109 9.5e-52 1.85504587155963
BAC0003|acn|tr|O53166|O53166_MYCTU gi|489995855|ref|WP_003898889.1| acn "Aconitate hydratase, Acn" Mycobacterium Chromosome Iron (Fe) 100.0 943 0.0e+00 2.03467656415695
BAC0004|acr3|tr|B5LX01|B5LX01_CAMJU gi|488947840|ref|WP_002858915.1| acr3 "Arsenical-resistance membrane transporter; part of the an arsenic (ars) four-gene operon, containing genes encoding a putative membrane permease (ArsP), a transcriptional repressor (ArsR), an arsenate reductase (ArsC) and an arsenical-resistance membrane transporter (Acr3)" Campylobacter Chromosome Arsenic (As) 100.0 347 4.2e-178 1.7971181556196
BAC0005|acrA|sp|P0AE06|ACRA_ECOLI gi|481023858|ref|WP_001295324.1| acrA "AcrAB is a drug efflux protein with a broad substrate specificity. It can confer resistant to ampicillin, chloramphenicol as well. It requires TolC outer memberane protein to function and form the AcrAB-TolC efflux operon. AcrAB-TolC is a drug efflux protein complex with broad substrate specificity that uses the proton motive force to export substrates." Proteobacteria Chromosome "Acriflavine [class: Acridine], Phenol [class: Phenolic compounds], Triclosan [class: Phenolic compounds], p-xylene [class: Aromatic hydrocarbons], Cyclohexane [class: Cycloalkane], Pentane [class: Alkane]" 100.0 397 4.5e-216 1.88916876574307
BAC0006|acrB|sp|P31224|ACRB_ECOLI gi|447055213|ref|WP_001132469.1| acrB "AcrAB is a drug efflux protein with a broad substrate specificity. It can confer resistant to ampicillin, chloramphenicol as well.It requires TolC outer memberane protein to function and form the AcrAB-TolC efflux operon. AcrAB-TolC is a drug efflux protein complex with broad substrate specificity that uses the proton motive force to export substrates." Enterobacteriaceae Chromosome "Acriflavine [class: Acridine], Phenol [class: Phenolic compounds], Triclosan [class: Phenolic compounds], p-xylene [class: Aromatic hydrocarbons], Cyclohexane [class: Cycloalkane], Pentane [class: Alkane]" 100.0 1049 0.0e+00 1.89733079122974
BAC0007|acrC|tr|Q1LMP2|Q1LMP2_RALME gi|499835702|ref|WP_011516436.1| acrC Cation/multidrug efflux system outer membrane porin arcC. Cupriavidus metallidurans Chromosome Acriflavine [class: Acridine] 100.0 486 2.8e-268 1.90061728395062
BAC0563|acrD|tr|Q8ZN77|Q8ZN77_SALTY gi|447185822|ref|WP_001263078.1| acrD Acriflavine resistance protein D; participates in the efflux of aminoglycosides. It confers resistance to a variety of these substances. It contributes to copper and zinc resistance in Salmonella. Salmonella enterica Chromosome "Copper (Cu), Zinc (Zn)" 100.0 1037 0.0e+00 1.90781099324976
File for which it does not work:
Query Subject Gene Description Organism Location Compounds Percent identity Match length E-value Score per length
ERZ1645190.265-NODE-265-length-2544-cov-3.002812_2 gi|1083034424|gb|OGD35356.1| copB Copper (Cu) Candidatus Atribacteria bacterium RBG_16_35_8 copper-translocating P-type ATPase, partial
80.7 135 2.40e-65 1.56296296296296
ERZ1645190.6825-NODE-6825-length-778-cov-1.752420_2 gi|1133586191|gb|APW63482.1| actP Copper (Cu), Sodium acetate [class: Acetate] Paludisphaera borealis Copper-transporting P-type ATPase
81.4 161 8.72e-78 1.5527950310559
ERZ1645190.14825-NODE-14825-length-656-cov-1.279534_1 gi|1084819878|gb|OGQ54449.1| arrA Arsenic (As) Deltaproteobacteria bacterium RIFCSPLOWO2_02_56_12 dehydrogenase
90.5 63 1.54e-32 1.98412698412698
ERZ1645190.15611-NODE-15611-length-649-cov-1.912458_1 gi|1082733223|gb|OGA52347.1| arrA Arsenic (As) Betaproteobacteria bacterium RIFCSPLOWO2_12_FULL_62_13 dehydrogenase
85.6 216 2.42e-131 1.81018518518519
Running the script results in the error below:
python bacmet_class_summary.py test_bacmet.table > 1.txt
Traceback (most recent call last):
  File "bacmet_class_summary.py", line 52, in <module>
    main()
  File "bacmet_class_summary.py", line 33, in main
    compounds = fields[6]
IndexError: list index out of range
This is the error I get when I run the script on the second example.
At least one line in your file has fewer than 7 fields when split on '\t', so fields[6] does not exist for that line. Add print(line) just before compounds = fields[6] to see which line it is.
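As a rough sketch of that check, the snippet below lists every line that would make fields[6] fail, instead of printing lines one at a time. The helper name find_short_lines and the sample rows are hypothetical and only for illustration. It skips blank and header lines the same way the script does, and splits with line.strip().split("\t") so the field counts match what the script sees.

```python
def find_short_lines(lines, min_fields=7, sep="\t"):
    """Return (line_number, field_count, text) for each data line with too few fields."""
    short = []
    for lineno, line in enumerate(lines, start=1):
        # Skip blank lines and the header row, as the original script does.
        if not line.strip() or line.startswith("Query"):
            continue
        fields = line.strip().split(sep)
        if len(fields) < min_fields:
            short.append((lineno, len(fields), line.strip()))
    return short


# Hypothetical sample: the second record is broken across two lines,
# so neither half has the 7 fields that fields[6] requires.
sample = [
    "Query\tSubject\tGene\tDescription\tOrganism\tLocation\tCompounds\n",
    "q1\ts1\tacn\tdesc\tMycobacterium\tChromosome\tIron (Fe)\n",
    "q2\ts2\tcopB\tCopper (Cu)\n",
    "80.7\t135\t2.40e-65\t1.56\n",
]

for lineno, count, text in find_short_lines(sample):
    print("line {}: {} fields: {!r}".format(lineno, count, text))
```

With a real file, the same function can be called on the open file handle, for example find_short_lines(open(sys.argv[1])). If the short lines turn out to be records split across two lines, the input file needs repairing; a length check such as if len(fields) < 7: continue would only hide the problem by dropping those records.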