Python script for protein databases that works in OSX and raspbian is not working in Ubuntu

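For some reason, my Python script works on Mac OS X and Raspbian Buster (yes, I tried it on a Raspberry Pi in a moment of desperation), but it does not work on Ubuntu 18, which is what I use on my main PC. I even tried a fresh install of Ubuntu MATE 20 on another PC, and it still does not work.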

Here is the script:

import sys
import csv
from http.client import IncompleteRead
import pandas as pd
from Bio import Entrez
Entrez.email = ""

    

# From the WP accessions, get the corresponding assembly, NC IDs and strain names. Write a csv table with all of these as the final data table,
#+ plus a table with WPs and Assembly IDs for inputting in FLAG

list_of_accession = []
with open (sys.argv[1], 'r') as csvfile:
    efetchin=csv.reader(csvfile, delimiter = ',')
    for row in efetchin:
        list_of_accession.append(str(row[0]))
        
with open('efetch_output.txt', mode = 'w') as efetch_output:
    efetch_output = csv.writer(efetch_output, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    efetch_output.writerow(['ID','Source', 'Nucleotide Accession', 'Start', 'Stop', 'Strand', 'Protein', 'Protein Name', 'Organism', ' Strain', 'Assembly'])

input_handle = Entrez.efetch(db="protein", id= list_of_accession, rettype="ipg", retmode="tsv")
for line in input_handle:
    print(line, file=open('efetch_output.txt','a'))
input_handle.close()
#process file in pandas
file_name = "efetch_output.txt"
file_name_output = "final_output.tsv"
df = pd.read_csv(file_name, sep="\t", low_memory=False)
# Get names of indexes for which rows have to be dropped
indexNames = df[ df['Source'] == 'INSDC'].index
# Delete these row indexes from dataFrame
df.drop(indexNames , inplace=True)
#rearrange table columns
df = df[['ID', 'Source', 'Nucleotide Accession', 'Protein', 'Protein Name', 'Start', 'Stop', 'Strand', 'Organism',' Strain', 'Assembly']]
#Sort table on Assembly number ignoring GCF_
df['sort'] = df['Assembly'].str.extract('(\d+)', expand=False).astype(str)
df.sort_values('sort',inplace=True, ascending=True)
df = df.drop('sort', axis=1)
#drop all duplicates that're similar in indicated subset fields
df3=df.drop_duplicates(subset=['Start', 'Stop', 'Strand', 'Organism',' Strain', 'Assembly'],keep='first')
#sorts dataframe alphabetically by Organism and writes to csv
df3.sort_values(by = "Organism", axis=0, ascending=True, inplace=False).to_csv("final_parsed_output.tsv", "\t", index=False)
#get WP_X and GCF_X IDs in a tsv to input in FLAGs
new_dataframe1 = df3[['Assembly', 'Protein']]
new_dataframe2 = df3[['Organism',' Strain', 'Assembly', 'Protein']]
new_dataframe1.sort_values(by = "Protein", axis=0, ascending=True, inplace=False).to_csv('flags_input.tsv', '\t', header=False, columns = ['Assembly', 'Protein'])
new_dataframe2.sort_values(by = "Organism", axis=0, ascending=True, inplace=False).to_csv('flags_input_wstrains.tsv', '\t', header=False, columns = ['Organism',' Strain', 'Assembly', 'Protein'])





print ('program finished')
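
The script above writes the header inside a `with` block and then reopens `efetch_output.txt` once per response line. A minimal sketch of writing the header and the data through a single handle is shown here. `write_ipg_table` is a hypothetical helper, not part of Biopython, and the idea that the Entrez handle may yield either `bytes` or `str` lines depending on the installed version is an assumption that the question does not confirm.

```python
import csv
import io

# Column names copied from the script above (note the leading space in ' Strain').
HEADER = ["ID", "Source", "Nucleotide Accession", "Start", "Stop", "Strand",
          "Protein", "Protein Name", "Organism", " Strain", "Assembly"]


def write_ipg_table(lines, out):
    """Write HEADER plus every non-empty line from `lines` to `out`.

    Hypothetical helper: `lines` is any iterable of str or bytes
    (for example the handle returned by Entrez.efetch), `out` is a
    text-mode file object. Returns the number of data lines written.
    """
    writer = csv.writer(out, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(HEADER)
    count = 0
    for line in lines:
        # Assumption: some environments hand back bytes instead of str.
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.rstrip("\r\n")
        if line:  # skip blank lines instead of writing them
            out.write(line + "\n")
            count += 1
    return count


# In-memory stand-in for the Entrez response, so no network access is needed.
fake_response = [b"1\tRefSeq\n", "\n", "2\tINSDC\r\n"]
buffer = io.StringIO()
written = write_ipg_table(fake_response, buffer)
```

In the real script, the same call could take `input_handle` as `lines` and a file opened with `open('efetch_output.txt', 'w', newline='')` as `out`.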

I don't know whether I can upload a CSV here as an example you could use, but the input files are basically lists of proteins in a CSV, like this:

WP_047566605.1 WP_043586512.1 WP_086526429.1 WP_043669791.1 WP_086513259.1 WP_086518190.1 WP_053774664.1 WP_012298127.1 WP_063071144.1 WP_012038522.1 WP_066595335.1 WP_088456184.1 WP_058743206.1 WP_042537210.1 WP_058724426.1

The error I get in Ubuntu MATE 20 is:

jj@p4:~/Documents/Bioinformatica/Bioinformatic/August/Codes/Etna$ python3 etna.py JJTEST.csv 
/usr/local/lib/python3.8/dist-packages/pandas/core/computation/expressions.py:68: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  return op(a, b)
Traceback (most recent call last):
  File "etna.py", line 44, in <module>
    df['sort'] = df['Assembly'].str.extract('(\d+)', expand=False).astype(str)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 5126, in __getattr__
    return object.__getattribute__(self, name)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/strings.py", line 2100, in __init__
    self._inferred_dtype = self._validate(data)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/strings.py", line 2157, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!

I don't fully understand what the problem was, but I changed the output file from txt to csv and changed the tsv str to float. Now it works.
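
For reference, the `AttributeError` in the traceback is what pandas raises when `.str` is used on a column whose dtype is not string-like. A minimal sketch follows, assuming the `Assembly` column was parsed as all-missing floats on the failing machine. That assumption is not confirmed by the question, and the two-row table is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical table in which every Assembly value is missing,
# so pandas infers a float64 column of NaN for it.
raw = "ID\tSource\tAssembly\n1\tRefSeq\t\n2\tRefSeq\t\n"

df = pd.read_csv(io.StringIO(raw), sep="\t")
try:
    df["Assembly"].str.extract(r"(\d+)", expand=False)
    failed = False
except AttributeError:
    # Same kind of failure as in the traceback above.
    failed = True

# Reading every column as text keeps the .str accessor usable
# even when the values are missing.
df_text = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)
digits = df_text["Assembly"].str.extract(r"(\d+)", expand=False)
```

If the column really is empty or malformed, this only avoids the exception. The underlying file would still need to be checked, for example by printing `df.dtypes` and `df.head()` right after `pd.read_csv`.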
