Python script for the protein database works on OSX and Raspbian but not on Ubuntu

Question

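For some reason, my Python script works on Mac OS X and Raspbian Buster (yes, I tried it on a Raspberry Pi in a moment of desperation), but it does not work on Ubuntu 18, which is what I use on my main PC. I even tried a fresh install of Ubuntu MATE 20 on another PC, and it still does not work.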

Here is the script:

import sys
import csv
from http.client import IncompleteRead
import pandas as pd
from Bio import Entrez
Entrez.email = ""

    

# From the WP accessions, get the corresponding assembly, NC IDs and strain names. Write a csv table with all of these as the final data table,
# plus a table with WPs and Assembly IDs for inputting in FLAG

list_of_accession = []
with open (sys.argv[1], 'r') as csvfile:
    efetchin=csv.reader(csvfile, delimiter = ',')
    for row in efetchin:
        list_of_accession.append(str(row[0]))
        
with open('efetch_output.txt', mode = 'w') as efetch_output:
    efetch_output = csv.writer(efetch_output, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    efetch_output.writerow(['ID','Source', 'Nucleotide Accession', 'Start', 'Stop', 'Strand', 'Protein', 'Protein Name', 'Organism', ' Strain', 'Assembly'])

input_handle = Entrez.efetch(db="protein", id= list_of_accession, rettype="ipg", retmode="tsv")
for line in input_handle:
    print(line, file=open('efetch_output.txt','a'))
input_handle.close()
#process file in pandas
file_name = "efetch_output.txt"
file_name_output = "final_output.tsv"
df = pd.read_csv(file_name, sep="\t", low_memory=False)
# Get names of indexes for which rows have to be dropped
indexNames = df[ df['Source'] == 'INSDC'].index
# Delete these row indexes from dataFrame
df.drop(indexNames , inplace=True)
#rearrange table columns
df = df[['ID', 'Source', 'Nucleotide Accession', 'Protein', 'Protein Name', 'Start', 'Stop', 'Strand', 'Organism',' Strain', 'Assembly']]
#Sort table on Assembly number ignoring GCF_
df['sort'] = df['Assembly'].str.extract('(\d+)', expand=False).astype(str)
df.sort_values('sort',inplace=True, ascending=True)
df = df.drop('sort', axis=1)
#drop all duplicates that're similar in indicated subset fields
df3=df.drop_duplicates(subset=['Start', 'Stop', 'Strand', 'Organism',' Strain', 'Assembly'],keep='first')
#sorts dataframe alphabetically by Organism and writes to csv
df3.sort_values(by = "Organism", axis=0, ascending=True, inplace=False).to_csv("final_parsed_output.tsv", "\t", index=False)
#get WP_X and GCF_X IDs in a tsv to input in FLAGs
new_dataframe1 = df3[['Assembly', 'Protein']]
new_dataframe2 = df3[['Organism',' Strain', 'Assembly', 'Protein']]
new_dataframe1.sort_values(by = "Protein", axis=0, ascending=True, inplace=False).to_csv('flags_input.tsv', '\t', header=False, columns = ['Assembly', 'Protein'])
new_dataframe2.sort_values(by = "Organism", axis=0, ascending=True, inplace=False).to_csv('flags_input_wstrains.tsv', '\t', header=False, columns = ['Organism',' Strain', 'Assembly', 'Protein'])





print ('program finished')
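
A side note on the file-writing step, separate from the error below: the script reopens efetch_output.txt for every line, and print() adds its own newline after a line that already ends in one. A minimal sketch of a tidier version follows. It assumes the handle yields tab-separated lines as either str or bytes (which one you get is not stated in the question and may differ between setups), and write_ipg_table is a hypothetical helper name, not part of the original script or of Biopython.

```python
import csv
from typing import Iterable, Union

# Header copied from the original script (the leading space in ' Strain' is kept on purpose).
HEADER = ['ID', 'Source', 'Nucleotide Accession', 'Start', 'Stop', 'Strand',
          'Protein', 'Protein Name', 'Organism', ' Strain', 'Assembly']


def write_ipg_table(lines: Iterable[Union[str, bytes]], path: str) -> int:
    """Write HEADER plus the given tab-separated lines to path.

    Hypothetical helper: opens the file once, decodes bytes if needed
    (assumption: UTF-8), skips blank lines, and returns the number of
    data rows written.
    """
    rows = 0
    with open(path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", quotechar='"',
                            quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
        writer.writerow(HEADER)
        for line in lines:
            if isinstance(line, bytes):
                line = line.decode("utf-8")
            line = line.rstrip("\r\n")
            if not line:
                continue  # ignore empty lines instead of writing them
            writer.writerow(line.split("\t"))
            rows += 1
    return rows
```

In the script this would be called as write_ipg_table(input_handle, 'efetch_output.txt') in place of the print loop. Whether the response already carries its own header row is not shown in the question, so check the first lines of the output file before relying on it.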

I don't know whether I can upload a csv here as an example for you to use, but the input is basically a list of proteins in a csv, like this:

WP_047566605.1 WP_043586512.1 WP_086526429.1 WP_043669791.1 WP_086513259.1 WP_086518190.1 WP_053774664.1 WP_012298127.1 WP_063071144.1 WP_012038522.1 WP_066595335.1 WP_088456184.1 WP_058743206.1 WP_042537210.1 WP_058724426.1

The error I get on Ubuntu MATE 20 is:

jj@p4:~/Documents/Bioinformatica/Bioinformatic/August/Codes/Etna$ python3 etna.py JJTEST.csv 
/usr/local/lib/python3.8/dist-packages/pandas/core/computation/expressions.py:68: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  return op(a, b)
Traceback (most recent call last):
  File "etna.py", line 44, in <module>
    df['sort'] = df['Assembly'].str.extract('(\d+)', expand=False).astype(str)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 5126, in __getattr__
    return object.__getattribute__(self, name)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/strings.py", line 2100, in __init__
    self._inferred_dtype = self._validate(data)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/strings.py", line 2157, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
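
For readers of the traceback: pandas raises this AttributeError whenever .str is used on a column that does not hold strings. A minimal sketch that triggers the same kind of failure is below. It assumes (this is not confirmed in the question) that the Assembly column was read as numeric, for instance because every value in it was missing; the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame: 'Assembly' holds only missing values,
# so pandas stores the column as float64 rather than as text.
df = pd.DataFrame({"Assembly": [float("nan"), float("nan")]})
print(df["Assembly"].dtype)  # float64

error_raised = False
try:
    # Same call as in the script, on a non-string column.
    df["Assembly"].str.extract(r"(\d+)", expand=False)
except AttributeError as exc:
    error_raised = True
    print("AttributeError:", exc)
```

Printing df.dtypes and df.head() right after pd.read_csv on the failing machine would show whether the columns were parsed as expected there.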

Answer 1

I don't fully understand what the problem was, but I changed the output file from txt to csv, and changed the tsv str to float. Now it is working.
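
The answer does not show the changed code. As a hedged sketch of one way to make the sorting step tolerate a non-string Assembly column, the helper below casts the column to str before using .str and then converts the extracted digits to numbers. assembly_sort_key is a hypothetical name, not taken from the original script, and the sample values are made up. This does not explain why the column was parsed differently on Ubuntu; it only keeps the sort from crashing.

```python
import pandas as pd


def assembly_sort_key(assembly: pd.Series) -> pd.Series:
    """Return the first run of digits in each value as a numeric sort key.

    Casting to str first means .str works even if pandas read the
    column as float (for example, when every value is missing).
    Values without digits become NaN.
    """
    digits = assembly.astype(str).str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits, errors="coerce")


# Hypothetical sample values in the GCF_ style used by the script.
good = pd.Series(["GCF_000010.1", "GCF_000002.1"])
print(assembly_sort_key(good).tolist())  # [10, 2]

# A column with only missing values no longer raises; it yields NaN keys.
empty = pd.Series([float("nan"), float("nan")])
print(assembly_sort_key(empty).isna().all())  # True
```

In the script, the line that builds the sort column could then read df['sort'] = assembly_sort_key(df['Assembly']). Unlike the original .astype(str) key, the numeric key sorts by value rather than character by character.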
