Create two dataframes using Pandas from a text file in Python
I need to create two dataframes to work with my data, and I thought about using pandas to do it.
This is the data provided:
class([1,0,0,0],"Small-molecule metabolism ").
class([1,1,0,0],"Degradation ").
class([1,1,1,0],"Carbon compounds ").
function(tb186,[1,1,1,0],'bglS',"beta-glucosidase").
function(tb2202,[1,1,1,0],'cbhK',"carbohydrate kinase").
function(tb727,[1,1,1,0],'fucA',"L-fuculose phosphate aldolase").
function(tb1731,[1,1,1,0],'gabD1',"succinate-semialdehyde dehydrogenase").
function(tb234,[1,1,1,0],'gabD2',"succinate-semialdehyde dehydrogenase").
function(tb501,[1,1,1,0],'galE1',"UDP-glucose 4-epimerase").
function(tb536,[1,1,1,0],'galE2',"UDP-glucose 4-epimerase").
function(tb620,[1,1,1,0],'galK',"galactokinase").
function(tb619,[1,1,1,0],'galT',"galactose-1-phosphate uridylyltransferase C-term").
function(tb618,[1,1,1,0],'galT',"null").
function(tb993,[1,1,1,0],'galU',"UTP-glucose-1-phosphate uridylyltransferase").
function(tb3696,[1,1,1,0],'glpK',"ATP:glycerol 3-phosphotransferase").
function(tb3255,[1,1,1,0],'manA',"mannose-6-phosphate isomerase").
function(tb3441,[1,1,1,0],'mrsA',"phosphoglucomutase or phosphomannomutase").
function(tb118,[1,1,1,0],'oxcA',"oxalyl-CoA decarboxylase").
function(tb3068,[1,1,1,0],'pgmA',"phosphoglucomutase").
function(tb3257,[1,1,1,0],'pmmA',"phosphomannomutase").
function(tb3308,[1,1,1,0],'pmmB',"phosphomannomutase").
function(tb2702,[1,1,1,0],'ppgK',"polyphosphate glucokinase").
function(tb408,[1,1,1,0],'pta',"phosphate acetyltransferase").
function(tb729,[1,1,1,0],'xylB',"xylulose kinase").
function(tb1096,[1,1,1,0],'null',"null").
class([1,1,2,0],"Amino acids and amines ").
function(tb1905,[1,1,2,0],'aao',"D-amino acid oxidase").
function(tb2531,[1,1,2,0],'adi',"ornithine/arginine decarboxylase").
function(tb2780,[1,1,2,0],'ald',"L-alanine dehydrogenase").
function(tb1538,[1,1,2,0],'ansA',"L-asparaginase").
function(tb1001,[1,1,2,0],'arcA',"arginine deiminase").
function(tb753,[1,1,2,0],'mmsA',"methylmalmonate semialdehyde dehydrogenase").
function(tb751,[1,1,2,0],'mmsB',"methylmalmonate semialdehyde oxidoreductase").
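I would like to end up with something like this:
Is this possible with pandas? Thanks in advance.
Yes, it is possible. Below is an example.
There are many ways to do this (some are already in the other answers). In this example I tried to make the steps in the code clear.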
import io
import pandas as pd

with open("file.txt") as f:
    lines = f.readlines()  # reads your file line by line and returns a list

### sample:
# ['class([1,0,0,0],"Small-molecule metabolism ").\n',
#  'class([1,1,0,0],"Degradation ").\n',
#  'class([1,1,1,0],"Carbon compounds ").\n',
#  'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").\n', ... ]

df1 = []
df2 = []
for line in lines:
    # this transformation is common to all lines: drop the trailing ")." and
    # wrap the [...] list in double quotes so its commas are not read as separators
    line = line.strip(').\n').replace("[", '"[').replace("]", ']"')
    # here we separate the lines, perform the specific transformation and append them to their specific variable
    if line.startswith("class("):
        # slice off the prefix; str.strip("class(") removes a set of characters, not a prefix
        line = line[len("class("):]
        df1.append(line)
    elif line.startswith("function("):
        # str.strip("function(") would also remove the leading "t" of "tb186"
        line = line[len("function("):]
        df2.append(line)

# in this final block we prepare the variable to be read with pandas and read it
df1 = "\n".join(df1)  # prepare
df1 = pd.read_csv(
    io.StringIO(df1),      # as pandas expects a file handle, we use io.StringIO
    header=None,           # no headers, they are given "manually"
    names=['id', 'name'],  # headers
)

# the same as before
df2 = "\n".join(df2)
df2 = pd.read_csv(
    io.StringIO(df2),
    header=None,
    names=['orf', 'class', 'genName', 'desc'],
)
df2['genName'] = df2['genName'].str.strip("'")  # remove the single quotes kept from the source file
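As an illustration of what the two frames can then be used for (this step is not part of the answer above), the class name can be attached to each function row by matching the `class` column of `df2` against the `id` column of `df1`. The sketch below builds two small stand-in frames by hand, using values from the question, so that it runs on its own; the name `merged` is made up for the example.

```python
import pandas as pd

# Small hand-built stand-ins for the two frames (values taken from the question).
df1 = pd.DataFrame({
    "id": ["[1,1,1,0]", "[1,1,2,0]"],
    "name": ["Carbon compounds ", "Amino acids and amines "],
})
df2 = pd.DataFrame({
    "orf": ["tb186", "tb1905"],
    "class": ["[1,1,1,0]", "[1,1,2,0]"],
    "genName": ["bglS", "aao"],
    "desc": ["beta-glucosidase", "D-amino acid oxidase"],
})

# Left join: keep every function row and look up its class name.
# The redundant "id" column from df1 is dropped afterwards.
merged = df2.merge(df1, left_on="class", right_on="id", how="left").drop(columns="id")
print(merged)
```

Because both columns hold the class id as the same string (for example `[1,1,1,0]`), a plain equality join is enough here.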
I made a file from your text. Here is the code. You can repeat it for df_func. Enjoy.
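import pandas as pd

cols = ['x', 'y']
# split each line at "(": x is "class" or "function", y is the rest of the line
df = pd.read_csv('1.txt', sep='(', names=cols, header=None)
df.head()
df_class = df[df['x'] == 'class'].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
df_func = df[df['x'] == 'function'].copy()
# split y into the id list and the name at the first ',"' (n must be passed by keyword in recent pandas)
df_class[['y', 'z']] = df_class['y'].str.split(',"', n=1, expand=True)
df_class['z'] = df_class['z'].str[:-4]  # drop the trailing ' ").' (space, quote, parenthesis, period)
df_class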
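The answer leaves `df_func` to the reader. One possible way to handle it, sketched here under the assumption that every function line has exactly the shape shown in the question, is `Series.str.extract` with named groups rather than repeated splits, since the class list itself contains commas. The `pattern` string and the `io.StringIO` stand-in for `1.txt` are assumptions made for this example.

```python
import io
import pandas as pd

# Stand-in for '1.txt' so the sketch runs on its own.
text = '''class([1,1,1,0],"Carbon compounds ").
function(tb186,[1,1,1,0],'bglS',"beta-glucosidase").
function(tb2202,[1,1,1,0],'cbhK',"carbohydrate kinase").
'''
df = pd.read_csv(io.StringIO(text), sep='(', names=['x', 'y'], header=None)
df_func = df[df['x'] == 'function'].copy()

# y looks like: tb186,[1,1,1,0],'bglS',"beta-glucosidase").
# One named group per field; the column names come from the group names.
pattern = r"""^(?P<orf>\w+),(?P<cls>\[[\d,]+\]),'(?P<genName>.*)',"(?P<desc>.*)"\)\.$"""
df_func = df_func['y'].str.extract(pattern).rename(columns={'cls': 'class'})
print(df_func)
```

Rows that do not match the pattern come back as NaN in every column, which makes malformed lines easy to spot.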
Read each line of the file and then, for each line:
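import ast

# line contains the line of the file for this iteration
line = line.strip()
if line.startswith("class("):
    # turn 'class(...).' into a list literal '[...]' and parse it safely
    line = "[" + line[len("class("):-len(").")] + "]"
    line = ast.literal_eval(line)
    # class type stuff
elif line.startswith("function("):
    # the ORF id (e.g. tb186) is a bare word, so split it off and parse only the rest
    orf, rest = line[len("function("):-len(").")].split(",", 1)
    line = [orf] + ast.literal_eval("[" + rest + "]")
    # function type stuff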
The resulting line variable will be a list of that line's elements. You can then do whatever you need with it.
Example: first line = [[1,0,0,0],"Small-molecule metabolism "]
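To get from those per-line lists to the two dataframes the question asks for, the lists can be collected into two groups and passed to `pd.DataFrame`. The following is a minimal sketch, not the answerer's code: `parse_line` is a hypothetical helper, the three sample lines are copied from the question, and the column names are borrowed from the first answer.

```python
import ast
import pandas as pd

lines = [
    'class([1,0,0,0],"Small-molecule metabolism ").',
    'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").',
    'function(tb618,[1,1,1,0],\'galT\',"null").',
]

def parse_line(line):
    """Hypothetical helper: return (kind, elements) for one line of the file."""
    line = line.strip()
    kind, _, body = line.partition("(")  # kind is "class" or "function"
    body = body[:-len(").")]             # drop the trailing ")."
    if kind == "function":
        # The ORF id is a bare word, so keep it as a string and parse the rest.
        orf, rest = body.split(",", 1)
        return kind, [orf] + ast.literal_eval("[" + rest + "]")
    return kind, ast.literal_eval("[" + body + "]")

class_rows, func_rows = [], []
for line in lines:
    kind, elements = parse_line(line)
    (class_rows if kind == "class" else func_rows).append(elements)

df_class = pd.DataFrame(class_rows, columns=["id", "name"])
df_func = pd.DataFrame(func_rows, columns=["orf", "class", "genName", "desc"])
```

With this approach the class id is stored as a real Python list of integers in each cell, not as a string, which is convenient for indexing into the hierarchy levels but means the column cannot be used directly as a merge key.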