简体   繁体   English

使用来自文本文件 Python 的 Pandas 创建两个数据帧

[英]Create two dataframes using Pandas from a text file Python

I need to create two dataframes to operate my data and I have thinked about doing it with pandas.我需要创建两个数据框来操作我的数据,并且我考虑过使用 pandas 来做这件事。

This is the provided data:这是提供的数据:

class([1,0,0,0],"Small-molecule metabolism ").
class([1,1,0,0],"Degradation ").
class([1,1,1,0],"Carbon compounds ").
function(tb186,[1,1,1,0],'bglS',"beta-glucosidase").
function(tb2202,[1,1,1,0],'cbhK',"carbohydrate kinase").
function(tb727,[1,1,1,0],'fucA',"L-fuculose phosphate aldolase").
function(tb1731,[1,1,1,0],'gabD1',"succinate-semialdehyde dehydrogenase").
function(tb234,[1,1,1,0],'gabD2',"succinate-semialdehyde dehydrogenase").
function(tb501,[1,1,1,0],'galE1',"UDP-glucose 4-epimerase").
function(tb536,[1,1,1,0],'galE2',"UDP-glucose 4-epimerase").
function(tb620,[1,1,1,0],'galK',"galactokinase").
function(tb619,[1,1,1,0],'galT',"galactose-1-phosphate uridylyltransferase C-term").
function(tb618,[1,1,1,0],'galT',"null").
function(tb993,[1,1,1,0],'galU',"UTP-glucose-1-phosphate uridylyltransferase").
function(tb3696,[1,1,1,0],'glpK',"ATP:glycerol 3-phosphotransferase").
function(tb3255,[1,1,1,0],'manA',"mannose-6-phosphate isomerase").
function(tb3441,[1,1,1,0],'mrsA',"phosphoglucomutase or phosphomannomutase").
function(tb118,[1,1,1,0],'oxcA',"oxalyl-CoA decarboxylase").
function(tb3068,[1,1,1,0],'pgmA',"phosphoglucomutase").
function(tb3257,[1,1,1,0],'pmmA',"phosphomannomutase").
function(tb3308,[1,1,1,0],'pmmB',"phosphomannomutase").
function(tb2702,[1,1,1,0],'ppgK',"polyphosphate glucokinase").
function(tb408,[1,1,1,0],'pta',"phosphate acetyltransferase").
function(tb729,[1,1,1,0],'xylB',"xylulose kinase").
function(tb1096,[1,1,1,0],'null',"null").
class([1,1,2,0],"Amino acids and amines ").
function(tb1905,[1,1,2,0],'aao',"D-amino acid oxidase").
function(tb2531,[1,1,2,0],'adi',"ornithine/arginine decarboxylase").
function(tb2780,[1,1,2,0],'ald',"L-alanine dehydrogenase").
function(tb1538,[1,1,2,0],'ansA',"L-asparaginase").
function(tb1001,[1,1,2,0],'arcA',"arginine deiminase").
function(tb753,[1,1,2,0],'mmsA',"methylmalmonate semialdehyde dehydrogenase").
function(tb751,[1,1,2,0],'mmsB',"methylmalmonate semialdehyde oxidoreductase").

And I would like to have something like:我想有类似的东西:

类数据框

函数数据名

Is it possible with Pandas? Pandas 可以吗? Thanks is advance,谢谢提前,

Yes it is possible.对的,这是可能的。 Bellow is an example.贝娄就是一个例子。
There are many ways for doing it (some already in other answers).有很多方法可以做到这一点(一些已经在其他答案中)。 In this example I tried to make the steps clearer in the code.在此示例中,我尝试使代码中的步骤更清晰。

import io
import pandas as pd

with open("file.txt") as f:
    lines = f.readlines()  # reads your file line by line and returns a list

### sample:
# ['class([1,0,0,0],"Small-molecule metabolism ").\n',
#  'class([1,1,0,0],"Degradation ").\n',
#  'class([1,1,1,0],"Carbon compounds ").\n',
#  'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").\n', ... ]

df1 = []
df2 = []

for line in lines:
    # this transformation will be common to all lines
    line = line.strip(').\n').replace("[", '"[').replace("]", ']"')

    # here we will separate the lines, perform the specific transformation and append them to their specific variable
    if line.startswith("class"):
        line = line.strip("class(")  # specific transform for "class" line
        df1.append(line)
    else:
        line = line.strip("function(")  # specific transform for "function" line
        df2.append(line)

# in this final block we prepare the variable to be read with pandas and read
df1 = "\n".join(df1)  # prepare
df1 = pd.read_csv(
    io.StringIO(df1),  # as pandas expects a file handler, we use io.StringIO
    header=None,  # no headers, they are given "manually"
    names=['id', 'name'],  # headers
)

# the same as before
df2 = "\n".join(df2)
df2 = pd.read_csv(
    io.StringIO(df2),
    header=None,
    names=['orf', 'class', 'genName', 'desc']
)

I make a file with your text.我用你的文字制作了一个文件。 and here's the code.这是代码。 you can repeat it for df_func.您可以对 df_func 重复它。 enjoy.请享用。

cols = ['x','y']
df = pd.read_csv('1.txt',sep='(',names=cols, header=None)
df.head()
df_class = df[df['x']=='class']
df_func = df[df['x']=='function']
df_class[['y', 'z']] =df_class['y'].str.split(',"', 1, expand=True)
df_class['z'] = df_class['z'].str[:-4]
df_class

Read each line of the file, then, for each line:读取文件的每一行,然后,对于每一行:

# line contains the line of the file for this iteration
if line.startswith("class"):
    line.replace("class(","[").replace(").","]")
    line = eval(line)
    # class type stuff
elif line.startswith("function"):
    line.replace("function(","[").replace(").","]")
    line = eval(line)
    # function type stuff

The resulting line variable will be the list of the elements for the line.结果线变量将是该线的元素列表。 Then you can do whatever you need with it.然后你可以用它做任何你需要的事情。

Example: first line = [[1,0,0,0],"Small-molecule metabolism "]示例:第一行 = [[1,0,0,0],"Small-molecule metabolism "]

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM