Create two dataframes using Pandas from a text file Python

I need to create two dataframes to work with my data, and I have thought about doing it with pandas.

This is the provided data:

class([1,0,0,0],"Small-molecule metabolism ").
class([1,1,0,0],"Degradation ").
class([1,1,1,0],"Carbon compounds ").
function(tb186,[1,1,1,0],'bglS',"beta-glucosidase").
function(tb2202,[1,1,1,0],'cbhK',"carbohydrate kinase").
function(tb727,[1,1,1,0],'fucA',"L-fuculose phosphate aldolase").
function(tb1731,[1,1,1,0],'gabD1',"succinate-semialdehyde dehydrogenase").
function(tb234,[1,1,1,0],'gabD2',"succinate-semialdehyde dehydrogenase").
function(tb501,[1,1,1,0],'galE1',"UDP-glucose 4-epimerase").
function(tb536,[1,1,1,0],'galE2',"UDP-glucose 4-epimerase").
function(tb620,[1,1,1,0],'galK',"galactokinase").
function(tb619,[1,1,1,0],'galT',"galactose-1-phosphate uridylyltransferase C-term").
function(tb618,[1,1,1,0],'galT',"null").
function(tb993,[1,1,1,0],'galU',"UTP-glucose-1-phosphate uridylyltransferase").
function(tb3696,[1,1,1,0],'glpK',"ATP:glycerol 3-phosphotransferase").
function(tb3255,[1,1,1,0],'manA',"mannose-6-phosphate isomerase").
function(tb3441,[1,1,1,0],'mrsA',"phosphoglucomutase or phosphomannomutase").
function(tb118,[1,1,1,0],'oxcA',"oxalyl-CoA decarboxylase").
function(tb3068,[1,1,1,0],'pgmA',"phosphoglucomutase").
function(tb3257,[1,1,1,0],'pmmA',"phosphomannomutase").
function(tb3308,[1,1,1,0],'pmmB',"phosphomannomutase").
function(tb2702,[1,1,1,0],'ppgK',"polyphosphate glucokinase").
function(tb408,[1,1,1,0],'pta',"phosphate acetyltransferase").
function(tb729,[1,1,1,0],'xylB',"xylulose kinase").
function(tb1096,[1,1,1,0],'null',"null").
class([1,1,2,0],"Amino acids and amines ").
function(tb1905,[1,1,2,0],'aao',"D-amino acid oxidase").
function(tb2531,[1,1,2,0],'adi',"ornithine/arginine decarboxylase").
function(tb2780,[1,1,2,0],'ald',"L-alanine dehydrogenase").
function(tb1538,[1,1,2,0],'ansA',"L-asparaginase").
function(tb1001,[1,1,2,0],'arcA',"arginine deiminase").
function(tb753,[1,1,2,0],'mmsA',"methylmalmonate semialdehyde dehydrogenase").
function(tb751,[1,1,2,0],'mmsB',"methylmalmonate semialdehyde oxidoreductase").

And I would like to have something like:

Class dataframe

Function dataframe

Is it possible with Pandas? Thanks in advance.

Yes, it is possible. Below is an example.
There are many ways to do it (some are already in other answers). In this example I tried to make each step clear in the code.

import io
import pandas as pd

with open("file.txt") as f:
    lines = f.readlines()  # reads your file line by line and returns a list

### sample:
# ['class([1,0,0,0],"Small-molecule metabolism ").\n',
#  'class([1,1,0,0],"Degradation ").\n',
#  'class([1,1,1,0],"Carbon compounds ").\n',
#  'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").\n', ... ]

df1 = []
df2 = []

for line in lines:
    # this transformation will be common to all lines
    line = line.strip(').\n').replace("[", '"[').replace("]", ']"')

    # here we will separate the lines, perform the specific transformation and append them to their specific variable
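    if line.startswith("class("):
        # slice off the prefix; str.strip removes a set of characters, not a prefix
        line = line[len("class("):]
        df1.append(line)
    elif line.startswith("function("):
        # str.strip("function(") would also eat the leading "t" of ids such as tb186
        line = line[len("function("):]
        df2.append(line)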

# in this final block we prepare the variable to be read with pandas and read
df1 = "\n".join(df1)  # prepare
df1 = pd.read_csv(
    io.StringIO(df1),  # as pandas expects a file handler, we use io.StringIO
    header=None,  # no headers, they are given "manually"
    names=['id', 'name'],  # headers
)

# the same as before
df2 = "\n".join(df2)
df2 = pd.read_csv(
    io.StringIO(df2),
    header=None,
    names=['orf', 'class', 'genName', 'desc']
)
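One detail to watch when removing the `class(` and `function(` prefixes: `str.strip` treats its argument as a set of characters, not as a literal prefix, so it can also remove the first letters of an ORF id such as `tb186`. A minimal sketch of the difference, using one line from the question after the common transformation (the helper name `drop_prefix` is made up for illustration; on Python 3.9 or later the built-in `str.removeprefix` does the same job):

```python
def drop_prefix(text, prefix):
    """Remove `prefix` from the start of `text` only if it is really there."""
    return text[len(prefix):] if text.startswith(prefix) else text


line = 'function(tb186,"[1,1,1,0]",\'bglS\',"beta-glucosidase"'

# strip() removes any leading/trailing characters found in "function(",
# so the "t" of "tb186" is removed as well
stripped = line.strip("function(")

# slicing removes the literal prefix and nothing else
sliced = drop_prefix(line, "function(")

print(stripped)
print(sliced)
```

The first printed line starts with `b186`, the second with `tb186`.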

I made a file with your text, and here is the code. You can repeat the same steps for df_func. Enjoy.

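import pandas as pd

cols = ['x','y']
df = pd.read_csv('1.txt',sep='(',names=cols, header=None)
df.head()
# .copy() avoids a SettingWithCopyWarning when new columns are added below
df_class = df[df['x']=='class'].copy()
df_func = df[df['x']=='function'].copy()
# n must be passed by keyword in pandas 2.0 and later
df_class[['y', 'z']] = df_class['y'].str.split(',"', n=1, expand=True)
df_class['z'] = df_class['z'].str[:-4]  # drop the trailing characters after the name
df_class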
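For the function rows, one possible way to repeat the idea is sketched below. It is an illustration rather than part of the answer above: the regular expression, the `quoting=3` argument (which is `csv.QUOTE_NONE`, so double quotes are kept as ordinary characters) and the column names `orf`, `class`, `genName`, `desc` are assumptions, and three lines from the question stand in for the file `1.txt`.

```python
import io
import pandas as pd

# Three lines from the question stand in for the file '1.txt'.
sample = io.StringIO(
    'class([1,1,1,0],"Carbon compounds ").\n'
    'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").\n'
    'function(tb618,[1,1,1,0],\'galT\',"null").\n'
)

# quoting=3 is csv.QUOTE_NONE: keep the double quotes as ordinary characters.
df = pd.read_csv(sample, sep='(', names=['x', 'y'], header=None, quoting=3)
df_func = df[df['x'] == 'function']

# Assumed layout of 'y': orf id, bracketed class id, 'gene name', "description").
pattern = r"""^(\w+),(\[[^\]]*\]),'([^']*)',"(.*)"\)\.$"""
parts = df_func['y'].str.extract(pattern)
parts.columns = ['orf', 'class', 'genName', 'desc']
df_func = parts.reset_index(drop=True)

print(df_func)
```

Here the `class` column still holds the id as a string such as `[1,1,1,0]`; converting it to a list would be a separate step.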

Read each line of the file, then, for each line:

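import ast

# line contains the line of the file for this iteration
line = line.strip()
if line.startswith("class("):
    # str.replace returns a new string, so build the list text explicitly
    line = ast.literal_eval("[" + line[len("class("):-len(").")] + "]")
    # class type stuff
elif line.startswith("function("):
    # the ORF id (e.g. tb186) is a bare word, so split it off before parsing
    orf, rest = line[len("function("):-len(").")].split(",", 1)
    line = [orf] + ast.literal_eval("[" + rest + "]")
    # function type stuff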

The resulting line variable will be the list of the elements for the line. Then you can do whatever you need with it.

Example: first line = [[1,0,0,0],"Small-molecule metabolism "]
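
To get from these per-line lists to the two dataframes asked for in the question, the lists can be collected and passed to `pd.DataFrame`. The following is a minimal sketch under a few assumptions: three lines from the question replace the file, `ast.literal_eval` is used for parsing, and the column names are borrowed from the first answer.

```python
import ast
import pandas as pd

# Three lines from the question stand in for the file contents.
lines = [
    'class([1,1,1,0],"Carbon compounds ").',
    "function(tb186,[1,1,1,0],'bglS',\"beta-glucosidase\").",
    "function(tb2202,[1,1,1,0],'cbhK',\"carbohydrate kinase\").",
]

class_rows, function_rows = [], []
for line in lines:
    line = line.strip()
    if line.startswith("class("):
        body = line[len("class("):-len(").")]
        class_rows.append(ast.literal_eval("[" + body + "]"))
    elif line.startswith("function("):
        body = line[len("function("):-len(").")]
        # the ORF id is a bare word, not a Python literal, so keep it as text
        orf, rest = body.split(",", 1)
        function_rows.append([orf] + ast.literal_eval("[" + rest + "]"))

df_class = pd.DataFrame(class_rows, columns=["id", "name"])
df_func = pd.DataFrame(function_rows, columns=["orf", "class", "genName", "desc"])

print(df_class)
print(df_func)
```

With this approach the class ids are stored as Python lists in the `id` and `class` columns, which is convenient for inspection but not hashable; converting them to tuples or strings may be preferable if they are used as keys.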
