简体   繁体   中英

How to create a pandas dataframe from a txt file with comments?

I need to create a pandas dataframe based on 4 txt files with comments (to skip while reading) based on the following structure:

# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=Tondeuse

# Propriétés générales
hauteur=0.5
masse=20.0
prix=110.00

# Propriétés du moteur
impulsion specifique=80

and

# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=Civic VTEC

# Propriétés générales
hauteur=2.0
masse=3000.0
prix=2968.00

# Propriétés du moteur
impulsion specifique=205

and

# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=VelociRAPTOR

# Propriétés générales
hauteur=4.0
masse=2000.0
prix=6000.00

# Propriétés du moteur
impulsion specifique=250

and

# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=La Puissance

# Propriétés générales
hauteur=12.0
masse=15000.0
prix=39000.00

# Propriétés du moteur
impulsion specifique=295

That's the result I need to have:

            nom  hauteur    masse     prix  impulsion specifique
0      Tondeuse      0.5     20.0    110.0                    80
1    Civic VTEC      2.0   3000.0   2968.0                   205
2  VelociRAPTOR      4.0   2000.0   6000.0                   250
3  La Puissance     12.0  15000.0  39000.0                   295

I don't know if it's possible, but that's what i was asked to do

Your data files look very close to configuration files. You can use configparser to generate a dictionary from each file:

from pathlib import Path
from configparser import ConfigParser

data = []
for file in Path("data").glob("*.txt"):
    parser = ConfigParser()
    # INI file requires a section header. Yours don't have one.
    # So let's give it one called DEFAULT
    parser.read_string("[DEFAULT]\n" + file.read_text())
    data.append(dict(parser.items("DEFAULT")))

df = pd.DataFrame(data)

welcome to Stackoverflow: :)

If your txt files have their content like you just showed, you could read them in using pandas as a CSV file.

The pandas.read_csv function has multiple things that will help you here:

  • It outputs a dataframe, which is the format you would like to end up with
  • Has a comment input argument, with which you can define lines that are to be ignored
  • You can use the = sign as a separator, which will make you able to split up your data in the wanted sections

Now, let's try to read one of your files using the read_csv function:

import pandas as pd
df = pd.read_csv(file, comment='#', sep='=', header=None)
df
                    nom  Tondeuse                                                                                                                                                                                                                                               
0               hauteur       0.5                                                                                                                                                                                                                                               
1                 masse      20.0                                                                                                                                                                                                                                               
2                  prix     110.0                                                                                                                                                                                                                                               
3  impulsion specifique      80.0

We're not completely there yet. We want to remove that index column that gives no info, and we want to transpose the dataframe (rows <-> columns) to be able to concatenate all dataframes together. Let's do it!

import pandas as pd
df = pd.read_csv(file, comment='#', sep='=', header=None, index_col=0).T
df

0       nom hauteur masse    prix impulsion specifique                                                                                                                                                                                                                          
1  Tondeuse     0.5  20.0  110.00                   80

That's looking way better! Putting index_col=0 makes the lefternmost column be the index column, and the .T at the end transposes your dataframe. Now we just need to put this inside of a loop and make a complete script out of it!

import pandas as pd
import glob
import os

files = glob.glob(os.path.join(path, '*.csv'))

all_dfs = []
for file in files:
    current_df = pd.read_csv(file, comment='#', sep='=', header=None, index_col=0).T
    all_dfs.append(current_df)

total_df = pd.concat(all_dfs)
total_df

0           nom hauteur    masse      prix impulsion specifique                                                                                                                                                                                                                 
1  La Puissance    12.0  15000.0  39000.00                  295                                                                                                                                                                                                                 
1    Civic VTEC     2.0   3000.0   2968.00                  205                                                                                                                                                                                                                 
1  VelociRAPTOR     4.0   2000.0   6000.00                  250                                                                                                                                                                                                                 
1      Tondeuse     0.5     20.0    110.00                   80

Notice that you still have that lefternmost column with the index number, I did not clean it out because I wasn't sure of what you wanted there.

Also, importantly, you need to be aware that if there is a slight difference in the names of the columns in your files (eg impulsion specifique vs impulsion spécifique ) this will bring errors. You will need to create error handling procedures for these. Or maybe enforcing a certain schema, but that is out of the scope of this question.

I hope this helps!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM