
Creating pandas dataframe from a list of strings

I have the following list:

list_vals = ['col_a col_B col_C', '12.0 34.0 10.0', '15.0 111.0 23']

How can I convert it into a pandas dataframe?

I can start like this:

df = pd.DataFrame(columns=list_vals[0].split())

Is there a way to populate the rest of the dataframe?

You could use io.StringIO to feed a string into read_csv:

In [23]: pd.read_csv(io.StringIO('\n'.join(list_vals)), sep=r'\s+')
Out[23]: 
   col_a  col_B  col_C
0   12.0   34.0   10.0
1   15.0  111.0   23.0

This has the advantage that pandas automatically infers the types, just as it would when reading an ordinary CSV file, so the columns are floats:

In [24]: _.dtypes
Out[24]: 
col_a    float64
col_B    float64
col_C    float64
dtype: object
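As a small sketch of that type inference with mixed columns, the same call can be run on a list that mixes text, integers and floats. The mixed_vals list and its column names are hypothetical, made up for illustration; sep=r'\s+' is used here to split on any run of whitespace.

```python
import io

import pandas as pd

# Hypothetical data (not from the question): one text, one integer, one float column
mixed_vals = ['name count value', 'foo 1 12.5', 'bar 2 7.0']

# Join the lines into one string and let read_csv parse it like a file
df = pd.read_csv(io.StringIO('\n'.join(mixed_vals)), sep=r'\s+')

# Each column gets its own inferred type
print(df.dtypes)
```

Here the count column should come out as an integer type and value as a float type, while name stays textual.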

While you could just feed your list into the DataFrame constructor directly, everything would stay strings:

In [21]: pd.DataFrame(columns=list_vals[0].split(), 
                      data=[row.split() for row in list_vals[1:]])
Out[21]: 
  col_a  col_B col_C
0  12.0   34.0  10.0
1  15.0  111.0    23

In [22]: _.dtypes
Out[22]: 
col_a    object
col_B    object
col_C    object
dtype: object

We could pass dtype=float to the constructor to fix this, of course, but with mixed column types we would have to convert each column manually, whereas the read_csv approach infers them in the usual way.
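One way the manual conversion might look, as a rough sketch: build the string DataFrame, then try pd.to_numeric on each column and keep the column unchanged if it is not numeric. The helper name to_numeric_if_possible is an assumption introduced here, not part of pandas.

```python
import pandas as pd

list_vals = ['col_a col_B col_C', '12.0 34.0 10.0', '15.0 111.0 23']

# First line is the header, the remaining lines are the rows
header, *rows = (line.split() for line in list_vals)
df = pd.DataFrame(rows, columns=header)  # every column holds strings here

def to_numeric_if_possible(col):
    """Hypothetical helper: convert a column to numbers, or leave it as-is."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# apply() calls the helper once per column
df = df.apply(to_numeric_if_possible)
print(df.dtypes)
```

This is more code than the read_csv route, which is the trade-off described above.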

You can do it by converting your data to a dict, e.g.:

>>> pd.DataFrame({a: b for a, *b in (zip(*map(str.split, list_vals)))})
   col_B col_C col_a
0   34.0  10.0  12.0
1  111.0    23  15.0

Or, to keep your original column order (dicts do not guarantee insertion order before Python 3.7), pass columns explicitly:

>>> pd.DataFrame({a: b for a, *b in (zip(*map(str.split, list_vals)))},
...              columns=list_vals[0].split())
  col_a  col_B col_C
0  12.0   34.0  10.0
1  15.0  111.0    23
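The dict comprehension above packs several steps into one line. Unrolled, it might look like the sketch below; the intermediate names rows, columns and data are illustrative only.

```python
list_vals = ['col_a col_B col_C', '12.0 34.0 10.0', '15.0 111.0 23']

# Step 1: split every line into its whitespace-separated fields
rows = [line.split() for line in list_vals]

# Step 2: transpose, so each tuple is (column name, value 1, value 2, ...)
columns = list(zip(*rows))

# Step 3: the first item of each tuple is the key, the rest are the values
data = {name: list(values) for name, *values in columns}

print(data)
```

The values are still strings at this point, which is why the resulting DataFrame columns have dtype object.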

You can read this as a NumPy structured array, then pass it to pandas. This is useful if you also need to work with NumPy and know the data types before reading (otherwise NumPy is less convenient to work with than pandas).

import numpy as np
import pandas as pd

list_vals = ['col_a col_B col_C', '12.0 34.0 10.0', '15.0 111.0 23']

# Gather names from first line, assume all column types are 'd' (i.e. float)
list_dtype = np.dtype([(name, 'd') for name in list_vals[0].split()])

# Create a numpy structured array
ar = np.fromiter((tuple(x.split()) for x in list_vals[1:]), dtype=list_dtype)

# Now convert it to a pandas DataFrame
dat = pd.DataFrame(ar)
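A variant of the same idea, as a hedged sketch: here the strings are converted to float explicitly and the array is built with np.array instead of np.fromiter (a swapped-in call, not the one used above), followed by a quick check of the result. The names names and records are illustrative.

```python
import numpy as np
import pandas as pd

list_vals = ['col_a col_B col_C', '12.0 34.0 10.0', '15.0 111.0 23']

# Column names come from the first line; all fields are assumed to be float64
names = list_vals[0].split()
list_dtype = np.dtype([(name, 'f8') for name in names])

# Convert each field to float up front, one tuple per row
records = [tuple(float(v) for v in line.split()) for line in list_vals[1:]]

# Structured array: fields are addressable by name on the NumPy side
ar = np.array(records, dtype=list_dtype)
print(ar['col_B'])

# The DataFrame picks up column names and dtypes from the structured array
dat = pd.DataFrame(ar)
print(dat.dtypes)
```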
