
How to read an unstructured CSV file like this into a pandas DataFrame?


When I do:

data = pd.read_csv('temp.csv',sep = ',',header = None)

I got:

0   age=Middle-aged,education=Bachelors,native-cou...
1   age=Middle-aged,education=Bachelors,native-cou...

The number of rows is correct, but how do I extract the variable names (headers) such as age, education, and native-country, and use the value after '=' as the value for each header?

You can split those long strings on the commas and stack it all into one big Series. Then extract the fields around the '=' to get the column name and the value. Pivot this to reshape back to one row per original index.

(df[0].str.split(',', expand=True).stack()
      .str.extractall(r'(?P<col>.*)=(?P<val>.*)')
      .reset_index([-1,-2], drop=True)
      .pivot(columns='col', values='val')
      .rename_axis(columns=None))

           age  education native-country   race
0  Middle-aged  Bachelors  United-States  White
1  Middle-aged  Bachelors  United-States  White

Sample Data

d = {0: {0: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White', 
         1: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White'}}
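For comparison, here is a rough plain-Python sketch of the same idea, run against the sample data above. It is not the chained pandas approach shown earlier, just an alternative: each string is split on commas, each piece is split once on '=', and the resulting dicts become rows. The helper name parse_line is made up for illustration, and the sketch assumes no value contains a comma.

```python
import pandas as pd

# Sample data from the answer above: one long "key=value,..." string per row.
d = {0: {0: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White',
         1: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White'}}
df = pd.DataFrame(d)


def parse_line(line):
    """Hypothetical helper: turn 'a=1,b=2' into {'a': '1', 'b': '2'}."""
    # Split on the first '=' only, so a value may itself contain '='.
    return dict(field.split('=', 1) for field in line.split(','))


# One dict per original row; the dict keys become the column names.
parsed = pd.DataFrame([parse_line(s) for s in df[0]], index=df.index)
print(parsed)
```

Because the column names come from the keys, rows with missing or reordered fields should still line up, with NaN where a key is absent.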

How about splitting on '=' and then taking the last element of the resulting list, using the pandas applymap function? This assumes the DataFrame already has one key=value pair per cell.

For example, this should do it:

df = df.applymap(lambda x: x.split('=')[-1])

           age  education
0  Middle-aged  Bachelors
1  Middle-aged  Bachelors
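A self-contained sketch of this idea follows, assuming the file really does split into one key=value cell per column when read (the two-column sample text is made up for illustration). It uses column-wise string methods rather than applymap, which newer pandas versions deprecate in favor of DataFrame.map, and it also recovers the column names from the first row, which the one-liner above does not do.

```python
import io
import pandas as pd

# Assumed input: unquoted lines, so read_csv splits them on commas.
raw = io.StringIO(
    "age=Middle-aged,education=Bachelors\n"
    "age=Middle-aged,education=Bachelors\n"
)
df = pd.read_csv(raw, header=None)

# Take the column names from the text before '=' in the first row.
names = [cell.split('=')[0] for cell in df.iloc[0]]

# Keep only the text after the last '=' in every cell, column by column.
values = df.apply(lambda col: col.str.split('=').str[-1])
values.columns = names
print(values)
```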

There are a number of ways to do this. If you know the column names, the simplest way is to use the converters argument to read_csv(). Pass in a dict mapping column names or numbers to a function. Here the function splits the string on the '=' and returns the part on the right.

converters = {n:lambda s:s.split('=')[1] for n in range(3)}

pd.read_csv(f, converters=converters, header=None, names='age education native-country'.split())

Returns:

    age         education   native-country
0   Middle-aged Bachelors   United States
1   Middle-aged Bachelors   United States
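To try the converters approach without a file on disk, a minimal sketch along these lines may help. The in-memory sample text and the helper name right_of_equals are assumptions for illustration; the sketch also assumes every line has exactly as many fields as there are names.

```python
import io
import pandas as pd

# Assumed input: three key=value fields per line, in a fixed order.
f = io.StringIO(
    "age=Middle-aged,education=Bachelors,native-country=United-States\n"
    "age=Middle-aged,education=Bachelors,native-country=United-States\n"
)

names = ['age', 'education', 'native-country']


def right_of_equals(s):
    """Hypothetical helper: return the text after the first '='."""
    return s.split('=', 1)[1]


# Key the converters by column position, as in the answer above.
converters = {i: right_of_equals for i in range(len(names))}

out = pd.read_csv(f, header=None, names=names, converters=converters)
print(out)
```

Note that this relies on the fields always appearing in the same order, since the names are supplied by hand rather than read from the data.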

An alternative way to make progress on this is to make sure that the input file is a valid CSV-formatted file (if it is possible to change the format of your temp.csv file).

In a CSV file, the values in each cell are not prefixed with the column name, so the lines in the file should look like this: Middle-aged,Bachelors,United-States,White rather than this: age=Middle-aged,education=Bachelors,native-country=United-States,race=White.
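If the file cannot be fixed at the source, one possible one-off conversion is sketched below using only the standard library. The function name to_valid_csv and the sample lines are made up for illustration, and the sketch assumes every line carries the same keys and that no value contains a comma.

```python
import csv
import io


def to_valid_csv(lines):
    """Hypothetical helper: convert 'key=value,...' lines to CSV text with a header row."""
    rows = [
        dict(field.split('=', 1) for field in line.strip().split(','))
        for line in lines
        if line.strip()  # skip blank lines
    ]
    buf = io.StringIO()
    # Use the keys of the first row as the header.
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sample_lines = [
    "age=Middle-aged,education=Bachelors",
    "age=Middle-aged,education=Bachelors",
]
csv_text = to_valid_csv(sample_lines)
print(csv_text)
```

The resulting text has an ordinary header row, so it should load with a plain pd.read_csv call and no extra arguments.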

