When I do:
data = pd.read_csv('temp.csv',sep = ',',header = None)
I get:
0 age=Middle-aged,education=Bachelors,native-cou...
1 age=Middle-aged,education=Bachelors,native-cou...
The number of rows is correct, but how do I extract the variable names (headers) such as age, education, and native-country, and use the text after '=' as the value for each header?
You can split those long strings on the commas and stack it all into one big Series. Then extract the fields around the '=' to get the column name and the value. Pivot this to reshape back to one row per original index.
(df[0].str.split(',', expand=True).stack()
.str.extractall(r'(?P<col>.*)=(?P<val>.*)')
.reset_index([-1,-2], drop=True)
.pivot(columns='col', values='val')
.rename_axis(columns=None))
age education native-country race
0 Middle-aged Bachelors United-States White
1 Middle-aged Bachelors United-States White
# Sample data used above
d = {0: {0: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White',
         1: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White'}}
df = pd.DataFrame(d)
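As a point of comparison, here is a minimal sketch of a different technique that reaches the same table without the stack/extract/pivot chain: parse each line into a dict in plain Python and let the DataFrame constructor line up the keys. The helper name parse_key_value_line and the hard-coded lines are only illustrative, and the sketch assumes no value contains a comma.

```python
import pandas as pd


def parse_key_value_line(line):
    """Turn 'a=1,b=2' into {'a': '1', 'b': '2'} (hypothetical helper)."""
    # Split on the first '=' only, so values may themselves contain '='.
    return dict(item.split('=', 1) for item in line.split(','))


# Illustrative rows in the same shape as the sample data above.
lines = [
    'age=Middle-aged,education=Bachelors,native-country=United-States,race=White',
    'age=Middle-aged,education=Bachelors,native-country=United-States,race=White',
]

# A list of dicts becomes one row per dict, with the dict keys as columns.
parsed = pd.DataFrame([parse_key_value_line(line) for line in lines])
print(parsed)
```

A row that lacks one of the keys would simply get a missing value in that column, which may or may not be what you want.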
How about splitting on '=' and then taking the last element of the list with the pandas applymap function (renamed to DataFrame.map in pandas 2.1)? For example, this should do it:
df = df.applymap(lambda x: x.split('=')[-1])
age education
0 Middle-aged Bachelors
1 Middle-aged Bachelors
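Note that the element-wise split only keeps the values, so the column labels still have to come from somewhere. A rough, self-contained sketch that takes them from the first row might look like this; the frame raw is made-up sample data, and it assumes every row lists the same keys in the same order.

```python
import pandas as pd

# Made-up frame in which each cell still holds a 'key=value' string,
# as you would get from read_csv(..., sep=',', header=None).
raw = pd.DataFrame([
    ['age=Middle-aged', 'education=Bachelors'],
    ['age=Middle-aged', 'education=Bachelors'],
])

# Header names are whatever sits to the left of '=' in the first row.
headers = [cell.split('=', 1)[0] for cell in raw.iloc[0]]

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
elementwise = raw.map if hasattr(raw, 'map') else raw.applymap
result = elementwise(lambda cell: cell.split('=', 1)[-1])
result.columns = headers
print(result)
```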
There are a number of ways to do this. If you know the column names, the simplest way is to use the converters argument to read_csv(). Pass in a dict mapping column names or numbers to a function. Here the function splits the string on the = and returns the part on the right.
# f is the path (or open file object) for the input file, e.g. 'temp.csv'
converters = {n: lambda s: s.split('=')[1] for n in range(3)}
pd.read_csv(f, converters=converters, header=None, names='age education native-country'.split())
Returns:
age education native-country
0 Middle-aged Bachelors United States
1 Middle-aged Bachelors United States
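If you would rather not hard-code the names, one possible variation is to read them from the first line and key the converters by name. This is only a sketch: the in-memory csv_text stands in for the real file, value_after_equals is an illustrative helper, and it assumes every line has the same keys in the same order.

```python
import io

import pandas as pd

# Stand-in for the real file contents.
csv_text = (
    "age=Middle-aged,education=Bachelors,native-country=United-States,race=White\n"
    "age=Middle-aged,education=Bachelors,native-country=United-States,race=White\n"
)


def value_after_equals(cell):
    """Return the text to the right of the first '=' (illustrative helper)."""
    return cell.split('=', 1)[1]


# Column names are the keys found on the first line.
first_line = csv_text.splitlines()[0]
names = [item.split('=', 1)[0] for item in first_line.split(',')]

# One converter per column, keyed by column name.
converters = {name: value_after_equals for name in names}

df = pd.read_csv(io.StringIO(csv_text), header=None, names=names,
                 converters=converters)
print(df)
```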
An alternative way to make progress on this is to make sure that the input file is a valid CSV-formatted file (if it is possible to change the format of your temp.csv file).
In a CSV file, the values in each cell are not prefixed with the column name, so the lines in the file should look like this:
Middle-aged,Bachelors,United-States,White
rather than this:
age=Middle-aged,education=Bachelors,native-country=United-States,race=White
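If you cannot change whatever produces the file, a one-off conversion is another option. The sketch below uses only the standard library; the function name key_value_lines_to_csv and the sample lines are hypothetical, and it assumes every line carries the same keys in the same order and that no value contains a comma.

```python
import csv
import io


def key_value_lines_to_csv(lines):
    """Rewrite 'k1=v1,k2=v2' lines as CSV text with a single header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    header_written = False
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pairs = [item.split('=', 1) for item in line.split(',')]
        if not header_written:
            # The keys of the first data line become the header.
            writer.writerow([key for key, _ in pairs])
            header_written = True
        writer.writerow([value for _, value in pairs])
    return out.getvalue()


sample_lines = [
    'age=Middle-aged,education=Bachelors,native-country=United-States,race=White',
    'age=Middle-aged,education=Bachelors,native-country=United-States,race=White',
]
csv_text = key_value_lines_to_csv(sample_lines)
print(csv_text)
```

The resulting text can be written back to disk and then read with a plain pd.read_csv call, with no converters needed.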