How do I read an unstructured CSV file like this into a pandas DataFrame?

Question

When I do:

data = pd.read_csv('temp.csv',sep = ',',header = None)

I get:

0   age=Middle-aged,education=Bachelors,native-cou...
1   age=Middle-aged,education=Bachelors,native-cou...

The number of rows is correct, but how do I extract the variable names (headers) such as age, education and native-country, and use the value after '=' as the value for each header?

Answer 1

You can split those long strings on the commas and stack the pieces into one long Series. Then extract the fields on either side of the '=' to get the column name and the value. Finally, pivot to reshape back to one row per original index.

(df[0].str.split(',', expand=True).stack()
      .str.extractall(r'(?P<col>.*)=(?P<val>.*)')
      .reset_index([-1,-2], drop=True)
      .pivot(columns='col', values='val')
      .rename_axis(columns=None))

           age  education native-country   race
0  Middle-aged  Bachelors  United-States  White
1  Middle-aged  Bachelors  United-States  White

Sample Data

d = {0: {0: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White',
         1: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White'}}
df = pd.DataFrame(d)  # single column 0 holding the raw "key=value,..." strings
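As a cross-check on the split/stack/pivot pipeline, the same reshaping can be sketched in plain Python by parsing each line into a dict. This is a different technique from the one in the answer, and the helper name parse_line is hypothetical. It assumes every token contains an '=' and that no value contains a comma.

```python
import pandas as pd

d = {0: {0: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White',
         1: 'age=Middle-aged,education=Bachelors,native-country=United-States,race=White'}}
df = pd.DataFrame(d)


def parse_line(line):
    """Turn 'k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'} (hypothetical helper)."""
    # Split on the first '=' only, so a value may itself contain '='.
    return dict(token.split('=', 1) for token in line.split(','))


# One dict per row; pandas builds the columns from the dict keys.
parsed = pd.DataFrame([parse_line(s) for s in df[0]], index=df.index)
print(parsed)
```

Unlike the pivot, which sorts the column labels, this keeps the columns in the order the keys appear in the file.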

Answer 2

How about splitting on '=' and then taking the last element of each resulting list, using the pandas applymap function?

For example, this should do it:

df = df.applymap(lambda x: x.split('=')[-1])

           age  education
0  Middle-aged  Bachelors
1  Middle-aged  Bachelors
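The one-liner only replaces the cell contents, so the column labels still have to be set separately. A minimal sketch of one way to do both, assuming the file was read with header=None into one column per "key=value" token and that every row lists the keys in the same order. The two-column sample and the variable names are made up for illustration. Since applymap is deprecated in pandas 2.1 and later in favor of DataFrame.map, the sketch picks whichever is available.

```python
import io

import pandas as pd

# Stand-in for temp.csv (assumed layout: one "key=value" token per column).
raw = io.StringIO(
    "age=Middle-aged,education=Bachelors\n"
    "age=Middle-aged,education=Bachelors\n"
)
df = pd.read_csv(raw, sep=',', header=None)

# Column names are the text before '=' in the first row.
names = [cell.split('=', 1)[0] for cell in df.iloc[0]]

# Prefer DataFrame.map (pandas 2.1+); fall back to applymap on older versions.
elementwise = getattr(df, 'map', None) or df.applymap
values = elementwise(lambda x: x.split('=')[-1])
values.columns = names
print(values)
```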

Answer 3

There are a number of ways to do this. If you know the column names, the simplest way is to use the converters argument to read_csv(). Pass in a dict mapping column names or numbers to a function. Here the function splits the string on the '=' and returns the part on the right.

converters = {n:lambda s:s.split('=')[1] for n in range(3)}

pd.read_csv(f, converters=converters, header=None, names='age education native-country'.split())

Returns:

    age         education   native-country
0   Middle-aged Bachelors   United States
1   Middle-aged Bachelors   United States
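A self-contained sketch of the converters approach, with an in-memory buffer standing in for the file handle f. The three-column sample and the helper name value_after_equals are assumptions made for illustration. The converters are keyed by column name here rather than by position.

```python
import io

import pandas as pd

# Stand-in for the real file; three "key=value" tokens per line.
f = io.StringIO(
    "age=Middle-aged,education=Bachelors,native-country=United-States\n"
    "age=Middle-aged,education=Bachelors,native-country=United-States\n"
)
names = ['age', 'education', 'native-country']


def value_after_equals(cell):
    """Return the part of 'key=value' to the right of the first '='."""
    return cell.split('=', 1)[1]


# Apply the same converter to every named column while parsing.
converters = {name: value_after_equals for name in names}
df = pd.read_csv(f, header=None, names=names, converters=converters)
print(df)
```

This relies on the number of names matching the number of fields on each line. If the file has more fields than names, pandas may treat the leading column as the index.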

Answer 4

An alternative way to make progress on this is to make sure that the input file is a valid CSV-formatted file (if it is possible to change the format of your temp.csv file).

In a CSV file, the values in each cell are not prefixed with the column name, so the lines in the file should look like this: Middle-aged,Bachelors,United-States,White rather than this: age=Middle-aged,education=Bachelors,native-country=United-States,race=White.
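If the file cannot be fixed at its source, one option is a one-off conversion step that rewrites it as a regular CSV with a header row. A rough sketch using the standard csv module, with an in-memory list and buffer standing in for the input and output files. It assumes every line carries the same keys and that no value contains a comma.

```python
import csv
import io

import pandas as pd

# Stand-in for the lines of temp.csv.
raw_lines = [
    "age=Middle-aged,education=Bachelors,native-country=United-States,race=White",
    "age=Middle-aged,education=Bachelors,native-country=United-States,race=White",
]

# Parse each line into {column name: value}.
rows = [dict(tok.split('=', 1) for tok in line.split(',')) for line in raw_lines]

# Write a conventional CSV: one header row, then bare values.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# The cleaned text can now be read with read_csv's defaults.
buffer.seek(0)
clean = pd.read_csv(buffer)
print(clean)
```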

4 answers

Answer 1: score 1, accepted, 2020-07-29 20:07:43
Answer 2: score 0, 2020-07-29 20:10:46
Answer 3: score 0, 2020-07-29 21:41:41
Answer 4: score -1, 2020-07-29 20:11:25
