
How to read a CSV file whose list-of-lists column has some rows wrapped in double quotes as strings

The attached CSV file has four columns plus an index field.

The fourth column is a list-of-lists column. Records with a single element appear as a list, [13455], while records with multiple elements appear as quoted strings, "[13764,13455,13456]".

I want to remove the double quotes and read the last column as a list of lists only. Please suggest how to do that.

I'm also trying to find the maximum value across the whole list of lists.

In the sample case, the value I'm trying to find is 20930, which is the maximum.

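import pandas as pd
from tqdm import tqdm_notebook

full_data1 = pd.DataFrame([])
for gm_chunk1 in tqdm_notebook(pd.read_csv('CD_1000.csv', skipinitialspace=True, sep=',', quotechar='"', usecols=['ID', 'NBR', 'Day', 'CD'], chunksize=10000)):
    # Strip stray quotes from the CD column; the values are still strings here, not lists
    gm_chunk1['CD'] = gm_chunk1['CD'].apply(lambda x: x.strip('"'))
    # Collect the values of every other column into one list per ID
    gm_chunk1 = gm_chunk1.groupby(['ID'], as_index=False).agg(lambda x: list(x))
    full_data1 = pd.concat([full_data1, gm_chunk1])
    print(len(full_data1))
    print(50 * '--')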

The data has around 150 million records. I'm also doing a groupby on ID, which seems to work fine, but I then realized that the last column ended up as plain strings rather than a list of lists.

Here is one possible solution, which can be applied to the relevant column once the DataFrame has been created from the CSV:

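import pandas as pd

# Example dataframe:
df = pd.DataFrame(data={"col": [[13455], "[13764,13455,13456]"]})

# Solution
def convert_str(x):
    # Strings such as "[13764,13455,13456]" are evaluated into lists;
    # values that are already lists pass through unchanged.
    # Note: eval executes arbitrary code, so use it only on trusted input.
    if isinstance(x, str):
        return eval(x)
    else:
        return x

df["col"] = df["col"].apply(convert_str)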
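Because the strings come from a file, a more defensive variant of the same conversion could use ast.literal_eval from the standard library, which only accepts Python literals and refuses anything else. This is a minimal sketch, not the answer's original method; the helper name parse_list_cell is made up for illustration.

```python
import ast

def parse_list_cell(cell):
    """Return a list for either a real list or its string representation."""
    if isinstance(cell, str):
        # literal_eval parses literals such as "[1,2,3]" but will not run code
        return ast.literal_eval(cell.strip())
    return cell

cells = [[13455], "[13764,13455,13456]"]
parsed = [parse_list_cell(c) for c in cells]
print(parsed)
```

A string that is not a plain literal, for example a function call, raises ValueError instead of being executed.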

To get the maximum of the list of lists you can use this:

max(df["col"].apply(lambda l: max(l)))

Or an alternative just using list comprehension:

max([max(l) for l in df["col"]])
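With roughly 150 million records the file is read in chunks, so the maximum can be tracked as a running value instead of building the full DataFrame first. The following is a rough sketch under a few assumptions: the column names ID, NBR, Day and CD come from the question, the sample rows are invented for illustration (only 13455, the quoted list and 20930 appear in the question), and every list is assumed to be non-empty.

```python
import ast
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file
sample_csv = io.StringIO(
    'ID,NBR,Day,CD\n'
    '1,10,1,[13455]\n'
    '2,11,1,"[13764,13455,13456]"\n'
    '3,12,2,"[20930,15]"\n'
)

overall_max = None
# converters parses each CD cell into a real list while the chunk is read
for chunk in pd.read_csv(sample_csv, converters={"CD": ast.literal_eval}, chunksize=2):
    chunk_max = max(max(values) for values in chunk["CD"])
    overall_max = chunk_max if overall_max is None else max(overall_max, chunk_max)

print(overall_max)
```

Only one chunk is held in memory at a time, so the same loop should scale to the full file by replacing sample_csv with the real path and raising chunksize.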

In your case, part of the problem is the combination of quotechar='"' with sep=','. Without the quote character, the commas inside your lists would be treated as separators and pandas would throw an error. The file would be easier to parse with a different separator.
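The effect of the quote character can be seen on a single line using Python's standard csv module (used here only for illustration; pandas read_csv also treats '"' as the quote character by default). The sample line is hypothetical apart from the quoted list taken from the question.

```python
import csv
import io

line = '2,11,1,"[13764,13455,13456]"\n'

# Default quoting: the quoted list stays together as one field
quoted = next(csv.reader(io.StringIO(line)))

# Quoting disabled: every comma splits, including those inside the list
unquoted = next(csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE))

print(len(quoted), quoted)
print(len(unquoted), unquoted)
```

With quoting enabled the row has four fields; with quoting disabled it has six, which no longer matches a four-column header.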

Using pandas:

import pandas as pd
import io
import ast

dframe=u"""0|123|[1]
1|234|"[2,3,4]"
2|345|"[3,4,5]" """

df = pd.read_csv(io.StringIO(dframe), sep='|', header=None)

# The actual solution to apply to the right column
df[2] = df[2].map(lambda x: ast.literal_eval(x))
print(df)

Result

   0    1          2
0  0  123        [1]
1  1  234  [2, 3, 4]
2  2  345  [3, 4, 5]

Each value in the third column is now an actual list that you can iterate over.
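Putting the pieces together for the chunked groupby in the question, one possible sketch is to parse the CD column while reading, group each chunk by ID, and then merge the per-chunk results, since the same ID can appear in more than one chunk. The sample data and the variable names parts, combined and grouped are assumptions made for illustration.

```python
import ast
import io

import pandas as pd

# Hypothetical sample; ID 1 deliberately spans two chunks
sample_csv = io.StringIO(
    'ID,NBR,Day,CD\n'
    '1,10,1,[13455]\n'
    '2,12,2,"[20930,15]"\n'
    '1,11,1,"[13764,13455,13456]"\n'
)

parts = []
for chunk in pd.read_csv(sample_csv, usecols=["ID", "CD"],
                         converters={"CD": ast.literal_eval}, chunksize=2):
    # Within a chunk, collect the parsed lists of each ID into a list of lists
    parts.append(chunk.groupby("ID", as_index=False)["CD"].agg(list))

combined = pd.concat(parts, ignore_index=True)

# Merge the partial lists of lists for IDs that were split across chunks
grouped = combined.groupby("ID", as_index=False)["CD"].agg(
    lambda partials: [inner for partial in partials for inner in partial]
)
print(grouped)
```

Each CD value in grouped is a list of lists of integers rather than a list of strings.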
