
Trying to import all columns from a csv with an object data type with pandas

I'm trying to read a CSV into a new DataFrame with pandas. Some of the columns may contain only numeric values, but I still want them imported as strings/objects rather than as float columns.

I'm writing some Python scripts for data conversion/migration. I'm not an advanced Python programmer and am mostly learning as I come across problems that need solving.

The CSVs I am importing have a varying number of columns, with different column titles in any order, and I have no control over any of that. So I can't list a data type for each column by name using the dtype parameter of read_csv. I just want every imported column to be treated as an object data type so I can analyse it further for data quality.

Examples are the 'Staff ID' and 'License Number' columns in one CSV I tried. They should be string fields holding 7-digit IDs, but they are imported as type float64.

I have tried using astype with read_csv, and applymap on the DataFrame after import.

Note that there is no hard and fast rule about the type or quality of the data, which is why I always want to import every column with a dtype of object.

Thanks in advance for anyone who can help me figure this out.

I've used the following code to read it in.

import pandas as pd
df = pd.read_csv("agent.csv",encoding="ISO-8859-1")

This creates the 'License Number' column in df with a type of float64 (among others).

Here is an example of License Number which should be a string:

'1275595' being stored as 1275595.0

Converting the column back to a string/object in df after the import gives '1275595.0', not the original '1275595'.

It should stop converting data.

pd.read_csv(..., dtype=str)

Doc: read_csv

dtype: ...  Use str or object together with suitable na_values settings 
            to preserve and not interpret dtype. 
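As a minimal sketch of the suggestion above: the sample data and the helper name read_all_as_text are made up for illustration, and io.StringIO stands in for the real agent.csv. Passing keep_default_na=False is one possible "suitable na_values setting"; it assumes you want blank cells kept as empty strings rather than NaN.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for agent.csv; one License Number is blank.
SAMPLE_CSV = (
    "Staff ID,License Number,Name\n"
    "0012345,1275595,Ann\n"
    "0012346,,Bob\n"
)

def read_all_as_text(source, **kwargs):
    """Read a CSV with every column kept as text, whatever its name or position."""
    # dtype=str turns off type inference for all columns;
    # keep_default_na=False keeps blank cells as '' instead of NaN.
    return pd.read_csv(source, dtype=str, keep_default_na=False, **kwargs)

# Default behaviour: pandas infers numeric types, so the IDs are altered.
inferred = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Everything read as text: the digits arrive exactly as written in the file.
as_text = read_all_as_text(io.StringIO(SAMPLE_CSV))

print(inferred.dtypes)
print(as_text.dtypes)
print(as_text.loc[0, "License Number"])
```

Because the single dtype applies to every column, no column names need to be known in advance. For the real file, the call would look like read_all_as_text("agent.csv", encoding="ISO-8859-1").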

I recommend you split your csv reading process into multiple, specific-purpose functions.

For example:

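import pandas as pd

# Base function for reading a csv. All the parsing/formatting is done here.
def read_csv(file_content, header=False, columns=None, encoding='utf-8'):
    # pandas rejects a bool for header; it expects a row number or None.
    header_row = 0 if header else None
    df = pd.read_csv(file_content, header=header_row, encoding=encoding)
    # Only rename when names are given; the count must match the file's columns.
    if columns is not None:
        df.columns = columns
    return df

# Function with a specific purpose as stated in the name.
def read_csv_license_plates(file_content, encoding='utf-8'):
    columns = ['col1', 'col2', 'col3']  # placeholder names
    # Pass the encoding through so the caller's choice is not silently ignored.
    return read_csv(file_content, header=True, columns=columns, encoding=encoding)

df = read_csv_license_plates('agent.csv', encoding='ISO-8859-1')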


 