
How to read a CSV file that has both a column delimiter and a record delimiter

My CSV file has 3 columns (Name, Age and Sex). Sample data:

AlexÇ39ÇM
#Ç#SheebaÇ35ÇF
#Ç#RiyaÇ10ÇF

The column delimiter is 'Ç' and the record delimiter is '#Ç#'. Note that the first record does not have the record delimiter (#Ç#), but all other records do. Could you please tell me how to read this file and store it in a DataFrame?

Both the csv and pandas modules support reading CSV files directly. However, since you need to modify the file contents line by line before further processing, I suggest reading the file line by line, modifying each line as needed, and storing the processed data in a list for further handling.

The necessary steps include:

  • open file
  • read file line by line
  • remove the newline character (which is part of each line when using readlines())
  • remove the record delimiter (each record already sits on its own line, so the marker carries no extra information)
  • split lines at column delimiter
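To make the per-line steps concrete, here is a minimal sketch that applies them to one sample line from the question, held in memory. It assumes the line has already been decoded correctly, so the original 'Ç' and '#Ç#' delimiters are used as-is. The variable names are only illustrative.

```python
# Sample line copied from the question, including the trailing newline
raw = '#Ç#SheebaÇ35ÇF\n'

stripped = raw.strip()                    # '#Ç#SheebaÇ35ÇF'
no_marker = stripped.replace('#Ç#', '')   # 'SheebaÇ35ÇF'
fields = no_marker.split('Ç')             # ['Sheeba', '35', 'F']

print(fields)
```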

Since .split() returns a list of strings, we end up with a list of lists, where each sub-list holds the data of one line/record. Data in this shape can be read by pandas.DataFrame.from_records(), which comes in quite handy at this point:

import pandas as pd

with open('myData.csv') as file:
    # `.strip()` removes the newline character from each line
    # `.replace('#;#', '')` removes the record delimiter '#;#' from each line
    # `.split(';')` splits at the column delimiter and returns a list of strings
    # `if line.strip()` skips blank lines, which would otherwise produce a one-element record
    lines = [line.strip().replace('#;#', '').split(';') for line in file if line.strip()]

df = pd.DataFrame.from_records(lines, columns=['Name', 'Age', 'Sex'])

print(df)
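The example above uses ';' in place of the file's actual delimiters. If the file's encoding is known, the original 'Ç' and '#Ç#' can probably be kept by passing that encoding to open(). The sketch below assumes UTF-8, which is only a guess. The real file may well use something else (Latin-1, for instance). It writes the sample data to a temporary file first so that it runs on its own. COL_SEP, REC_SEP and ENCODING are illustrative names.

```python
import os
import tempfile

COL_SEP = 'Ç'
REC_SEP = '#Ç#'
ENCODING = 'utf-8'  # assumption: replace with the file's real encoding

# Sample data from the question
sample = 'AlexÇ39ÇM\n#Ç#SheebaÇ35ÇF\n#Ç#RiyaÇ10ÇF\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'myData.csv')

    # Create the sample file so the example is self-contained
    with open(path, 'w', encoding=ENCODING) as f:
        f.write(sample)

    # Read it back with the same encoding and the original delimiters
    with open(path, encoding=ENCODING) as f:
        records = [
            line.strip().replace(REC_SEP, '').split(COL_SEP)
            for line in f
            if line.strip()
        ]

print(records)
```

The resulting list of lists can be handed to pd.DataFrame.from_records(records, columns=['Name', 'Age', 'Sex']) in the same way as in the code above.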

Remarks:

  1. I replaced Ç with ; because Ç caused encoding issues on my machine. The basic idea of the algorithm stays the same.

  2. Reading data manually like this can become quite resource-intensive, which might be a problem with larger files. There may be more elegant ways that I am not aware of. If you run into performance problems, try reading the file in chunks or look for a more efficient implementation.
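One possible way to follow the "read in chunks" suggestion is to parse lines lazily with a generator and group the records into fixed-size batches, so that only one batch is held in memory at a time. This is a sketch, not a tuned implementation. iter_records and iter_batches are hypothetical helper names, and an io.StringIO object stands in for the open file.

```python
import io
from itertools import islice

def iter_records(lines, col_sep=';', rec_sep='#;#'):
    """Yield one list of fields per non-empty line, lazily."""
    for line in lines:
        line = line.strip()
        if line:
            yield line.replace(rec_sep, '').split(col_sep)

def iter_batches(records, size):
    """Group an iterator of records into lists of at most `size` items."""
    records = iter(records)
    while True:
        batch = list(islice(records, size))
        if not batch:
            return
        yield batch

# In-memory stand-in for an open file, with ';' delimiters as in remark 1
sample = io.StringIO('Alex;39;M\n#;#Sheeba;35;F\n#;#Riya;10;F\n')
batches = list(iter_batches(iter_records(sample), size=2))
print(batches)
```

With a real file, each batch could be converted with pd.DataFrame.from_records(batch, columns=['Name', 'Age', 'Sex']) and processed before the next one is read. Collecting all batches into a list, as done here for display, gives up the memory benefit.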

