My CSV file has 3 columns: Name, Age and Sex. Sample data:
AlexÇ39ÇM #Ç#SheebaÇ35ÇF #Ç#RiyaÇ10ÇF
The column delimiter is 'Ç' and the record delimiter is '#Ç#'. Note that the first record doesn't have the record delimiter (#Ç#), but all other records do. Could you please tell me how to read this file and store it in a dataframe?
Both the csv and pandas modules support reading CSV files directly. However, since each line needs to be modified before further processing, I suggest reading the file line by line, modifying each line as needed, and storing the processed data in a list for further handling.
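The necessary steps are: read all lines with readlines(), remove the newline character from each line with .strip(), remove the record delimiter with .replace(), and split each line at the column delimiter with .split().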
Since .split() returns a list of strings, the result is a list of lists, where each sub-list holds the data of one line/record. Data in this shape can be read by pandas.DataFrame.from_records(), which comes in handy here:
import pandas as pd

with open('myData.csv') as file:
    # `.strip()` removes the newline character from each line
    # `.replace('#;#', '')` removes '#;#' from each line
    # `.split(';')` splits at the given string and returns a list of the string elements
    lines = [line.strip().replace('#;#', '').split(';') for line in file.readlines()]

df = pd.DataFrame.from_records(lines, columns=['Name', 'Age', 'Sex'])
print(df)
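If the records are not on separate lines (the sample in the question can be read either way), a variant that splits the whole text on the record delimiter might be more robust. The following is only a sketch under a few assumptions: it keeps the original 'Ç' delimiter, the helper name `parse_records` is made up for illustration, and no field is assumed to contain either delimiter.

```python
FIELD_SEP = "Ç"      # column delimiter from the question
RECORD_SEP = "#Ç#"   # record delimiter from the question


def parse_records(text):
    """Split raw text into a list of field lists (hypothetical helper)."""
    records = []
    # Split on the record delimiter first, because it contains the column delimiter.
    for piece in text.split(RECORD_SEP):
        piece = piece.strip()  # drop spaces/newlines around each record
        if piece:              # skip empty pieces, e.g. from a trailing delimiter
            records.append(piece.split(FIELD_SEP))
    return records


sample = "AlexÇ39ÇM #Ç#SheebaÇ35ÇF #Ç#RiyaÇ10ÇF"
records = parse_records(sample)
print(records)
```

For a real file, the text would come from something like `open('myData.csv', encoding='utf-8').read()` (the encoding is a guess and has to match the file), and the resulting list can be passed to `pd.DataFrame.from_records(records, columns=['Name', 'Age', 'Sex'])` exactly as above.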
Remarks:
I changed Ç to ; because that worked better for me due to encoding issues. However, the basic idea of the algorithm is still the same.
Reading data manually like this can become quite resource-intensive, which might be a problem when handling larger files. There may be more elegant ways that I am not aware of. If you run into performance problems, try reading the file in chunks or look for a more efficient implementation.
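One possible way to read in chunks is a generator that keeps only the unfinished tail of the text in memory. This is a rough sketch, not a tuned implementation: the name `iter_records` and the default chunk size are assumptions, the stream is assumed to be opened in text mode, and fields are assumed never to contain either delimiter.

```python
import io

FIELD_SEP = "Ç"      # column delimiter from the question
RECORD_SEP = "#Ç#"   # record delimiter from the question


def iter_records(stream, chunk_size=64 * 1024):
    """Yield one list of fields per record, reading the stream chunk by chunk."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Pieces before the last delimiter are complete records;
        # the last piece may be cut off, so it stays in the buffer.
        *complete, buffer = buffer.split(RECORD_SEP)
        for record in complete:
            record = record.strip()
            if record:
                yield record.split(FIELD_SEP)
    # Whatever is left after the final chunk is the last record.
    buffer = buffer.strip()
    if buffer:
        yield buffer.split(FIELD_SEP)


# Tiny chunk size on in-memory data, only to show that delimiters
# split across chunk boundaries are still handled.
demo = io.StringIO("AlexÇ39ÇM #Ç#SheebaÇ35ÇF #Ç#RiyaÇ10ÇF")
rows = list(iter_records(demo, chunk_size=4))
print(rows)
```

Because the generator yields records one at a time, they can be collected in batches and turned into dataframes piece by piece instead of holding the whole file in a single list.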