
How to open a CSV in Python?

I have a dataset in the following format:

row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;path_id_set;traffic_type;session_durantion;hits
"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\\N"
"988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14"
"988679;L4;Saturday;21;2;2100;""0;78464"";1;1892;14"
"988678;L3;Saturday;19;8;2113;51462;6;0;1;\\N"

I want it to be in the following format:

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N

I tried the following code:

import pandas as pd

df = pd.read_csv("C:\Users\Rahhy\Desktop\trivago.csv", delimiter = ";")

But I am getting an error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
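
The SyntaxError comes from the path literal, not from the CSV itself: in a normal Python string, `\U` starts an 8-digit Unicode escape (and `\t` in `\trivago.csv` would be read as a tab). A minimal sketch of three equivalent ways to write the path from the question; the `read_csv` call is left as a comment because it needs the actual file to exist:

```python
from pathlib import PureWindowsPath

# Raw string: the r prefix keeps every backslash literal
raw_path = r"C:\Users\Rahhy\Desktop\trivago.csv"

# Normal string with each backslash doubled
escaped_path = "C:\\Users\\Rahhy\\Desktop\\trivago.csv"

# Forward slashes are also accepted as separators on Windows
slash_path = "C:/Users/Rahhy/Desktop/trivago.csv"

# All three spellings name the same file
assert raw_path == escaped_path
assert PureWindowsPath(raw_path) == PureWindowsPath(slash_path)

# With any of them the original call no longer raises the SyntaxError:
# df = pd.read_csv(raw_path, delimiter=";")
```

This only fixes the path; the quoting inside the file still has to be handled separately.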

Using replace():

with open("data_test.csv", "r") as fileObj:
    contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
print(contents)

OUTPUT :

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N

EDIT:

You can open the file, read its contents, replace the unwanted characters, write the new contents back to the file, and then read it with pd.read_csv:

with open("data_test.csv", "r") as fileObj:
    contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
# print(contents)

with open("data_test.csv", "w+") as fileObj2:
    fileObj2.write(contents)

import pandas as pd
# The cleaned file is space-separated, so the separator has to be given explicitly
df = pd.read_csv("data_test.csv", sep=" ", index_col=False)
print(df)

OUTPUT :

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
import pandas as pd
from io import StringIO

# Load the file to a string (prefix r (raw) to not use \ for escaping)
filename = r'c:\temp\x.csv'
with open(filename, 'r') as file:
    raw_file_content = file.read()

# Remove the quotes which break the CSV file
file_content_without_quotes = raw_file_content.replace('"','')

# Simulate a file with the corrected CSV content
simulated_file = StringIO(file_content_without_quotes)

# Get the CSV as a table with pandas
# Since the first field in each data row shall not be used for indexing we need to set index_col=False
csv_data = pd.read_csv(simulated_file, delimiter = ';', index_col=False)
print(csv_data['hits']) # print some column
csv_data

Since there are 11 data fields but only 10 headers, only the first 10 fields are used. You'll have to decide what to do with the last one (values: \\N, 14).

Output:

0    7037
1      49
2    1892
3       1
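
The 11-versus-10 field mismatch arises because removing the quotes also splits the quoted path_id_set value (for example "31672;0") at its inner semicolon. As a rough alternative, assuming every data row is wrapped in one outer pair of quotes with the inner quotes doubled, as in the sample, the file can be parsed in two passes with the standard csv module. The function name `parse_wrapped_csv` is made up for this sketch, and only the first three sample rows are used because the fourth row, as written in the question, still splits into 11 fields:

```python
import csv

import pandas as pd

# Header plus the first three data rows from the question (raw strings keep \\N as-is)
SAMPLE_LINES = [
    r'row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;'
    r'path_id_set;traffic_type;session_durantion;hits',
    r'"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\\N"',
    r'"988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14"',
    r'"988679;L4;Saturday;21;2;2100;""0;78464"";1;1892;14"',
]


def parse_wrapped_csv(lines):
    """Hypothetical helper: unwrap each quoted row, then split it on ';'."""
    # Pass 1: the default dialect strips the outer quotes and turns "" into "
    unwrapped = [row[0] for row in csv.reader(lines) if row]
    # Pass 2: split on ';' while respecting the quotes around path_id_set
    rows = list(csv.reader(unwrapped, delimiter=";"))
    header, records = rows[0], rows[1:]
    return pd.DataFrame(records, columns=header)


df = parse_wrapped_csv(SAMPLE_LINES)
print(df)
```

With a real file, the same function could be given the open file object instead of the list. All values come back as strings, so numeric columns and the \\N placeholder would still need converting afterwards.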


See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

