简体   繁体   中英

How to convert .dat to .csv using python?

I have a file.dat which looks like:

id       | user_id | venue_id | latitude  | longitude | created_at

---------+---------+----------+-----------+-----------+-----------------

984301   |2041916  |5222      |           |           |2012-04-21 17:39:01

984222   |15824    |5222      |38.8951118 |-77.0363658|2012-04-21 17:43:47

984315   |1764391  |5222      |           |           |2012-04-21 17:37:18

984234   |44652    |5222      |33.800745  |-84.41052  | 2012-04-21 17:43:43

I need to get csv file with deleted empty latitude and longtitude rows, like:

id,user_id,venue_id,latitude,longitude,created_at

984222,15824,5222,38.8951118,-77.0363658,2012-04-21T17:43:47

984234,44652,5222,33.800745,-84.41052,2012-04-21T17:43:43

984291,105054,5222,45.5234515,-122.6762071,2012-04-21T17:39:22

I try to do that, using next code:

with open('file.dat', 'r') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        newLine = line.strip('|').split()
        newLines.append(newLine)

with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)

But all the same I get a csv file with "|"symbols and empty latitude/longtitude rows. Where is mistake? In general I need to use resulting csv-file in DateFrame, so maybe there is some way to reduce number of actions.

str.strip() removes leading and trailing characters from a string.
You want to split the lines on "|", then strip each element of the resulting list:

import csv

with open('file.dat') as dat_file, open('file.csv', 'w') as csv_file:
    csv_writer = csv.writer(csv_file)

    for line in dat_file:
        row = [field.strip() for field in line.split('|')]
        if len(row) == 6 and row[3] and row[4]:
            csv_writer.writerow(row)

Use this:

data = pd.read_csv('file.dat', sep='|', header=0, skipinitialspace=True)
data.dropna(inplace=True)

I used standard python features without pre-processing data. I got an idea from one of the previous answers and improved it. If data headers contain spaces (it is often the situation in CSV) we should determine column names by ourselves and skip line 1 with headers. After that, we can remove NaN values only by specific columns.

data = pd.read_csv("checkins.dat", sep='|', header=None, skiprows=1,
                   low_memory = False, skipinitialspace=True,
                   names=['id','user_id','venue_id','latitude','longitude','created_at'])
data.dropna(subset=['latitude', 'longitude'], inplace = True)

Using split() without parameters will result in splitting after a space example "test1 test2".split() results in ["test1", "test2"]

instead, try this:

newLine = line.split("|")

Maybe it's better to use a map() function instead of list comprehensions as it must be working faster. Also writing a csv-file is easy with csv module.

import csv
with open('file.dat', 'r') as fin:
with open('file.csv', 'w') as fout:
    for line in fin:
        newline = map(str.strip, line.split('|'))
        if len(newline) == 6 and newline[3] and newline[4]:
            csv.writer(fout).writerow(newline)
with open("filename.dat") as f:
    with open("filename.csv", "w") as f1:
        for line in f:
            f1.write(line)

This can be used to convert a .dat file to .csv file

Combining previous answers I wrote my code for Python 2.7 :

import csv

lat_index = 3
lon_index = 4
fields_num = 6
csv_counter = 0

with open("checkins.dat") as dat_file:
    with open("checkins.csv", "w") as csv_file:
        csv_writer = csv.writer(csv_file)
        for dat_line in dat_file:
            new_line = map(str.strip, dat_line.split('|'))
            if len(new_line) == fields_num and new_line[lat_index] and new_line[lon_index]:
                csv_writer.writerow(new_line)
                csv_counter += 1

print("Done. Total rows written: {:,}".format(csv_counter))

这对我有用:

data = pd.read_csv('file.dat',sep='::',names=list_for_names_of_columns)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM