
How to create a csv file from a txt file with column separator after “x” amount of characters

I have a txt file that looks like this:

MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area  
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area

I need to create a CSV file from this. I have little experience with that; I have been learning as I go, though likely not efficiently.

My issue right now is that when I use pandas, it splits the columns at the ",". What I need is for the column separator to come right after the code on the left, e.g. "MT0113820000000". The codes do change, but they are all the same length.

Thanks in advance, I know this is a really noobie question.

Here's my code currently:

import pandas as pd

dataframe1 = pd.read_csv("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")  
dataframe1.to_csv('output_.csv', index = None)

And the output:

COLUMN 1                                COLUMN 2
MT0111500000000 Anniston-Oxford-Jacksonville     | AL Metropolitan Statistical Area

Alternatively, use read_fwf (which reads fixed-width formatted lines), as suggested in a comment:

from io import StringIO
import pandas as pd

testdata = '''\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area
'''

buff = StringIO(testdata)

df = pd.read_fwf(buff, header=None, colspecs=[(0, 15), (16, 64 * 1024)])

print(df.to_csv(index=False, columns=[0, 1], header=['COLUMN1', 'COLUMN2']))
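The colspecs in the read_fwf call are half-open (start, end) character ranges: (0, 15) covers the 15-character code, and the second range starts at 16 to skip the separating space. As a rough sketch of the same idea without pandas, assuming every code really is exactly 15 characters wide (the helper name split_fixed and the constant CODE_WIDTH are made up for this illustration), plain slicing would look something like this:

```python
CODE_WIDTH = 15  # assumption: every code is exactly this many characters


def split_fixed(line, width=CODE_WIDTH):
    """Split a line into (code, rest) at a fixed character offset."""
    line = line.rstrip("\n")
    code = line[:width]
    # Everything after the code, minus the separating space and any padding.
    rest = line[width:].strip()
    return code, rest


sample = "MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area\n"
code, name = split_fixed(sample)
print(code)
print(name)
```

If the codes ever stop being the same length, this breaks silently, which is one reason to prefer splitting at the first space instead.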

That's not a CSV, and I don't see a convenient way of convincing read_csv to do the right thing. Luckily, there seems to be an easy rule here: the code is everything before the first space, and the name is everything after it. str.split with a limit of one split does that.

import pandas as pd
from pathlib import Path

#in_file = Path("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")
in_file = Path("test.txt")
out_file = in_file.with_name(in_file.stem + "_").with_suffix(".csv")

# test data
open(in_file, "w").write("""\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area  
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area""")

# convert to csv: split each line once, at the first space
pd.DataFrame([line.strip().split(" ", 1) for line in open(in_file)],
    columns=["COLUMN1", "COLUMN2"]).to_csv(out_file, index=False, header=False)

# visual verification
print(open(out_file).read())

Output

MT0111500000000,"Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area"
MT0112220000000,"Auburn-Opelika, AL Metropolitan Statistical Area"
MT0113820000000,"Birmingham-Hoover, AL Metropolitan Statistical Area"

In this example I wrote the CSV immediately, so the DataFrame is never bound to a name and is freed as soon as the statement finishes. You could also do this with the csv module, writing one line at a time. That uses less memory because it doesn't have to hold the entire file in memory, and since csv is part of the standard Python library, there is no external dependency on pandas. The same approach with the csv module, including the bit of file name handling:

import csv
from pathlib import Path

#in_file = Path("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")
in_file = Path("test.txt")
out_file = in_file.with_name(in_file.stem + "_").with_suffix(".csv")

# test data
open(in_file, "w").write("""\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area  
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area""")

# convert to csv
# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open(in_file) as infp, open(out_file, "w", newline="") as outfp:
    writer = csv.writer(outfp)
    writer.writerows(line.strip().split(" ",1) for line in infp)

# visual verification
print(open(out_file).read())
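To check that the comma inside the area name did not turn into an extra column, the CSV can be read back with csv.reader. The sketch below writes to an in-memory buffer instead of the files used above so that it runs on its own; the sample rows are taken from the question.

```python
import csv
from io import StringIO

lines = [
    "MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area\n",
    "MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area  \n",
]

# Write the rows with the same split rule, but into a string buffer.
buffer = StringIO(newline="")
csv.writer(buffer).writerows(line.strip().split(" ", 1) for line in lines)

# Read the CSV text back; each quoted name should come back as one field.
buffer.seek(0)
rows = list(csv.reader(buffer))
for row in rows:
    print(len(row), row)
```

Each row should come back with two fields, because the writer quotes any field that contains the delimiter.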

You can split the data at the first occurrence of the whitespace:

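import pandas as pd

# The file has no tabs, so read_table yields a single column (0); split it once at the first space.
# (The squeeze argument was removed in pandas 2.0, and n must now be passed by keyword.)
data = pd.read_table("data.txt", header=None)[0].str.strip().str.split(" ", n=1)
df = pd.DataFrame(data.tolist(), columns=["column1", "column2"])

df.to_csv("df.csv", index=False)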
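The limit of one split is what keeps the area name intact: without it, every space in the name would start a new field. A small illustration on one of the sample lines, using only built-in string methods:

```python
line = "MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area"

# No limit: the name is broken up at every space.
all_parts = line.split(" ")
print(len(all_parts))

# Limit of 1: a single cut at the first space, so exactly two fields.
code, name = line.split(" ", 1)
print(code)
print(name)
```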
