I have a txt file that looks like this:
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area
I need to create a CSV file from this. I have little experience with that, but I have been learning as I go, although probably not efficiently.
My issue right now is that when I use pandas, it splits the columns at the ",". What I need is for the column separator to come right after the code on the left, e.g. "MT0113820000000". The codes do change, but they are all the same length.
Thanks in advance, I know this is a real newbie question.
Here's my code currently:
import pandas as pd
dataframe1 = pd.read_csv("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")
dataframe1.to_csv('output_.csv', index = None)
And the output:
COLUMN 1 COLUMN 2
MT0111500000000 Anniston-Oxford-Jacksonville | AL Metropolitan Statistical Area
Alternatively, using read_fwf, as mentioned in a comment above:
from io import StringIO
import pandas as pd
testdata = '''\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area
'''
buff = StringIO(testdata)
df = pd.read_fwf(buff, header=None, colspecs=[(0, 15), (16, 64 * 1024)])
print(df.to_csv(index=False, columns=[0, 1], header=['COLUMN1', 'COLUMN2']))
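Since the question says the codes all have the same length, the fixed-width idea can also be sketched with plain string slicing, without pandas. This is only an illustration: the width of 15 is an assumption read off the sample rows, and split_fixed is a hypothetical helper name, not part of any library.

```python
# Assumed: every code is exactly 15 characters wide, as in the sample rows.
CODE_WIDTH = 15


def split_fixed(line, width=CODE_WIDTH):
    """Return (code, name) by cutting the line at a fixed column."""
    line = line.rstrip("\n")
    # Everything up to `width` is the code; the rest (minus the
    # separating space) is the area name, commas included.
    return line[:width], line[width:].strip()


sample = "MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area"
code, name = split_fixed(sample)
print(code)
print(name)
```

If the codes ever vary in length, splitting on the first space is the safer rule.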
That's not a CSV, and I don't see a convenient way of convincing read_csv to do the right thing. Luckily, there seems to be an easy rule here: the first column is everything before the first space, and the second column is everything after it. str.split with a maximum of one split does that.
import pandas as pd
from pathlib import Path
#in_file = Path("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")
in_file = Path("test.txt")
out_file = in_file.with_name(in_file.stem + "_").with_suffix(".csv")
# test data
open(in_file, "w").write("""\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area""")
# convert to csv
# split each line once, at the first space, and write without index or header row
pd.DataFrame([line.strip().split(" ", 1) for line in open(in_file)],
             columns=["COLUMN1", "COLUMN2"]).to_csv(out_file, index=False, header=False)
# visual verification
print(open(out_file).read())
Output
MT0111500000000,"Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area"
MT0112220000000,"Auburn-Opelika, AL Metropolitan Statistical Area"
MT0113820000000,"Birmingham-Hoover, AL Metropolitan Statistical Area"
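Beyond eyeballing the output, one way to check it is to read the quoted lines back with the standard csv module and confirm that each row has exactly two fields. This is a minimal sketch that uses an in-memory copy of two of the output lines rather than the real file.

```python
import csv
import io

# In-memory copy of two lines from the output above.
csv_text = (
    'MT0111500000000,"Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area"\n'
    'MT0112220000000,"Auburn-Opelika, AL Metropolitan Statistical Area"\n'
)

# csv.reader honours the quotes, so the comma inside the name
# does not start a third column.
rows = list(csv.reader(io.StringIO(csv_text)))
for row in rows:
    print(len(row), row)
```

The quotes are what keep the comma in the area name from being read as a separator.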
In this example I wrote the CSV immediately, so the DataFrame is never assigned to a name and can be freed as soon as to_csv returns. You could also do this with the csv module, writing a line at a time. That uses less memory because it doesn't have to hold the entire file in memory. And since csv is part of the standard Python library, there is no external dependency on pandas. The version below keeps the same bit of file name handling:
import csv
from pathlib import Path
#in_file = Path("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")
in_file = Path("test.txt")
out_file = in_file.with_name(in_file.stem + "_").with_suffix(".csv")
# test data
open(in_file, "w").write("""\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area""")
# convert to csv
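# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open(in_file) as infp, open(out_file, "w", newline="") as outfp:
    writer = csv.writer(outfp)
    # split each line once, at the first space
    writer.writerows(line.strip().split(" ", 1) for line in infp)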
# visual verification
print(open(out_file).read())
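The same line-at-a-time conversion can be sketched as a small function that works on any file-like objects, which makes it easy to try without touching the disk. The name convert is hypothetical, and lineterminator="\n" is set here only to make the in-memory output predictable (csv.writer defaults to "\r\n").

```python
import csv
import io


def convert(lines, out):
    """Write each 'CODE rest of line' input line as a two-field CSV row."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of writing empty rows
        # Split once, at the first space; csv.writer adds quotes
        # around the second field because it contains a comma.
        writer.writerow(line.split(" ", 1))


src = io.StringIO("MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue())
```

With real files, pass the two handles from the with statement above in place of the StringIO objects.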
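You can split the data at the first space:
# read each line as a single column (the data has no tabs), then split once
# note: the squeeze= argument was removed in pandas 2.0, so select column 0 instead
data = pd.read_table("data.txt", header=None)[0].str.split(" ", n=1)
df = pd.DataFrame(data.tolist(), columns=["column1", "column2"])
df.to_csv("df.csv", index=False)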