How to create a csv file from a txt file with column separator after “x” amount of characters
I have a txt file that looks like this:
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area
I need to create a CSV file from this. I have little experience with it, but I have been learning as I go, though likely not efficiently.
My issue right now is that when I use pandas, it starts a new column after the ",". What I need is for the column separator to come right after the code on the left, e.g. "MT0113820000000". The codes change from line to line, but they are all the same length.
Thanks in advance, I know this is a really noobie question.
Here's my code currently:
import pandas as pd
dataframe1 = pd.read_csv("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")
dataframe1.to_csv('output_.csv', index = None)
And the output:
COLUMN 1 COLUMN 2
MT0111500000000 Anniston-Oxford-Jacksonville | AL Metropolitan Statistical Area
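The split shown above follows from comma-delimited parsing: the only comma in each line is the one before "AL". A minimal sketch of that behavior, using the standard-library csv module as a stand-in for the default comma separator of read_csv (the stand-in is an assumption made for illustration, not the pandas parser itself):

```python
import csv

line = "MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area"

# A comma-delimited parser sees exactly one delimiter here: the comma in ", AL".
# The space after the 15-character code is just ordinary text to it.
fields = next(csv.reader([line]))

print(len(fields))
for field in fields:
    print(repr(field))
```

The code and the first half of the area name land in the same field, which matches the two columns in the output.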
Alternatively, using read_fwf as mentioned in a comment above:
from io import StringIO
import pandas as pd
testdata = '''\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area
'''
buff = StringIO(testdata)
df = pd.read_fwf(buff, header=None, colspecs=[(0, 15), (16, 64 * 1024)])
print(df.to_csv(index=False, columns=[0, 1], header=['COLUMN1', 'COLUMN2']))
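To make the fixed-width idea concrete without pandas, here is a rough sketch that cuts each line at a fixed character offset. The constant CODE_WIDTH and the helper split_fixed are hypothetical names introduced for this illustration, and the width of 15 is taken from the sample codes in the question; it only holds if every code really has that length.

```python
# Assumption: every code is exactly 15 characters long, as in the sample data.
CODE_WIDTH = 15


def split_fixed(line: str, width: int = CODE_WIDTH) -> tuple[str, str]:
    """Split a line into (code, name) at a fixed character offset."""
    line = line.rstrip("\n")
    # Everything up to `width` is the code; the rest, minus the
    # separating space, is the area name (commas included).
    return line[:width], line[width:].lstrip()


sample = "MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area"
code, name = split_fixed(sample)
print(code)
print(name)
```

Slicing by position never looks at the comma, so the area name stays in one piece.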
That's not a CSV, and I don't see a convenient way of convincing read_csv to do the right thing. Luckily, there seems to be an easy rule here: take the stuff before the first space, then the stuff after it. str.split with a maximum of one split does that.
import pandas as pd
from pathlib import Path
#in_file = Path("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")
in_file = Path("test.txt")
out_file = in_file.with_name(in_file.stem + "_").with_suffix(".csv")
# test data
open(in_file, "w").write("""\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area""")
# convert to csv
pd.DataFrame([line.strip().split(" ", 1) for line in open(in_file)],
             columns=["COLUMN1", "COLUMN2"]).to_csv(out_file, index=False, header=False)
# visual verification
print(open(out_file).read())
Output
MT0111500000000,"Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area"
MT0112220000000,"Auburn-Opelika, AL Metropolitan Statistical Area"
MT0113820000000,"Birmingham-Hoover, AL Metropolitan Statistical Area"
In this example I wrote the CSV immediately, so the dataframe is never bound to a name and can be freed from memory right away. You could also do this with the csv module, writing one line at a time. That uses less memory because it doesn't have to hold the entire file in memory. And since csv is part of the standard Python library, there is no external dependency on pandas. Adding a bit of file name handling:
import csv
from pathlib import Path
#in_file = Path("C:/Users/andre/Desktop/bea_api_test/python-bureau-economic-analysis-api-client/testttt/output.txt")
in_file = Path("test.txt")
out_file = in_file.with_name(in_file.stem + "_").with_suffix(".csv")
# test data
open(in_file, "w").write("""\
MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area
MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area
MT0113820000000 Birmingham-Hoover, AL Metropolitan Statistical Area""")
# convert to csv
# newline="" lets the csv module control line endings itself
with open(in_file) as infp, open(out_file, "w", newline="") as outfp:
    writer = csv.writer(outfp)
    writer.writerows(line.strip().split(" ", 1) for line in infp)
# visual verification
print(open(out_file).read())
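The quotes around the second field in the output above are expected: the area name contains a comma, so the writer quotes it to keep it in one column. A small sketch of a round trip through an in-memory buffer (used here instead of a file purely for illustration) suggests that a CSV reader recovers exactly two fields per row:

```python
import csv
from io import StringIO

lines = [
    "MT0111500000000 Anniston-Oxford-Jacksonville, AL Metropolitan Statistical Area",
    "MT0112220000000 Auburn-Opelika, AL Metropolitan Statistical Area",
]

# Write the rows the same way as above, but into an in-memory buffer.
buf = StringIO()
writer = csv.writer(buf)
writer.writerows(line.split(" ", 1) for line in lines)

# Read the CSV text back and check how many fields each row has.
buf.seek(0)
rows = list(csv.reader(buf))
for row in rows:
    print(len(row), row)
```

Each row comes back with two fields, and the comma survives inside the second one.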
You can split the data at the first occurrence of whitespace:
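import pandas as pd
# read_table splits on tabs by default, so (assuming the file has no tabs) each line becomes one string.
# The squeeze keyword of read_table was removed in pandas 2.0; DataFrame.squeeze("columns") replaces it.
data = pd.read_table("data.txt", header=None).squeeze("columns").str.split(" ", n=1)
df = pd.DataFrame(data.tolist(), columns=["column1", "column2"])
df.to_csv("df.csv", index=False)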