Read a text file and split it into multiple files based on the unique code in the first column
I need to read a text file and split it into multiple files based on the unique code in its first column. The column structure differs from record to record, depending on that code.
The text file is comma-separated.
Sample input file structure
"05555", "AB", "CC", "DD", "EE", "USA"
"05555", "AB", "CC", "DD", "EE", "CA"
"05555", "AB", "CC", "DD", "EE", "NY"
"0666666", "AB", "CC", "DD", "EE", "NY", "123", "567", "888"
"0666666", "AB", "CC", "DD", "EE", "USA", "123", "567", "999"
I would like to split the above text file into separate text files, one per unique code in the first column.
I expect two files with the following data:
File1
"05555", "AB", "CC", "DD", "EE", "USA"
"05555", "AB", "CC", "DD", "EE", "CA"
"05555", "AB", "CC", "DD", "EE", "NY"
File2
"0666666", "AB", "CC", "DD", "EE", "NY", "123", "567", "888"
"0666666", "AB", "CC", "DD", "EE", "USA", "123", "567", "999"
Note: Because the structure is different for each code, I'm not able to read the data into a pandas DataFrame.
Your question has two parts: first read the file with rows of unequal length, then split the DataFrame into sub-DataFrames.
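import io
import pandas as pd

data = io.StringIO('''"05555", "AB", "CC", "DD", "EE", "USA"
"05555", "AB", "CC", "DD", "EE", "CA"
"05555", "AB", "CC", "DD", "EE", "NY"
"0666666", "AB", "CC", "DD", "EE", "NY", "123", "567", "888"
"0666666", "AB", "CC", "DD", "EE", "USA", "123", "567", "999"
''')
# ';' never appears in the data, so each line is read as a single column
df = pd.read_csv(data, sep=';', header=None)
# Split each line on commas; shorter rows are padded with missing values
s = df[0].str.split(',', expand=True)
# Strip the surrounding spaces and quotes from every field
s = s.apply(lambda x: x.str.strip(' "'), axis=1)
for x, y in s.groupby(0):
    # Drop the padding columns that are empty for this code
    print(y.dropna(axis=1))
    y.dropna(axis=1).to_csv(str(x) + '.csv')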
       0   1   2   3   4    5
0  05555  AB  CC  DD  EE  USA
1  05555  AB  CC  DD  EE   CA
2  05555  AB  CC  DD  EE   NY
         0   1   2   3   4    5    6    7    8
3  0666666  AB  CC  DD  EE   NY  123  567  888
4  0666666  AB  CC  DD  EE  USA  123  567  999
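If pandas is not required, the same split can probably be done with the standard library alone. The following is only a minimal sketch, not part of the answer above: the helper names `split_by_first_column` and `write_groups` and the `<code>.txt` file naming are assumptions made for illustration. Lines are kept verbatim, so each output file preserves the original quoting and column count.

```python
import csv
import io
import os
from collections import defaultdict

sample = io.StringIO('''"05555", "AB", "CC", "DD", "EE", "USA"
"05555", "AB", "CC", "DD", "EE", "CA"
"05555", "AB", "CC", "DD", "EE", "NY"
"0666666", "AB", "CC", "DD", "EE", "NY", "123", "567", "888"
"0666666", "AB", "CC", "DD", "EE", "USA", "123", "567", "999"
''')


def split_by_first_column(lines):
    """Group raw lines by the code found in their first field."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        # skipinitialspace=True lets the csv module parse `, "AB"` as a quoted field
        first_field = next(csv.reader([line], skipinitialspace=True))[0]
        groups[first_field].append(line.rstrip("\n"))
    return dict(groups)


def write_groups(groups, directory="."):
    """Write one file per code (hypothetical naming: <code>.txt)."""
    for code, rows in groups.items():
        path = os.path.join(directory, code + ".txt")
        with open(path, "w") as handle:
            handle.write("\n".join(rows) + "\n")


groups = split_by_first_column(sample)
# write_groups(groups)  # uncomment to create 05555.txt and 0666666.txt
```

To run it on a real file, pass an open file handle instead of `sample`; the file is read line by line, so only the grouped lines are held in memory.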
Try using groupby and a for loop, then write each group to a CSV file:
for i, (_, group) in enumerate(df.groupby(df.iloc[:, 0]), 1):
    group.to_csv('File%s' % i)
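This loop assumes `df` already holds the code in its first column, which the answer does not show for rows of unequal length. One possible way to build such a frame is sketched below. The constant `MAX_COLUMNS` is an assumption (the widest record in the sample has 9 fields and a real file may need a larger value), and `dtype=str` is used so that leading zeros such as `05555` are kept.

```python
import io
import pandas as pd

data = io.StringIO('''"05555", "AB", "CC", "DD", "EE", "USA"
"05555", "AB", "CC", "DD", "EE", "CA"
"05555", "AB", "CC", "DD", "EE", "NY"
"0666666", "AB", "CC", "DD", "EE", "NY", "123", "567", "888"
"0666666", "AB", "CC", "DD", "EE", "USA", "123", "567", "999"
''')

MAX_COLUMNS = 9  # assumption: no record has more than 9 fields

# Naming all columns up front lets read_csv pad the shorter rows with NaN
df = pd.read_csv(
    data,
    header=None,
    names=range(MAX_COLUMNS),
    skipinitialspace=True,  # tolerate the space after each comma
    dtype=str,              # keep codes such as "05555" as text
)

# One sub-frame per code, without the all-NaN padding columns
parts = {
    code: group.dropna(axis=1, how="all")
    for code, group in df.groupby(0)
}

for i, part in enumerate(parts.values(), 1):
    print(part)
    # part.to_csv('File%s' % i, header=False, index=False)  # uncomment to write files
```

Passing `header=False, index=False` to `to_csv` keeps the row index and the numeric column labels out of the output files; omit them if those labels are wanted.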