Rearranging column data in a loop
I have a data.csv file that contains concatenated data, as shown below. The > lines are the separators between the concatenated files.
>
1.094 1.128 1.439
3.064 3.227 3.371
>
5.131 5.463 5.584
3.65 3.947 4.135
>
1.895 1.954 2.492
5.307 5.589 5.839
I want to rearrange the column data side by side and then save the result to new text files, as shown below. For this demo example, three files should be created. In addition, an extra 0 and 5 should be appended to the end of every row.
cat file1.txt
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5
cat file2.txt
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5
cat file3.txt
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
My trial code:
import pandas as pd
df = pd.read_csv('data.csv', sep='\t')
for columns in df:
    data=df.iloc[:,columns]
    data.concat['data']
    data.to_csv('file1.txt')
error: ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
I am not getting the expected output. I hope the experts can help me. Thanks.
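The ValueError in the question can be reproduced on a tiny frame. The sketch below is only an illustration: the column labels a, b, c and the variable names are made up, not taken from the question. It shows that iterating over a DataFrame yields column labels, which .iloc rejects because it only accepts positions.

```python
import pandas as pd

# Tiny stand-in frame; the labels "a", "b", "c" are hypothetical.
df = pd.DataFrame({"a": [1.094, 3.064], "b": [1.128, 3.227], "c": [1.439, 3.371]})

# Iterating over a DataFrame yields its column labels, not integer positions.
labels = [col for col in df]
print(labels)  # ['a', 'b', 'c']

# .iloc is position based, so handing it a label raises a ValueError.
raised = False
try:
    df.iloc[:, labels[0]]
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Looping over positions works with .iloc ...
by_position = [df.iloc[:, i].tolist() for i in range(df.shape[1])]
# ... and looping over labels works with plain [] indexing.
by_label = [df[col].tolist() for col in df]
```

Either loop style avoids the error; which one fits depends on whether the code needs positions or labels.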
Assumptions:
- each > delimited block has 2 data rows
- data rows can contain a variable number of columns (3 in the provided sample input)
- all data rows have the same number of columns (3 in the provided sample input)
- output file names are of the format fileI.txt, where I ranges from 1 to the number of columns in an input data row (3 in the provided sample data)
- OP's host has enough RAM to hold the entire input file in memory (via awk arrays)
One awk idea:
awk '
/^>/ { next }
     { if (! colcnt) colcnt=NF        # note the number of columns; used to keep track of the number of output files
       for (i=1;i<=colcnt;i++)
           row1[i]=row1[i] (row1[i] ? OFS : "") $i
       getline                        # read the 2nd data row of the block
       for (i=1;i<=colcnt;i++)
           row2[i]=row2[i] (row2[i] ? OFS : "") $i
     }
END  { for (i=1;i<=colcnt;i++) {
           print row1[i],0,5 > ("file" i ".txt")   # parentheses keep the file name expression unambiguous
           print row2[i],0,5 > ("file" i ".txt")
       }
     }
' data.csv
NOTE: OP's sample code implies tab (\t) delimited input, but additional comments from OP seem to indicate the data is (variable) space delimited; the input/output delimiters can be changed if/when OP provides an updated requirement for them.
This generates:
$ head file*.txt
==> file1.txt <==
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5
==> file2.txt <==
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5
==> file3.txt <==
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
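For readers who prefer Python, the same row-by-row accumulation can be sketched without any third-party library. This is a minimal sketch, not the awk answer's author's code: the function names rearrange and write_files are hypothetical, the input is assumed to be whitespace delimited, and unlike the awk script it tracks the row index inside each block rather than hard-coding 2 rows.

```python
from pathlib import Path

SAMPLE = """>
1.094 1.128 1.439
3.064 3.227 3.371
>
5.131 5.463 5.584
3.65 3.947 4.135
>
1.895 1.954 2.492
5.307 5.589 5.839
"""


def rearrange(lines, extra=("0", "5")):
    """Return one list of output rows per input column.

    Assumes every ">"-delimited block has the same number of rows and columns.
    """
    columns = []  # columns[c][r] holds the values of column c, row r, one per block
    row = 0       # row index inside the current block
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            row = 0  # a separator line starts a new block
            continue
        for c, value in enumerate(line.split()):
            while len(columns) <= c:
                columns.append([])
            while len(columns[c]) <= row:
                columns[c].append([])
            columns[c][row].append(value)
        row += 1
    # Append the constant values to every row and join with single spaces.
    return [[" ".join(vals + list(extra)) for vals in col] for col in columns]


def write_files(outputs, directory="."):
    """Write outputs[i] to file{i+1}.txt inside `directory`."""
    for i, rows in enumerate(outputs, start=1):
        Path(directory, f"file{i}.txt").write_text("\n".join(rows) + "\n")


outputs = rearrange(SAMPLE.splitlines())
print("\n".join(outputs[0]))
```

Values are kept as strings, so 3.65 is written back exactly as it was read. To process the real file, something like write_files(rearrange(open("data.csv"))) should do.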
Another solution using jq.
Assumptions: Unix line endings; data.csv starts with a line containing only ">" and ends with an empty line.
for column in 1 2 3; do jq -Rsr --argjson column $column '
split(">\n")[1:]
|map(
split("\n")[:-1]|map(split("\t"))
) as $arr
|[
[ $arr[][0][$column-1]],[$arr[][1][$column-1] ]
|.+["0","5"]
|join("\t")
]|join("\n")
' data.csv > file$column.txt; done
Result:
$ head file*.txt
==> file1.txt <==
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5
==> file2.txt <==
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5
==> file3.txt <==
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
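The split-then-index idea used in the jq filter (split the whole file on the ">" lines, then pick the same column out of every block) might look roughly like this in Python. It is a sketch under assumptions: the function name columns_side_by_side is hypothetical, and fields are split on any whitespace rather than on tabs, following the note in the awk answer that the data appears to be space delimited.

```python
SAMPLE = """>
1.094 1.128 1.439
3.064 3.227 3.371
>
5.131 5.463 5.584
3.65 3.947 4.135
>
1.895 1.954 2.492
5.307 5.589 5.839
"""


def columns_side_by_side(text, extra=("0", "5")):
    """Map each output file name to its content, built from whole-file splitting."""
    # Split on the separator lines; the piece before the leading ">" is empty.
    blocks = [
        [line.split() for line in block.splitlines() if line.strip()]
        for block in text.split(">\n")
    ]
    blocks = [block for block in blocks if block]  # drop empty pieces
    n_rows, n_cols = len(blocks[0]), len(blocks[0][0])
    return {
        f"file{c + 1}.txt": "\n".join(
            " ".join([block[r][c] for block in blocks] + list(extra))
            for r in range(n_rows)
        )
        for c in range(n_cols)
    }


result = columns_side_by_side(SAMPLE)
print(result["file2.txt"])
```

Returning a dictionary keeps the transformation separate from file I/O; each value can then be written to the file named by its key.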
To do this in Python, use numpy. I think the following code should work regardless of how many columns there are in the original file (3 in your example). It does assume blocks of 2 rows. The code below has been updated to take into account that the original file was in fact not separated by tabs, as initially suggested.
import pandas as pd
import numpy as np

# the file is apparently separated only by spaces, and then (one assumes) only
# on the lines that contain data (not the lines with ">"), so read every line
# as a single string column first
fname = "data.txt"
df = pd.read_csv(fname, header=None)
# get rid of rows with the ">" separator
df = df[~df[0].str.contains('>')]
# now split all remaining rows on whitespace
df = df[0].str.split(expand=True)
# the split values are still strings, so convert them to float
df = df.astype(float)
col_len = len(df.columns)
# add some data: one block of 0s and one block of 5s (2 rows each)
df2 = pd.DataFrame(np.array([[0]*(col_len)*2,[5]*(col_len)*2]).reshape(4,col_len))
# concat orig data + added data
df_col = pd.concat([df, df2], ignore_index=True)
# convert to numpy array and reshape: (blocks, 2, cols) transposed to (cols, 2, blocks)
arr = df_col.to_numpy().reshape(df_col.shape[0]//2,2,col_len).T
# split up again (np.split returns a list with one array per column)
tup = np.split(arr,col_len)
# loop through the list and write the files
for idx, elem in enumerate(tup):
    # each array is nested one level deep, so take elem[0]
    np.savetxt(f'file{idx+1}.txt', X=elem[0], fmt='%1.3f', delimiter='\t')
Result of print(elem[0]) in the last loop:
[[1.094 5.131 1.895 0. 5. ]
[3.064 3.65 5.307 0. 5. ]]
[[1.128 5.463 1.954 0. 5. ]
[3.227 3.947 5.589 0. 5. ]]
[[1.439 5.584 2.492 0. 5. ]
[3.371 4.135 5.839 0. 5. ]]
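The reshape-and-transpose step is the core of the numpy approach, so here it is in isolation. This is a sketch under assumptions, not the answer's code: it swaps pandas for np.loadtxt with comments=">" so the separator lines are skipped on read, the function name split_columns is hypothetical, and the sample is read from an in-memory string instead of a file.

```python
import io

import numpy as np

SAMPLE = """>
1.094 1.128 1.439
3.064 3.227 3.371
>
5.131 5.463 5.584
3.65 3.947 4.135
>
1.895 1.954 2.492
5.307 5.589 5.839
"""


def split_columns(source, rows_per_block=2, extra=(0, 5)):
    """Return an array of shape (n_cols, rows_per_block, n_blocks + len(extra))."""
    # comments=">" makes loadtxt treat the separator lines as comments.
    data = np.loadtxt(source, comments=">", ndmin=2)
    n_cols = data.shape[1]
    # (blocks, rows, cols) transposed to (cols, rows, blocks)
    per_column = data.reshape(-1, rows_per_block, n_cols).T
    # Constant values appended to every row of every output
    tail = np.broadcast_to(
        np.asarray(extra, dtype=float), (n_cols, rows_per_block, len(extra))
    )
    return np.concatenate([per_column, tail], axis=2)


result = split_columns(io.StringIO(SAMPLE))
print(result.shape)  # (3, 2, 5)
print(result[0])
```

Each result[i] can then be passed to np.savetxt(f"file{i + 1}.txt", result[i], fmt="%g", delimiter=" "); the %g format drops trailing zeros, which is closer to the output requested in the question than %1.3f.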