
Rearranging column data in a loop

I have a data.csv file that contains concatenated data, as shown below. The > lines are the separators between the concatenated files.

>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839

I want to rearrange the column data side by side and save the result to new text files, as shown below. For this demo example, three files should be created. In addition, an extra 0 and 5 should be appended to the end of each row.

cat file1.txt
1.094  5.131  1.895 0 5
3.064  3.65   5.307 0 5

cat file2.txt
1.128  5.463  1.954 0 5
3.227  3.947  5.589 0 5

cat file3.txt
1.439  5.584  2.492 0 5
3.371  4.135  5.839 0 5

My trial code:

import pandas as pd
df = pd.read_csv('data.csv', sep='\t')
for columns in df:
    data=df.iloc[:,columns]
data.concat['data']
data.to_csv('file1.txt')

Error: ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

I am not getting the expected output. I hope the experts can help me. Thanks.
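On the error itself: iterating over a DataFrame yields its column labels, not integer positions, and .iloc accepts only positions. The sketch below illustrates this on a made-up in-memory sample; the column names a, b, c and the sample values are assumptions for illustration, not the real data.csv.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample with a header row (not the real data.csv)
sample = "a\tb\tc\n1.094\t1.128\t1.439\n3.064\t3.227\t3.371\n"
df = pd.read_csv(io.StringIO(sample), sep="\t")

# Iterating over a DataFrame yields column labels (strings here)
labels = [col for col in df]

# .iloc is position-based, so passing a string label is rejected
try:
    df.iloc[:, labels[0]]
    iloc_error = None
except (ValueError, TypeError, IndexError) as exc:
    iloc_error = type(exc).__name__

# Two working alternatives: index by position, or index by label
by_position = [df.iloc[:, i].tolist() for i in range(df.shape[1])]
by_label = [df[col].tolist() for col in df]

print(labels, iloc_error)
```

Both alternatives select the same columns; the choice only depends on whether the loop variable holds positions or labels.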

Assumptions:

  • each > delimited block has 2 data rows
  • the number of columns per data row is not fixed in advance (3 in the provided sample input)
  • all data rows have the same number of columns (3 in the provided sample input)
  • output file names are of the form fileI.txt, where I ranges from 1 to the number of columns in an input data row (3 in the provided sample data)
  • OP's host has enough RAM to hold the entire input file in memory (via awk arrays)
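The structural assumptions above can be checked before any rearranging is done. The following is only a sketch: the helper names read_blocks and check_assumptions are hypothetical, and the sample text is copied from the question rather than read from data.csv.

```python
SAMPLE = """>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839
"""


def read_blocks(text):
    """Group whitespace-split data rows into blocks separated by '>' lines."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                blocks.append(current)
            current = []
        elif line.strip():
            current.append(line.split())
    if current:
        blocks.append(current)
    return blocks


def check_assumptions(blocks, rows_per_block=2):
    """Return the common column count, or raise ValueError if a check fails."""
    if not blocks:
        raise ValueError("no data blocks found")
    n_cols = len(blocks[0][0])
    for number, block in enumerate(blocks, start=1):
        if len(block) != rows_per_block:
            raise ValueError(f"block {number} has {len(block)} rows")
        for row in block:
            if len(row) != n_cols:
                raise ValueError(f"block {number} has a row with {len(row)} columns")
    return n_cols


blocks = read_blocks(SAMPLE)
n_cols = check_assumptions(blocks)
print(len(blocks), n_cols)  # prints "3 3"
```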

One awk idea:

awk '
/^>/   { next }
       { if (! colcnt) colcnt=NF                         # note the number of columns; used to keep track of the number of output files
         for (i=1;i<=colcnt;i++)
             row1[i]=row1[i] (row1[i] ? OFS : "") $i
         getline
         for (i=1;i<=colcnt;i++)
             row2[i]=row2[i] (row2[i] ? OFS : "") $i
       }
END    { for (i=1;i<=colcnt;i++) {
             print row1[i],0,5 > ("file" i ".txt")       # parentheses keep the file name expression portable across awk implementations
             print row2[i],0,5 > ("file" i ".txt")
         }
       }
' data.csv

NOTE: OP's sample code implies tab ( \t ) delimited input, but additional comments from OP seem to indicate the data is delimited by a variable number of spaces; the input/output delimiters can be changed if/when OP provides an updated requirement for them.
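For comparison, in Python the tab-versus-spaces question can be sidestepped on the input side by splitting on any run of whitespace, which is also what awk's default field splitting does. A minimal sketch; out_delim is an assumed choice of output delimiter, not something OP specified.

```python
# One tab-delimited line and one space-delimited line in the style of the question
tab_line = "1.094\t1.128\t1.439"
space_line = "3.65    3.947   4.135"

# str.split() with no argument splits on any run of whitespace
fields_tab = tab_line.split()
fields_space = space_line.split()

# Assumed output delimiter; change it to " " for space-separated output
out_delim = "\t"
out_line = out_delim.join(fields_space + ["0", "5"])

print(fields_tab)
print(repr(out_line))
```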

This generates:

$ head file*.txt
==> file1.txt <==
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5

==> file2.txt <==
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5

==> file3.txt <==
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
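For readers more comfortable with Python, the same accumulate-per-column idea can be sketched without awk. This is an illustration under assumptions, not a drop-in replacement: collect_columns and write_files are hypothetical helper names, the sample text is copied from the question, and the files are written to a temporary directory so the example does not touch the working directory. Values are kept as strings, so 3.65 is written as it appears in the input.

```python
import tempfile
from pathlib import Path

SAMPLE = """>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839
"""


def collect_columns(text, extra=("0", "5")):
    """Build {column number: [output row, ...]} from '>'-delimited text."""
    rows_by_col = {}
    row_in_block = 0
    for line in text.splitlines():
        if line.startswith(">"):
            row_in_block = 0  # a new block starts
            continue
        fields = line.split()
        if not fields:
            continue
        for col, value in enumerate(fields, start=1):
            rows = rows_by_col.setdefault(col, [])
            if row_in_block == len(rows):
                rows.append([])  # first time this row position is seen
            rows[row_in_block].append(value)
        row_in_block += 1
    # Append the extra values to the end of every output row
    for rows in rows_by_col.values():
        for row in rows:
            row.extend(extra)
    return rows_by_col


def write_files(rows_by_col, out_dir, delimiter=" "):
    """Write one fileN.txt per input column into out_dir."""
    for col, rows in rows_by_col.items():
        lines = [delimiter.join(row) for row in rows]
        (Path(out_dir) / f"file{col}.txt").write_text("\n".join(lines) + "\n")


rows_by_col = collect_columns(SAMPLE)
with tempfile.TemporaryDirectory() as tmp:
    write_files(rows_by_col, tmp)
    file1 = (Path(tmp) / "file1.txt").read_text()
print(file1, end="")
```

Unlike the awk script, this sketch does not rely on each block having exactly 2 rows, but it still assumes that every block has the same number of rows and columns.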

Another solution, using jq.
Assumptions: Unix line endings; data.csv starts with a line containing only ">" and ends with an empty line.

for column in 1 2 3; do jq -Rsr --argjson column $column '
    split(">\n")[1:]
    |map(  
        split("\n")[:-1]|map(split("\t"))  
    ) as $arr
    |[ 
         [ $arr[][0][$column-1]],[$arr[][1][$column-1] ] 
         |.+["0","5"] 
         |join("\t")  
    ]|join("\n") 
' data.csv > file$column.txt; done

Result:

$ head file*.txt
==> file1.txt <==
1.094   5.131   1.895   0       5
3.064   3.65    5.307   0       5

==> file2.txt <==
1.128   5.463   1.954   0       5
3.227   3.947   5.589   0       5

==> file3.txt <==
1.439   5.584   2.492   0       5
3.371   4.135   5.839   0       5

To do this in Python, use numpy. I think the following code should work regardless of how many columns there are in the original file (3 in your example). It does assume blocks of 2 rows. The code below has been updated to take into account that the original file was in fact not separated by tabs, as initially suggested.

import pandas as pd
import numpy as np

fname = "data.csv"  # the file named in the question
# the data rows are apparently separated only by spaces;
# the lines holding just ">" are block separators and carry no data

df = pd.read_csv(fname, header=None)

# get rid of rows with ">" separator
df = df[~df[0].str.contains('>')]

# now split all remaining rows
df = df[0].str.split(expand=True)

# convert to float (the split above produces strings)
df = df.astype(float)

col_len = len(df.columns)

# add some data
df2 = pd.DataFrame(np.array([[0]*(col_len)*2,[5]*(col_len)*2]).reshape(4,col_len))

# concat orig data + added data
df_col = pd.concat([df, df2], ignore_index=True)

# convert to numpy array, and reshape 
arr = df_col.to_numpy().reshape(int(df_col.shape[0]/2),2,col_len).T

# split up again
tup = np.split(arr,col_len)

# loop through the list of sub-arrays returned by np.split and write the files
for idx, elem in enumerate(tup):
    # numpy arr will be nested, so get elem[0]:
    np.savetxt(f'file{idx+1}.txt', X=elem[0], fmt='%1.3f', delimiter='\t')

Result of print(elem[0]) in the last loop:

[[1.094 5.131 1.895 0.    5.   ]
 [3.064 3.65  5.307 0.    5.   ]]
[[1.128 5.463 1.954 0.    5.   ]
 [3.227 3.947 5.589 0.    5.   ]]
[[1.439 5.584 2.492 0.    5.   ]
 [3.371 4.135 5.839 0.    5.   ]]
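The reshape-and-transpose step is the least obvious part of this answer, so here is a small sketch of what it does to the axes. The array is typed in by hand from the sample data (with the two rows of zeros and two rows of fives already appended) and is an illustration only; the name blocks is an assumption introduced here.

```python
import numpy as np

# Sample rows without the ">" lines, followed by two 0-rows and two 5-rows
data = np.array([
    [1.094, 1.128, 1.439],
    [3.064, 3.227, 3.371],
    [5.131, 5.463, 5.584],
    [3.65, 3.947, 4.135],
    [1.895, 1.954, 2.492],
    [5.307, 5.589, 5.839],
    [0, 0, 0],
    [0, 0, 0],
    [5, 5, 5],
    [5, 5, 5],
])
n_cols = data.shape[1]

# Group every 2 consecutive rows: axes are (block, row, column)
blocks = data.reshape(-1, 2, n_cols)

# .T reverses the axis order: axes become (column, row, block)
arr = blocks.T

print(blocks.shape, arr.shape)
print(arr[0])  # everything that goes into file1.txt
```

Because the padding rows form two extra blocks (one of zeros, one of fives), they end up as the last two values of every output row after the transpose.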
