
Swap two columns - awk, sed, python, perl

I've got data in a large file (280 columns wide, 7 million lines long) and I need to swap the first two columns. I think I could do this with some kind of awk for loop that prints $2, $1 and then a range to the end of the line, but I don't know how to do the range part, and I can't spell out print $2, $1, $3, ..., $280. Most of the column swap answers I've seen here are specific to small files with a manageable number of columns, so I need something that doesn't depend on specifying every column number.

The file is tab delimited:

Affy-id chr 0 pos NA06984 NA06985 NA06986 NA06989

You can do this by swapping the values of the first two fields:

awk ' { t = $1; $1 = $2; $2 = t; print; } ' input_file

I tried perreal's answer with Cygwin on a Windows system with a tab-separated file. It didn't work, because awk's default field separator is whitespace rather than a tab.

If you encounter the same problem, try this instead:

awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file

The input separator is defined by -F $'\t' and the output separator by OFS=$'\t'. To write the result to a new file, redirect the output:

awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file > output_file
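As a quick sanity check of the tab-aware command, here is a minimal sketch that wraps it in a shell function and runs it on a single tab-separated row. The function name swap_first_two and the sample row are made up for illustration; setting FS and OFS in a BEGIN block is just another way of writing the -F and OFS= arguments used above.

```shell
# Hypothetical helper (name is illustrative): swap the first two
# tab-separated columns of whatever arrives on stdin.
swap_first_two() {
    awk 'BEGIN { FS = OFS = "\t" } { t = $1; $1 = $2; $2 = t; print }'
}

# Sample row modelled on the header line in the question.
# Prints the same row with "chr" first and "Affy-id" second, still tab-separated.
printf 'Affy-id\tchr\t0\tpos\tNA06984\n' | swap_first_two
```

Because both separators are tabs, awk rebuilds the record with tabs when a field is assigned, so the remaining columns come through unchanged.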

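Try something closer to what you describe (as written, this prints only the first two columns, swapped, and drops the rest):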

awk '{printf("%s\t%s\n", $2, $1)}' inputfile
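The question asks how to print "$2, $1, then a range" without naming all 280 columns. One way to do that, sketched here under the assumption that the input is tab-separated, is to print the two swapped fields and then loop from field 3 to NF. The function name swap_with_loop is hypothetical.

```shell
# Hypothetical helper: print $2 and $1, then every remaining field,
# without naming the column numbers explicitly.
swap_with_loop() {
    awk -F '\t' '{
        printf "%s\t%s", $2, $1          # the two swapped columns
        for (i = 3; i <= NF; i++)        # the "range" part: fields 3..NF
            printf "\t%s", $i
        printf "\n"
    }'
}

# Prints the row with the first two columns exchanged, tab-separated.
printf 'Affy-id\tchr\t0\tpos\n' | swap_with_loop
```

This never assigns to a field, so awk does not rebuild the record. Whether it is faster than the field-swap version on a 7-million-line file would need measuring.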

This might work for you (GNU sed):

sed -i 's/^\([^\t]*\t\)\([^\t]*\t\)/\2\1/' file

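Have you tried using the cut command? E.g.

cat myhugefile | cut -c10-20,1-9,21- > myrearrangedhugefile

Be aware that cut writes the selected characters in the order they appear in the input, regardless of the order of the list, so this selects the ranges but does not rearrange them.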
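Before relying on cut for reordering, it is worth checking how it treats the order of the list. The short sketch below uses a made-up ten-character line; the list asks for characters 6-10 first, yet cut emits them in input order.

```shell
# The list names characters 6-10 before 1-5, but cut writes the selected
# characters in the order they occur in the input line.
printf 'abcdefghij\n' | cut -c6-10,1-5
# prints: abcdefghij
```

So cut is useful for extracting columns, but a swap needs one of the other tools in this thread.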

This is also easy in perl:

perl -pe 's/^(\S+)\t(\S+)/$2\t$1/;' file > outputfile

You could do this in Perl:

perl -F\\t -nlae 'print join("\t", @F[1,0,2..$#F])' inputfile

The -F specifies the delimiter. In most shells you need to precede a backslash with another backslash to escape it. In newer versions of Perl (5.20 and later), -F implies -a and -n, so they can be dropped.

For your problem you wouldn't need to use -l, because the last column appears last in the output. But in a different situation, where the last column needs to appear between other columns, the newline character must be removed. The -l switch takes care of this.

The "\t" in join can be changed to anything else to produce a different delimiter in the output.

2..$#F specifies the range from index 2 (the third column, since @F is zero-based) to the last column. As you might have guessed, inside the square brackets you can put any single column or range of columns in the desired order.

No need to call anything else but your shell. Note that the unquoted echo joins the words with single spaces, so the tabs in the input are not preserved:

bash> while read -r col1 col2 rest; do 
        echo $col2 $col1 $rest
      done <input_file

Test:

bash> echo "first second a b c d e f g" | 
      while read -r col1 col2 rest; do 
        echo $col2 $col1 $rest
      done
second first a b c d e f g
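For a tab-delimited file like the one in the question, the loop can be adapted so that it splits on tabs only and writes tabs back out. This is a minimal sketch, not a drop-in replacement: the function name swap_cols_builtin is hypothetical, and it assumes every line has at least three columns and no empty fields (read collapses consecutive tabs when the tab is the separator).

```shell
# Hypothetical helper: swap the first two tab-separated columns using only
# shell builtins. Assumes at least three non-empty columns per line.
swap_cols_builtin() {
    tab=$(printf '\t')                    # a literal tab, portable across shells
    while IFS=$tab read -r col1 col2 rest; do
        # rest keeps its own tabs; quoting stops word splitting and globbing
        printf '%s\t%s\t%s\n' "$col2" "$col1" "$rest"
    done
}

# The space inside "a b" survives because only tabs separate fields.
printf 'first\tsecond\ta b\tc\n' | swap_cols_builtin
```

A shell loop like this is likely to be much slower than awk or perl on millions of lines, so it is mainly of interest for small inputs.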

Maybe even with "inlined" Python, as in a Python script within a shell script, but only if you want to do some more scripting with Bash beforehand or afterwards. Otherwise it is unnecessarily complex.

Content of the script file process.sh:

#!/bin/bash

# inline Python script
# the quoted delimiter keeps the shell from expanding anything inside the script
read -r -d '' PYSCR << 'EOSCR'
from __future__ import print_function
import codecs
import sys

encoding = "utf-8"
fn_in = sys.argv[1]
fn_out = sys.argv[2]

# print("Input:", fn_in)
# print("Output:", fn_out)

with codecs.open(fn_in, "r", encoding) as fp_in, \
        codecs.open(fn_out, "w", encoding) as fp_out:
    for line in fp_in:
        # split into two columns and rest
        col1, col2, rest = line.split("\t", 2)
        # swap columns in output
        fp_out.write("{}\t{}\t{}".format(col2, col1, rest))
EOSCR

# ---------------------
# do setup work?
# e.g. list files for processing
# (assumption: the two paths are passed as arguments to process.sh)
inputfile="$1"
outputfile="$2"

# call python script with params
python3 -c "$PYSCR" "$inputfile" "$outputfile"

# do some more processing
# e. g. rename outputfile to inputfile, ...

If you only need to swap the columns for a single file, you can also just create a single Python script and statically define the filenames. Or just use one of the answers above.

awk swapping without a temp variable:

echo '777777744444444464449: 317 647 14423 262927714037  :   0x2A29D5A1BAA7A95541' | 
 mawk '1; ($1 = $2 substr(_, ($2 = $1)^_))^_' FS=':' OFS=':'
777777744444444464449: 317 647 14423 262927714037  :   0x2A29D5A1BAA7A95541

 317 647 14423 262927714037  :777777744444444464449:   0x2A29D5A1BAA7A95541
