Remove newline (\n) but exclude lines with specific regex?

Question

After a lot of searching, I found a few ways to remove newlines using sed or tr:

sed ':a;N;$!ba;s/\n//g'

tr -d '\n'

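However, I can't find a way to exclude specific lines from the operation. I know that "!" can be used in sed to exclude an address from the action that follows, but I can't figure out how to incorporate it into the sed command above. Here is an example of what I am trying to solve.

I have a file formatted like this: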

>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at

I would like the file to be formatted like this:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

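I have been focusing on excluding the lines that contain the ">" character, since it is the only constant feature of those lines (note: sequence_ID_n is unique for each entry that starts with ">", so it cannot be relied on for regex matching).

I tried this: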

sed ':a;N;$!ba;/^>/!s/\n//g' file.txt > file2.txt

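It runs without producing an error, but the output file is identical to the original.

Maybe I can't do this with sed? Maybe I am approaching the problem the wrong way? Should I instead try to define a range of lines to operate on (i.e. only the lines between the lines that start with ">")?

I am new to basic text manipulation, so any suggestions are greatly appreciated!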

Answer 1

This awk should work:

$ awk '/^>/{print (NR==1)?$0:"\n"$0;next}{printf "%s", $0}END{print ""}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
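
For readers who find the one-liner dense, the same idea can be spelled out over several lines. This is only a sketch of the approach, not the answer's exact code: the function name join_fasta and the seen_seq flag are made up for illustration, and unlike the one-liner it prints nothing at all for empty input.

```shell
# join_fasta is a hypothetical helper name, not part of the original answer.
join_fasta() {
  awk '
    /^>/ {
      # Header line: finish the previous joined sequence (if any),
      # then print the header on a line of its own.
      if (seen_seq) print ""
      print
      seen_seq = 0
      next
    }
    {
      # Sequence line: print it without a trailing newline.
      printf "%s", $0
      seen_seq = 1
    }
    # Terminate the last joined sequence.
    END { if (seen_seq) print "" }
  ' "$@"
}
```

It can be used as `join_fasta file.txt > file2.txt`.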

Answer 2

This might work for you (GNU sed):

sed ':a;N;/^>/M!s/\n//;ta;P;D' file

It removes the newlines from lines that do not begin with >.
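
To make the one-liner easier to follow, here is the same script spread over several lines with a comment per command. Treat it as a sketch: the wrapper name join_fasta_sed is made up for illustration, and the M address flag requires GNU sed. It also hints at why the attempt in the question changed nothing: `:a;N;$!ba` first loads the whole file into the pattern space, which then begins with ">", so `/^>/!` never lets the substitution run.

```shell
# join_fasta_sed is a hypothetical wrapper name; the sed script is the one
# from the answer, only reformatted and commented. Requires GNU sed.
join_fasta_sed() {
  sed '
    :a
    # Append the next input line to the pattern space.
    N
    # With M, ^ also matches right after an embedded newline, so this
    # address is true when either line in the pattern space starts with ">".
    # Only when neither does is the newline between them deleted.
    /^>/M!s/\n//
    # If a newline was deleted, loop and pull in another line.
    ta
    # Otherwise print and drop the first line, then restart with the rest.
    P
    D
  ' "$@"
}
```

Because the pattern space never holds more than two lines at a time, this works on large files without reading them into memory.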

Answer 3

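As @1_CR already said, @jaypal's solution is a good way to do it. But I really couldn't resist trying it in pure Bash. See the comments in the script for details.

Input data: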

$ cat input.txt
>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at
>sequence_ID_20
gattaca

The script:

$ cat script
#!/usr/bin/env bash

# Bash 4 - read the data line by line into an array
readarray -t data < "$1"

# Bash 3 - read the data line by line into an array
#while IFS= read -r line; do
#    data+=("$line")
#done < "$1"

# A search pattern
pattern="^>sequence_ID_[0-9]"

# An array to insert the revised data
merged=()

# A counter
counter=0

# Iterate over each item in our data array
for item in "${data[@]}"; do

    # If an item matches the pattern
    if [[ "$item" =~ $pattern ]]; then

        # Add the item straight into our new array
        merged+=("$item")

        # Raise the counter in order to write the next
        # possible non-matching item to a new index
        (( counter++ ))

        # Continue the loop from the beginning - skip the
        # rest of the code inside the loop for now since it 
        # is not relevant after we have found a match.
        continue
    fi

    # If we have a match in our merged array then
    # raise the counter one more time in order to
    # get a new index position
    [[ "${merged[$counter]}" =~ $pattern ]] && (( counter++ ))

    # Add a non matching value to the already existing index
    # currently having the highest index value based on the counter
    merged[$counter]+="$item"
done

# Test: Echo each item of our merged array
printf "%s\n" "${merged[@]}"

Result:

$ ./script input.txt
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
>sequence_ID_20
gattaca
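
If the intermediate arrays are not needed, the same pure-Bash idea can be written as a single pass over the file. This is a sketch under stated assumptions, not the answer's script: the function name join_fasta_bash is invented here, and a header is taken to be any line that starts with ">" rather than one matching the ">sequence_ID_[0-9]" pattern.

```shell
#!/usr/bin/env bash
# join_fasta_bash is a hypothetical name. Assumption: every header line
# starts with ">" and everything else is sequence data.
join_fasta_bash() {
  local line seq="" have_seq=0
  # "|| [[ -n $line ]]" also picks up a last line without a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == ">"* ]]; then
      # Flush the sequence collected so far, then print the header.
      if (( have_seq )); then
        printf '%s\n' "$seq"
      fi
      printf '%s\n' "$line"
      seq=""
      have_seq=0
    else
      # Sequence line: append it to the running sequence.
      seq+=$line
      have_seq=1
    fi
  done < "$1"
  # Flush the final sequence.
  if (( have_seq )); then
    printf '%s\n' "$seq"
  fi
}
```

It would be called as `join_fasta_bash input.txt`.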

Answer 4

Using GNU sed:

sed -r ':a;/^[^>]/{$!N;s/\n([^>])/\1/;ta}' inputfile

For your input, it produces:

>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
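
The command above can also be read step by step. The following is the same script laid out with comments; the wrapper name join_fasta_sed_loop is made up for illustration, and the -r option and one-line brace syntax of the original assume GNU sed.

```shell
# join_fasta_sed_loop is a hypothetical wrapper name; the sed commands are
# those of the answer, reformatted and commented. Assumes GNU sed (-r).
join_fasta_sed_loop() {
  sed -r '
    :a
    # Only act when the pattern space starts with something other than ">".
    /^[^>]/ {
      # Append the next line, unless this is already the last line.
      $!N
      # Drop the newline when the appended line is not a header;
      # \1 puts back the character that followed the newline.
      s/\n([^>])/\1/
      # After a successful join, go back and try to append another line.
      ta
    }
  ' "$@"
}
```

When the appended line is a header, the substitution fails, the loop ends, and the joined sequence and the header are printed together as two lines.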

Answer 5

Jaypal's solution is the way to go; here is a GNU awk variant:

awk -v RS='>sequence[^\\n]+\\n' '{gsub("\n", "");printf "%s%s%s", $0, NR==1?"":"\n", RT}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

Answer 6

Here is one way to do it with awk:

awk '{printf (/^>/&&NR>1?RS:"")"%s"(/^>/?RS:""),$0}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
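
The trick in this one-liner is that the printf format string itself is built per line. A spelled-out sketch of that idea follows; it is a variant rather than the answer's exact code. The name join_fasta_fmt and the is_header variable are invented here, "\n" is written literally where the answer uses RS, and an END rule is added on the assumption that a final newline is wanted after the last sequence (the one-liner does not emit one).

```shell
# join_fasta_fmt is a hypothetical helper name, not part of the original answer.
join_fasta_fmt() {
  awk '
    {
      is_header = ($0 ~ /^>/)
      # A newline goes before every header except on the first line of the
      # file, and after every header; sequence lines get no newline at all.
      fmt = (is_header && NR > 1 ? "\n" : "") "%s" (is_header ? "\n" : "")
      printf fmt, $0
    }
    # Assumption: terminate the output when the last line was sequence data.
    END { if (NR && !is_header) print "" }
  ' "$@"
}
```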
