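Delete regex and non-alphabet characters with grep/awk/sed
I am formatting a language corpus as text input for a phrase-generation model. Right now the corpus is essentially one long text file whose relevant lines look like this: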
*EXP: I didn't understand what you said .
*CHI: I know [!] &=laugh (.) .
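I can already use grep to get all the lines that start with '*'. What I want to do is strip the 5-character-plus-tab header from each of those lines (removing *EXP: or *CHI: or whatever it is) and remove all non-alphabetic characters such as brackets, parentheses and periods. The only exception is the apostrophe: I need apostrophes converted to '@' symbols, just for this model. I would also like to get rid of tokens that begin with an '&' symbol, since they are non-word utterances. So my target output would be: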
I didn@t understand what you said
I know
I'm new to Unix text manipulation, so I'd appreciate any help!
You can use cut to remove the prefix, for example:
$ cat corpus.txt | cut -c 9-
I didn't understand what you said .
I know [!] &=laugh (.) .
Then, to remove the non-word tokens, you can use sed like this:
$ cat corpus.txt | cut -c 9- | sed 's/\&[^ ]*//g'
I didn't understand what you said .
I know [!] (.) .
Finally, to remove non-alphabetic symbols and convert apostrophes to @, you can pipe the result through sed in the following two steps:
$ cat corpus.txt | cut -c 9- | sed 's/\&[^ ]*//g' | sed "s/[^a-zA-Z ']//g" | sed "s/'/@/g"
I didn@t understand what you said
I know
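The three steps above can be folded into one small function. This is only a sketch under a few assumptions: `clean_corpus` is a hypothetical name, the speaker label is assumed to be separated from the utterance by a tab (as the question describes), so `cut -f 2-` is swapped in for the fixed-width `cut -c 9-`, and two extra sed expressions squeeze and trim the spaces that the deletions leave behind.

```shell
#!/bin/sh
# Hypothetical helper: reads corpus lines on stdin, writes cleaned utterances.
clean_corpus() {
    grep '^\*' |        # keep only the speaker lines
    cut -f 2- |         # drop the label; assumes a tab follows "*EXP:"
    sed -e 's/&[^ ]*//g' \
        -e "s/[^a-zA-Z ']//g" \
        -e "s/'/@/g" \
        -e 's/  */ /g' \
        -e 's/^ //' -e 's/ $//'
}

# Sample run on the two lines from the question.
printf '*EXP:\tI didn'\''t understand what you said .\n*CHI:\tI know [!] &=laugh (.) .\n' |
    clean_corpus
# prints:
# I didn@t understand what you said
# I know
```

Note that `cut -f` passes a line through unchanged when it contains no tab, so this variant only helps if the corpus really is tab-delimited.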
Using perl:
perl -lne '
    /^\*\w{3}:\s+(.*)/ and do {
        $_ = $1;
        s/[^\w\s\047]//g;
        s/\047/@/g;
        print
    }
' file
With explanations:
perl -lne ' # -n wraps the code in an implicit while (<>) { } loop
    # match the line header and capture the
    # interesting remainder in a group:
    /^\*\w{3}:\s+(.*)/ and do {
        $_ = $1; # assign the captured group to the default variable $_
        s/[^\w\s\047]//g; # delete punctuation characters
        s/\047/@/g; # replace each single quote with @
        print # print the modified line
    }
' file
Output:
I didn@t understand what you said
I know laugh
This might work for you (GNU sed):
sed 's/^.....\t//;s/&\S\+//g;y/'\''/\n/;s/[[:punct:]]//g;y/\n/@/' file
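Remove the front of the line, remove the non-word utterances, replace single quotes with newlines, delete punctuation, and finally replace the newlines with @.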
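The detour through a newline is needed because both the apostrophe and @ belong to `[[:punct:]]`, so translating directly to @ before the punctuation pass would lose it; a newline never occurs inside a line that sed has just read, which makes it a safe placeholder. The same commands are spelled out one per line below as a sketch. It assumes GNU sed (`\t`, `\S` and `\+` are GNU extensions), and the function name `clean_with_gnu_sed` is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper; assumes GNU sed and a tab after the 5-character label.
clean_with_gnu_sed() {
    sed "
        # drop the five-character label and the tab that follows it
        s/^.....\t//
        # drop non-word tokens such as &=laugh
        s/&\S\+//g
        # park each apostrophe as a newline, which [[:punct:]] does not match
        y/'/\n/
        # delete all remaining punctuation
        s/[[:punct:]]//g
        # turn the parked newlines into @
        y/\n/@/
    "
}

printf '*EXP:\tI didn'\''t understand what you said .\n' | clean_with_gnu_sed
```

Like the one-liner, this leaves trailing spaces where punctuation was removed and does not filter out lines that lack the `*` header, so it expects input that has already been through grep.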
GNU awk 4.1
#!/usr/bin/awk -f
@include "join"
/^\*/ {
    gsub(/'/, "@")
    gsub(/&=\S+/, "")
    gsub(/[^[:alnum:][:blank:]@]/, "")
    split($0, foo)
    print join(foo, 2, NF)
}
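If the gawk `join` library is not installed, roughly the same idea can be sketched in plain awk by looping over the fields instead. The version below makes a few assumptions of its own: the function name `clean_with_awk` is hypothetical, the apostrophe is passed in through a variable `q` to avoid shell quoting problems, explicit character ranges replace `\S` and the POSIX classes for the sake of older awk implementations, and digits are dropped as well as punctuation.

```shell
#!/bin/sh
# Hypothetical wrapper; reads corpus lines on stdin.
clean_with_awk() {
    awk -v q="'" '
        /^\*/ {
            gsub(q, "@")                   # apostrophe -> @
            gsub(/&[^ \t]*/, "")           # drop tokens such as &=laugh
            gsub(/[^a-zA-Z@ \t]/, "")      # keep letters, @ and blanks
            out = ""
            for (i = 2; i <= NF; i++)      # skip field 1, the speaker label
                out = out (i > 2 ? " " : "") $i
            print out
        }
    '
}

printf '*CHI:\tI know [!] &=laugh (.) .\n' | clean_with_awk
# prints: I know
```

Rebuilding the line from fields 2 to NF also collapses the runs of spaces that the deletions leave behind.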
sed -n "
# select lines that begin with the header pattern *AAA:<tab>
/^\*[A-Z]\{3\}:\t/ {
# remove the header (an empty pattern reuses the last regex)
s///
# change each quote to @
y/'/@/
# remove the non-word tokens
s/\&=[^ ]*//g
# remove everything except letters, @ and spaces (digits may need to be kept too)
s/[^a-zA-Z@ ]//g
# print only these lines
p
}" YourFile
POSIX version (so use --posix with GNU sed). It can be turned into a one-liner by removing the comments and replacing the newlines with ; if needed.
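One way the one-line form might look is sketched below. The assumptions: `clean_oneline` is a hypothetical name, a literal tab is generated with printf and spliced in because `\t` inside a regex is not understood by every sed, the interval is written `\{3\}`, and a space is kept in the final character class so that words stay separated.

```shell
#!/bin/sh
# Hypothetical wrapper; reads corpus lines on stdin.
clean_oneline() {
    tab=$(printf '\t')   # literal tab character for the address pattern
    sed -n "/^\*[A-Z]\{3\}:$tab/{s///;y/'/@/;s/&=[^ ]*//g;s/[^a-zA-Z@ ]//g;p;}"
}

printf '*EXP:\tI didn'\''t understand what you said .\n' | clean_oneline
```

The output keeps a trailing space wherever final punctuation was deleted; appending `s/ *$//` before `p` would trim it.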