I am trying to list all the .gz files from this website:
site=http://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/rdf/
curl -s "$site" --list-only | sed -n 's%.*href="rdf/uni([^"]*\.rdf.gz)".*%\1%p'
But I am getting this error:
sed: -e expression #1, char 40: invalid reference \1 on `s' command's RHS
I would avoid using regex to parse HTML. Here is an alternative that uses Perl with Mojolicious as the parser:
perl -Mojo -E '
g(q|http://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/rdf/|)
->dom
->find(q|a|)
->each(sub {
my $t = $_->text;
say $t if $t =~ m/rdf\.gz\Z/
})'
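But if you insist on sed, your regular expression has some problems. First, in sed's default basic regular expressions, parentheses must be escaped as \( and \) to do grouping; unescaped, they match literal characters, which is why \1 is reported as an invalid reference. Second, rdf/uni does not match, because the href values in the listing do not contain it. Third, [^"]* also matches the characters of the rdf.gz extension. Change it to match up to a . and then check the extension, but remember that this is very fragile. It could fail in many ways, for example with a file that has a . in its name: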
curl -s "$site" --list-only | sed -n 's%.*href="\([^.]*\.rdf\.gz\)".*%\n\1%; ta; b; :a; s%.*\n%%; p'
Both commands yield:
citations.rdf.gz
databases.rdf.gz
diseases.rdf.gz
enzyme.rdf.gz
go.rdf.gz
journals.rdf.gz
keywords.rdf.gz
locations.rdf.gz
pathways.rdf.gz
taxonomy.rdf.gz
tissues.rdf.gz
uniparc.rdf.gz
uniprot.rdf.gz
uniref.rdf.gz
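To make the grouping and fragility points concrete without hitting the network, here is a minimal sketch that runs sed over a few sample lines. The sample HTML and the file name uniprotkb_2.5.rdf.gz are assumptions made up for illustration, not taken from the real listing, and the -E variant is an alternative I am suggesting, not part of the commands above.

```shell
# Hypothetical index lines standing in for the curl output.
# "uniprotkb_2.5.rdf.gz" is an invented name with an extra dot in it.
sample='<a href="go.rdf.gz">go.rdf.gz</a>
<a href="README">README</a>
<a href="uniprotkb_2.5.rdf.gz">uniprotkb_2.5.rdf.gz</a>'

# Basic regular expressions (sed default): groups are written \( \).
# [^.]* stops at the first dot, so the dotted name is skipped.
bre=$(printf '%s\n' "$sample" | sed -n 's%.*href="\([^.]*\.rdf\.gz\)".*%\1%p')

# Extended regular expressions (-E): plain ( ) group.
# [^"]* backtracks to the extension, so the dotted name is kept.
ere=$(printf '%s\n' "$sample" | sed -nE 's%.*href="([^"]*\.rdf\.gz)".*%\1%p')

printf 'BRE with [^.]*:\n%s\n' "$bre"
printf 'ERE with [^"]*:\n%s\n' "$ere"
```

With this sample, the first command prints only go.rdf.gz, while the second prints both go.rdf.gz and uniprotkb_2.5.rdf.gz. Either way, this is still regex over HTML, so the parser-based approach remains the safer choice.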