Using Ruby's Treetop PEG to parse a Debian Packages.gz
I'm trying to use Ruby's Treetop to parse a Packages.gz, but I can't make the tags and values unambiguous. Here is my Treetop grammar:
grammar Debian
  rule collection
    entry+
  end
  rule entry
    (tag space value)
  end
  rule package_details
    tag value &[^$]
  end
  rule tag
    [A-Za-z0-9\-]+ ":"
  end
  rule value
    (!tag value_line+ "\n")+
  end
  rule value_line
    ([A-Za-z0-9 <>@()=\.\-|/,_"':])+
  end
  rule space
    [ \t]+
  end
end
Here is my sample input:
Package: acct
Priority: optional
Section: admin
Installed-Size: 352
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Mathieu Trudel <mathieu.tl@gmail.com>
Architecture: i386
Version: 6.5.4-2ubuntu1
Depends: dpkg (>= 1.15.4) | install-info, libc6 (>= 2.4)
Filename: pool/main/a/acct/acct_6.5.4-2ubuntu1_i386.deb
Size: 111226
MD5sum: 10cba1458ace8c31169c0e9e915c9a0f
SHA1: 6c2dcdc480144a9922329cd4fa22c7d1cb83fcbb
SHA256: bf8d8bb8eef3939786a1cefc39f94079f43464b71099f4a59b61b24cafdbc010
Description: The GNU Accounting utilities for process and login accounting
 GNU Accounting Utilities is a set of utilities which reports and summarizes
 data about user connect times and process execution statistics.
 .
 "Login accounting" provides summaries of system resource usage based on connect
 time, and "process accounting" provides summaries based on the commands
 executed on the system.
 .
 The 'last' command is provided by the sysvinit package and not included here.
Homepage: http://www.gnu.org/software/acct/
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu
Supported: 18m
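This works almost 100% of the way, but then it fails when it reaches the URL. The problem is that the URL contains a ":" and I can't seem to get the grammar past it. When I edit the sample's Homepage entry and replace the ":" with "_", the parse goes all the way through.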
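A minimal sketch of what appears to be happening, written with plain Ruby regexps instead of Treetop so it runs without the gem (that substitution is an assumption, as is the helper name value_may_start?). The value rule begins with !tag, and the text "http:" on its own satisfies the tag rule, so the lookahead rejects the URL as a value.

```ruby
# Regexp stand-in for the grammar's `tag` rule:  [A-Za-z0-9\-]+ ":"
# \A anchors the match at the current input position, as a PEG would.
TAG = /\A[A-Za-z0-9\-]+:/

# Hypothetical helper that mimics the `!tag` lookahead at the start of
# `value`: a value may only begin where `tag` does NOT match.
def value_may_start?(text)
  !TAG.match?(text)
end

value_may_start?("optional\n")                            # => true
value_may_start?("http://www.gnu.org/software/acct/\n")   # => false ("http:" looks like a tag)
value_may_start?("http_//www.gnu.org/software/acct/\n")   # => true  (the edited sample)
```

This only reproduces the lookahead decision, not the whole grammar, but it matches the observed behaviour: the edited Homepage line passes and the real one does not.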
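This is my first PEG grammar, but I can tell that I need to make it less ambiguous and more concise. Looking at the advanced documentation, I would like to define tag as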
  rule tag
    !(!'\n' .) [A-Za-z0-9\-]+ ":"
  end
But I don't fully understand what it does. My reading is: a tag must not be (not a newline, then any character). These subtleties escape me...
Would switching to that form help me? Does anyone know why it isn't matching?
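To make the doubled negation concrete, here is a hedged sketch that rewrites the proposed rule as a Ruby regexp. Ruby lookaheads are only an approximation of Treetop's ! predicate, and the constant names are made up for illustration. Read this way, !(!'\n' .) succeeds only when the next character is a newline or the input has ended, which is exactly where the tag characters that follow cannot match.

```ruby
# Regexp stand-in for:  !(!'\n' .) [A-Za-z0-9\-]+ ":"
# The /m flag lets `.` match a newline too, like `.` in a PEG.
PROPOSED_TAG = /\A(?!(?!\n).)[A-Za-z0-9\-]+:/m

# The lookahead on its own: "not (a non-newline character)".
LOOKAHEAD_ONLY = /\A(?!(?!\n).)/m

LOOKAHEAD_ONLY.match?("\nPackage:")   # => true  (next character is a newline)
LOOKAHEAD_ONLY.match?("")             # => true  (end of input)
LOOKAHEAD_ONLY.match?("Package:")     # => false (next character is "P")

# Lookahead consumes nothing, so the tag characters must match at the
# same position, and a newline is not in [A-Za-z0-9\-].
PROPOSED_TAG.match?("Package: acct\n")     # => false
PROPOSED_TAG.match?("\nPackage: acct\n")   # => false
```

If that reading is right, the predicate checks what comes next rather than what came before, so it cannot express "a tag starts at the beginning of a line".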
At this point I seem to have a working grammar:
grammar Debian
  # The file is too big for us to emit a package_list. Look at parser.rb to see how I just split the string.
  #rule package_list
  #  (package "\n"?)+ <DebianSyntaxNode::PackageList>
  #end
  rule package
    (tag / value)+ <DebianSyntaxNode::Package>
  end
  rule tag
    tag_value tag_stop <DebianSyntaxNode::Tag>
  end
  rule tag_value
    [\w\-]+ <DebianSyntaxNode::TagValue>
  end
  rule tag_stop
    ": " <DebianSyntaxNode::TagStop>
  end
  rule value
    value_line value_stop <DebianSyntaxNode::Value>
    # value_line value_stop <DebianSyntaxNode::Value>
  end
  rule value_line
    (!"\n" .)+ <DebianSyntaxNode::ValueLine>
    # ([\w \. " , \- ' : / < > @ ( ) = | \[ \] + ; ~ í á * % `])+ <DebianSyntaxNode::ValueLine>
  end
  rule value_stop
    "\n"? <DebianSyntaxNode::ValueStop>
  end
end
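For the combining step, one possible approach is sketched below in plain Ruby. It is not taken from the project's parser.rb; the helper name parse_stanza and the sample stanza are hypothetical. It assumes the usual Debian control-file convention that continuation lines start with a space or tab, and it folds them into the preceding field with the newline kept.

```ruby
# Hypothetical helper, not the project's parser.rb.
# Turns one package stanza into a Hash of field name => value,
# joining continuation lines onto the field that opened them.
def parse_stanza(text)
  fields = {}
  current = nil
  text.each_line do |line|
    line = line.chomp
    if line.start_with?(" ", "\t")
      # Continuation line: append to the most recent field, keeping "\n".
      fields[current] << "\n" << line.strip if current
    elsif (m = line.match(/\A([\w\-]+):\s*(.*)\z/))
      # New field: only the first ":" on the line ends the tag.
      current = m[1]
      fields[current] = m[2]
    end
  end
  fields
end

stanza = [
  "Package: acct",
  "Description: The GNU Accounting utilities",
  " GNU Accounting Utilities is a set of utilities.",
  " .",
  " executed on the system.",
  "Homepage: http://www.gnu.org/software/acct/",
].join("\n")

fields = parse_stanza(stanza)
fields["Homepage"]      # => "http://www.gnu.org/software/acct/"
fields["Description"]   # => "The GNU Accounting utilities\nGNU Accounting Utilities is a set of utilities.\n.\nexecuted on the system."
```

Because a tag is only recognised at the start of a line that does not begin with whitespace, colons inside URLs or description text are left alone.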
If you want to see where this code is going, take a look at a small GitHub project I have started: https://github.com/derdewey/Debian-Packages-Parser