
Removal of regex terminal from AST returned by lark-parser

I am interested in using lark to parse the typical output of a website crawler. Here is some sample output, based on my own GitHub site:

--------------------------------------------------------------------
All found URLs:
https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
https://awa5114.github.io/2021/01/12/mypy-pycharm.html
https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
--------------------------------------------------------------------
All local URLs:
https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
https://awa5114.github.io/2021/01/12/mypy-pycharm.html
--------------------------------------------------------------------
All foreign URLs:
https://github.com/awa5114
https://github.com/jekyll/jekyll
https://github.com/jekyll/minima
--------------------------------------------------------------------
All broken URLs:

I am using the following grammar:

start: section~4
section: (bar  "All " descriptor " URLs:"  link_list)
link_list: (url)*
descriptor: "found" | "local" | "foreign" | "broken"
url: /.+/
bar: /-{68}/

%import common.NEWLINE
%ignore NEWLINE

Calling pretty on the resulting tree yields the following:

start
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url   https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url   https://awa5114.github.io/2021/01/12/mypy-pycharm.html
      url   https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url   https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url   https://awa5114.github.io/2021/01/12/mypy-pycharm.html
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url   https://github.com/awa5114
      url   https://github.com/jekyll/jekyll
      url   https://github.com/jekyll/minima
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list

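This looks fine, but I don't want the terminal bar to be included in my tree. How can I achieve this? I went through the docs and tried prefixing bar with an underscore and/or a question mark, but for some reason that does not help...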

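Actually, I just figured it out. The way to do this is not only to prefix bar with an underscore, but also to make it uppercase. In lark, lowercase names define rules and uppercase names define terminals, and terminals whose names start with an underscore are filtered out of the tree. The grammar becomes: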

start: section~4
section: (_BAR  "All " descriptor " URLs:"  link_list)
link_list: (url)*
descriptor: "found" | "local" | "foreign" | "broken"
url: /.+/
_BAR: /-{68}/

%import common.NEWLINE
%ignore NEWLINE

This results in the following tree:

start
  section
    descriptor
    link_list
      url   https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url   https://awa5114.github.io/2021/01/12/mypy-pycharm.html
      url   https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
  section
    descriptor
    link_list
      url   https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url   https://awa5114.github.io/2021/01/12/mypy-pycharm.html
  section
    descriptor
    link_list
      url   https://github.com/awa5114
      url   https://github.com/jekyll/jekyll
      url   https://github.com/jekyll/minima
  section
    descriptor
    link_list

It would be nice if this were made explicit in the lark-parser documentation...
