
How is the Python grammar used internally?

I'm trying to get a deeper understanding of how Python works, and I've been looking at the grammar shown at http://docs.python.org/3.3/reference/grammar.html .

I notice it says you would have to change parsermodule.c also, but truthfully I'm just not following what's going on here.

I understand that a grammar is a specification for how to read the language, but... I can't even tell what this is written in. It looks almost like Python, but then it isn't.

I'm looking to get a better understanding of this specification and how it is used internally by Python to... do things. What depends on it (the answer is everything, but I mean specifically which aspect of the "engine" is processing it), what uses it, and how does it tie in to compiling/running a script?

It's hard to believe that the whole language comes down to a two page specification...

A grammar is used to describe all possible strings in a language. It is also useful in specifying how a parser should parse the language.

In this grammar it seems like they are using their own version of EBNF, where a non-terminal is any lowercase word and a terminal is all uppercase or surrounded by quotes. For example, NEWLINE is a terminal, arith_expr is a non-terminal and 'if' is also a terminal. Any non-terminal can be replaced by anything to the right of the colon of its respective production rule. For example, if you look at the first rule:

single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE

We can replace single_input with one of either a NEWLINE, a simple_stmt or a compound_stmt followed by a NEWLINE. Suppose we replaced it with "compound_stmt NEWLINE", then we would look for the production rule for compound_stmt:

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated

and choose which of these we want to use, and substitute it for "compound_stmt" (keeping NEWLINE in its place).

Suppose we wanted to generate the valid Python program:

if 5 < 2 + 3 or not 1 == 5:
    raise

We could use the following derivation:

  1. single_input
  2. compound_stmt NEWLINE
  3. if_stmt NEWLINE
  4. 'if' test ':' suite NEWLINE
  5. 'if' or_test ':' NEWLINE INDENT stmt DEDENT NEWLINE
  6. 'if' and_test 'or' and_test ':' NEWLINE INDENT simple_stmt DEDENT NEWLINE
  7. 'if' not_test 'or' not_test ':' NEWLINE INDENT small_stmt NEWLINE DEDENT NEWLINE
  8. 'if' comparison 'or' 'not' not_test ':' NEWLINE INDENT flow_stmt NEWLINE DEDENT NEWLINE
  9. 'if' expr comp_op expr 'or' 'not' comparison ':' NEWLINE INDENT raise_stmt NEWLINE DEDENT NEWLINE
  10. 'if' arith_expr '<' arith_expr 'or' 'not' arith_expr comp_op arith_expr ':' NEWLINE INDENT 'raise' NEWLINE DEDENT NEWLINE
  11. 'if' term '<' term '+' term 'or' 'not' arith_expr '==' arith_expr ':' NEWLINE INDENT 'raise' NEWLINE DEDENT NEWLINE
  12. 'if' NUMBER '<' NUMBER '+' NUMBER 'or' 'not' NUMBER '==' NUMBER ':' NEWLINE INDENT 'raise' NEWLINE DEDENT NEWLINE

A couple of notes here. Firstly, we must start with one of the non-terminals which is listed as a starting non-terminal. On that page, they are listed as single_input, file_input, or eval_input. Secondly, a derivation is finished once all the symbols are terminals (hence the name). Thirdly, it is more common to do one substitution per line; for the sake of brevity I did all possible substitutions at once and started skipping steps near the end.
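To make the substitution process concrete, here is a minimal Python sketch of the first few derivation steps. The GRAMMAR dictionary is a hand-copied toy subset of the real rules, and expand_leftmost is a hypothetical helper written for this illustration; neither is how CPython represents its grammar.

```python
# Toy subset of the grammar: each non-terminal maps to a list of alternatives,
# and each alternative is a sequence of symbols. Any symbol that is not a key
# of GRAMMAR (including test and suite here) is simply left unexpanded.
GRAMMAR = {
    "single_input": [["NEWLINE"], ["simple_stmt"], ["compound_stmt", "NEWLINE"]],
    "compound_stmt": [["if_stmt"], ["while_stmt"]],
    "if_stmt": [["'if'", "test", "':'", "suite"]],
}


def expand_leftmost(symbols, choice):
    """Replace the leftmost expandable symbol with its alternative number `choice`."""
    for i, sym in enumerate(symbols):
        if sym in GRAMMAR:
            return symbols[:i] + GRAMMAR[sym][choice] + symbols[i + 1:]
    return symbols  # nothing left that this toy grammar can expand


form = ["single_input"]  # one of the starting non-terminals
steps = [form]
# Pick "compound_stmt NEWLINE", then "if_stmt", then the single if_stmt rule.
for choice in (2, 0, 0):
    form = expand_leftmost(form, choice)
    steps.append(form)

for number, sentential_form in enumerate(steps, 1):
    print(number, " ".join(sentential_form))
```

The printed lines correspond to steps 1 through 4 of the derivation above; continuing would only require adding more rules to the dictionary.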

Given a string in the language, how do we find its derivation? This is the job of a parser. A parser reverse-engineers a production sequence to first check that it is indeed a valid string, and furthermore how it can be derived from the grammar. It's worth noting that many grammars can describe a single language. However, for a given string, its derivation will of course be different for each grammar. So technically we write a parser for a grammar, not a language. Some grammars are easier to parse, some grammars are easier to read/understand. This one belongs in the former.

Also this doesn't specify the entire language, just what it looks like. A grammar says nothing about semantics.
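A small sketch of that gap between syntax and semantics, using the standard ast module and the built-in compile and exec. The source string and the name undefined_name are made up for this illustration.

```python
import ast

# This text fits the grammar, so parsing and compiling both succeed.
source = "x = undefined_name + 1\n"
tree = ast.parse(source)
code = compile(tree, "<example>", "exec")

# The problem only shows up when the code runs: the grammar never checks
# whether a name actually refers to anything.
try:
    exec(code, {})
    outcome = "ran"
except NameError:
    outcome = "NameError"

print(outcome)
```

The parser accepts the program without complaint; the failure comes from the runtime, which is where the meaning of the program is decided.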

If you're interested in more about parsing and grammars, I recommend Grune, Jacobs - Parsing Techniques. It's free and good for self-study.

The Python grammar, like most others, is given in BNF or Backus–Naur Form. Try reading up on how to read it, but the basic structure is:

<something> ::= (<something defined elsewhere> | [some fixed things]) [...]

This is read as: a <something> is defined as something else or any of the fixed things, repeated a multitude of times.

BNF is based on a nearly 2000-year-old format for describing the permitted structure of a language. It is incredibly terse and will describe all the allowed structures in a given language, not necessarily all those that would make sense.

Example

Basic arithmetic can be described as:

<simple arithmetic expression> ::= <numeric expr>[ ]...(<operator>[ ]...<numeric expr>|<simple arithmetic expression>)
<numeric expr> ::= [<sign>]<digit>[...][.<digit>[...]]
<sign> ::= +|-
<operator> ::= [+-*/]
<digit> ::= [0123456789]

Which says that a simple arithmetic operation is an optionally signed number consisting of one or more digits, possibly with a decimal point and one or more subsequent digits, optionally followed by spaces, followed by exactly one of +-*/, optionally followed by spaces, followed by either a number or another simple arithmetic operation, i.e. a number followed by, etc.

This describes just about all of the basic arithmetic operations and can be extended to include functions, etc. Notice that it does allow invalid operations that are valid syntax, e.g. 22.34 / -0.0 is valid syntactically even though the result is not valid.

It can sometimes make you aware that operations are possible that you might not have thought of, e.g. 56+-50 is a valid operation, as is 2*-10, but 2*/3 is not.
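The arithmetic rules above are simple enough to check with a regular expression. This is a rough sketch, not a general BNF engine: the names NUMBER, OPERATOR, EXPRESSION and is_simple_arithmetic are my own, and I am assuming that "[ ]..." means "zero or more spaces" and that at least one operator is required, as the prose reading suggests.

```python
import re

# <numeric expr> ::= [<sign>]<digit>[...][.<digit>[...]]
NUMBER = r"[+-]?\d+(?:\.\d+)?"
# <operator> ::= [+-*/]
OPERATOR = r"[+\-*/]"
# A number followed by one or more "operator number" pairs, with optional spaces.
EXPRESSION = re.compile(rf"{NUMBER}(?: *{OPERATOR} *{NUMBER})+")


def is_simple_arithmetic(text):
    """Return True if the whole string fits the toy arithmetic grammar."""
    return EXPRESSION.fullmatch(text) is not None


for sample in ["22.34 / -0.0", "56+-50", "2*-10", "2*/3"]:
    print(repr(sample), is_simple_arithmetic(sample))
```

Note that 22.34 / -0.0 is accepted: the check is purely about shape, and says nothing about whether the division can be carried out.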

Note that SGML and XML / Schema are both related but different methodologies for describing the structure of any language. YAML is another method for describing the allowed structures in computer-specific languages.

Disclaimer: My BNF is a little rusty, so if I have made any major mistakes in the above, my apologies, and please correct me.

This is basically an EBNF (Extended Backus–Naur Form) specification.

When you write a program in a language, the very first thing your interpreter/compiler must do in order to go from a sequence of characters to actual action is to translate that sequence of characters into a higher-complexity structure. To do so, first it chunks up your program into a sequence of tokens expressing what each "word" represents. For example, the construct

if foo == 3: print 'hello'

will be converted into

1,0-1,2:    NAME    'if'
1,3-1,6:    NAME    'foo'
1,7-1,9:    OP  '=='
1,10-1,11:  NUMBER  '3'
1,11-1,12:  OP  ':'
1,13-1,18:  NAME    'print'
1,19-1,26:  STRING  "'hello'"
2,0-2,0:    ENDMARKER   ''

But note that even something like "if if if if" is correctly made into tokens:

1,0-1,2:    NAME    'if'
1,3-1,5:    NAME    'if'
1,6-1,8:    NAME    'if'
1,9-1,11:   NAME    'if'
2,0-2,0:    ENDMARKER   ''
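Listings like the ones above can be reproduced with the standard tokenize module. The sketch below is one way to do it in Python 3; token_names is a helper name I made up, and the exact set of tokens (for instance an extra NEWLINE token) may differ from the listings above depending on the Python version.

```python
import io
import tokenize


def token_names(source):
    """Return (token type name, token text) pairs for a source string."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


# The tokenizer happily splits this into four NAME tokens; it does not care
# that the sequence makes no sense as a statement.
for name, text in token_names("if if if if\n"):
    print(name, repr(text))
```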

What follows the tokenization is the parsing into a higher-level structure that analyzes whether the tokens actually make sense taken together, something that the latter example does not, but the first does. To do so, the parser must recognize the actual meaning of the tokens (e.g. if is a keyword, and foo is a variable), then build a tree out of the tokens, organizing them in a hierarchy, and see if this hierarchy actually makes sense. Here is where the grammar you are seeing comes in. That grammar is in BNF, which is a notation to express the constructs the language can recognize. That grammar is digested by a program (for example, bison) which has the magic property of taking that grammar and generating actual C code that does the heavy work for you, normally by recognizing the tokens, organizing them, and returning a parse tree or telling you where there's a mistake.
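You can watch the parsing stage from Python itself with the standard ast and keyword modules. Two caveats for this sketch: ast exposes an abstract syntax tree, which is a simplified form of the parse tree described above, and I have written print as a function call so the snippet parses under Python 3. The helper parses is a name invented for this example.

```python
import ast
import keyword


def parses(source):
    """Return True if the parser accepts the source text."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# The tokenizer calls both of these NAME; the keyword list tells them apart.
print(keyword.iskeyword("if"), keyword.iskeyword("foo"))

tree = ast.parse("if foo == 3: print('hello')\n")
if_node = tree.body[0]

# The hierarchy: an If node whose test is a comparison and whose body is a call.
print(type(if_node).__name__)
print(type(if_node.test).__name__)
print(type(if_node.body[0].value).__name__)

# Same kind of tokens, but no rule in the grammar fits them together.
print(parses("if if if if\n"))
```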

Short version: developing a language is about defining tokens and how these tokens are put together to give something meaningful. This is done through the grammar, which you use to generate the actual "parser" code with automated tools.
