NAME libparser - Right-context-sensitive grammar parsing library DESCRIPTION libparser is a small C library that parses input based on a precompiled right-context-sensitive grammar. To use libparser, a developer should write a syntax for the input that his application shall parse, in a syntax based on Extended Backus–Naur form (EBNF) (somewhat simplified but also somewhat extended). libparser-generate(1) is then used to create a C source file describing the syntax, which shall be compiled into an object file with a C compiler. This file provides a definition of a global variable declared in : libparser_rule_table. This variable is used when calling libparser_parse_file(3) to parse the application's input. libparser is proudly non-self-hosted. EXTENDED DESCRIPTION Syntax The grammar for libparser-generate(1)'s input can be described in its own grammar: (* CHARACTER CLASSES *) _space = " " | "\n" | "\t"; _alpha = <"a", "z"> | <"A", "Z">; _octal = <"0", "7">; _digit = <"0", "9">; _xdigit = _digit | <"a", "f"> | <"A", "F">; _nonascii = <128, 255>; (* WHITESPACE/COMMENTS, THE GRAMMAR IS FREE-FORM *) _comment_char = _space | !"*", !"\"", <"!", 0xFF>; _comment_tail = [_comment_char], [_string], ("*)" | _comment_tail | -); _comment = "(*", _comment_tail; _ = {_space | _comment}; (* IDENTIFIERS *) _identifier_head = _alpha | _digit | _nonascii | "_"; _identifier_tail = _identifier_head | "-"; identifier = _identifier_head, {_identifier_tail}; (* STRINGS *) _escape_simple = "\\" | "\"" | "'" | "a" | "b" | "f" | "n" | "r" | "t" | "v"; _escape_hex = ("x" | "X"), _xdigit, _xdigit; _escape_octal = _octal, {_octal}; (* May not exceed 255 in base 10 *) _escape = _escape_simple | _escape_hex | _escape_octal | -; _character = "\\", _escape | !"\"", <" ", 0xFF>; _string = "\"", _character, {_character}, ("\"" | -); string = _string character = "\"", _character, ("\"" | -); (* INTEGERS *) _decimal = _digit, {_digit}; _hexadecimal = "0", ("x" | "X"), _xdigit, {_xdigit}; integer = _decimal | _hexadecimal; (* May not exceed 255. *) (* GROUPINGS *) _low = character | integer; _high = character | integer; nondeterministic = "?"; committed = "+", _, _operand; rejection = "!", _, _operand; concatenation = _operand, {_, ",", _, _operand}; alternation = concatenation, {_, [nondeterministic], "|", _, concatenation}; optional = [nondeterministic], "[", _, _expression, _, "]"; repeated = [nondeterministic], "{", _, _expression, _, "}"; group = "(", _, _expression, _, ")"; char-range = "<", _, _low, _, ",", _, _high, "_", ">"; exception = "-"; embedded-rule = identifier; _literal = char-range | exception | string; _group = optional | repeated | group | embedded-rule; _operand = _group | _literal | rejection | committed; _expression = alternation; (* RULES *) rule = identifier, _, "=", _, _expression, _, ";"; (* This is the root rule of the grammar. *) grammar = _, {rules, _}; The file must be encoded in UTF-8, with LF as the line break (CR and FF are illegal just because). In alternations, the first (leftmost) match is selected. The parser is able to backtrack in case it later turns out that it could not finish that branch. Whenever an exception is reached, the parser will terminate there. Repeated symbols may occur any number of times, including zero. The parser will try to take as much as as possible. Optional symbols are taken whenever possible. Concatenation has higher precedence than alternation, groups ("(", ..., ")") have no semantic meaning and are useful only for including alternations inside concatenations or put alternations or concatenations inside a commitment or rejection. The parser has the ability to be non-deterministic, which can make it really slow. The speed up parsing, you can add commits. Once the committed sentence has been matched, the branching-points inside it are unrecorded, so when the parser backtracks, it will not try different choices inside the committed sentence. Commit sentences are undone by the parser when it backtracks to a branching-point outside the committed sentence. In character ranges, the _high and _low values must be at least 0 and at most 255, and _high must be greater than _low. Rules that begin with an underscore will not show up for the application in the parse result, the rest of the rules will appear in the tree-formatted result. Left recursion is illegal (it will cause stack overflow at runtime as the empty condition before the recursion is always met). Likewise, repeated optional sentences are illegal; a repeated sentence must always consume input, otherwise it gets stuck. Right-context-sensitive grammar libparser originally used context-free grammar, but with introduction of the rejection rule, specifically the ability to reject a rejection, it became a phrase for right-context-sensitive grammar which is a grammar that is that can generate any context-sensitive language, it is however weakly equivalent to context-sensitive grammar.