aboutsummaryrefslogtreecommitdiffstats
path: root/README
blob: 01838792786376cb675a302939b4b4cefdd26089 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
NAME
	libparser - Right-context-sensitive grammar parsing library

DESCRIPTION
	libparser is a small C library that parses input based on a
	precompiled right-context-sensitive grammar.

	To use libparser, a developer should write a syntax for the
	input that his application shall parse, in a syntax based
	on Extended Backus–Naur form (EBNF) (somewhat simplified but
	also somewhat extended). libparser-generate(1) is then used
	to create a C source file describing the syntax, which shall
	be compiled into an object file with a C compiler. This file
	provides a definition of a global variable declared in
	<libparser.h>: libparser_rule_table. This variable is used
	when calling libparser_parse_file(3) to parse the application's
	input.

	libparser is proudly non-self-hosted.

EXTENDED DESCRIPTION
    Syntax
	The grammar for libparser-generate(1)'s input can be described
	in its own grammar:

		(* CHARACTER CLASSES *)

		_space            = " " | "\n" | "\t";
		_alpha            = <"a", "z"> | <"A", "Z">;
		_octal            = <"0", "7">;
		_digit            = <"0", "9">;
		_xdigit           = _digit | <"a", "f"> | <"A", "F">;
		_nonascii         = <128, 255>;


		(* WHITESPACE/COMMENTS, THE GRAMMAR IS FREE-FORM *)

		_comment_str_esc  = "\\", (_space | <"!", 255>);
		_comment_str_char = _space | !"\"", <"!", 255>;
		_comment_str      =  "\"", {_comment_str_esc | _comment_str_char}, ("\"" | -);
		_comment_char     = _space | !"*)", !"\"", <"!", 0xFF>;
		_comment          = "(*", {_comment_char | _comment_str}, ("*)" | -);

		_                 = {_space | _comment};


		(* IDENTIFIERS *)

		_identifier_head  = _alpha | _digit | _nonascii | "_";
		_identifier_tail  = _identifier_head | "-";

		identifier        = _identifier_head, {_identifier_tail};


		(* STRINGS *)

		_escape_simple    = "\\" | "\"" | "'" | "a" | "b" | "f" | "n" | "r" | "t" | "v";
		_escape_hex       = ("x" | "X"), _xdigit, _xdigit;
		_escape_octal     = _octal, {_octal}; (* May not exceed 255 in base 10 *)
		_escape           = _escape_simple | _escape_hex | _escape_octal | -;
		_character        = "\\", _escape | !"\"", <" ", 0xFF>;
		_string           = "\"", _character, {_character}, ("\"" | -);

		string            = _string;
		character         = "\"", _character, ("\"" | -);


		(* INTEGERS *)

		_decimal          = _digit, {_digit};
		_hexadecimal      = "0", ("x" | "X"), _xdigit, {_xdigit};

		integer           = _hexadecimal | _decimal; (* May not exceed 255. *)


		(* GROUPINGS *)

		_low              = character | integer;
		_high             = character | integer;

		nondeterministic  = "?";

		committed         = "+", _, _operand;
		rejection         = "!", _, _operand;
		concatenation     = _operand, {_, ",", _, _operand};
		alternation       = concatenation, {_, [nondeterministic], "|", _, concatenation};
		optional          = [nondeterministic], "[", _, _expression, _, "]";
		repeated          = [nondeterministic], "{", _, _expression, _, "}";
		group             = "(", _, _expression, _, ")";
		char-range        = "<", _, _low, _, ",", _, _high, _, ">";
		exception         = "-";
		embedded-rule     = identifier;

		_literal          = char-range | exception | string;
		_group            = optional | repeated | group | embedded-rule;
		_operand          = _group | _literal | rejection | committed;

		_expression       = alternation;


		(* RULES *)

		rule              = identifier, _, "=", _, _expression, _, ";";

		(* This is the root rule of the grammar. *)
		grammar           = _, {rule, _};

	The file must be encoded in UTF-8, with LF as the line
	break (CR and FF are illegal just because).

	In alternations, the first (leftmost) match is selected. The
	parser is able to backtrack in case it later turns out that it
	could not finish that branch. Whenever an exception is
	reached, the parser will terminate there.

	Repeated symbols may occur any number of times, including
	zero. The parser will try to take as much as as possible.

	Optional symbols are taken whenever possible.

	Concatenation has higher precedence than alternation, groups
	("(", ..., ")") have no semantic meaning and are useful only
	for including alternations inside concatenations or put
	alternations or concatenations inside a commitment or
	rejection.

	The parser has the ability to be non-deterministic, which
	can make it really slow. The speed up parsing, you can add
	commits. Once the committed sentence has been matched, the
	branching-points inside it are unrecorded, so when the parser
	backtracks, it will not try different choices inside the
	committed sentence. Commit sentences are undone by the parser
	when it backtracks to a branching-point outside the committed
	sentence.

	In character ranges, the _high and _low values must be at
	least 0 and at most 255, and _high must be greater than _low.

	Rules that begin with an underscore will not show up for
	the application in the parse result, the rest of the rules
	will appear in the tree-formatted result.

	Left recursion is illegal (it will cause stack overflow at
	runtime as the empty condition before the recursion is always
	met). Likewise, repeated optional sentences are illegal; a
	repeated sentence must always consume input, otherwise it
	gets stuck.

    Right-context-sensitive grammar
	libparser originally used context-free grammar, but with
	introduction of the rejection rule, specifically the ability
	to reject a rejection, it became a phrase for
	right-context-sensitive grammar which is a grammar that is
	that can generate any context-sensitive language, it is however
	weakly equivalent to context-sensitive grammar.