Header Ads Widget

Responsive Advertisement

Several simple languages

 Several simple languages - but by no means all - can be conveniently specified using the notation of regular expressions. A regular expression specifies the form that a string may take by using the symbols from the alphabet A in conjunction with a few other metasymbols, which represent operations that allow for Concatenation - symbols or strings may be concatenated by writing them next to one another, or by using the metasymbol · (dot) if further clarity is required. Alternation - a choice between two symbols a and b is indicated by separating them by the metasymbol | (bar). Repetition - a symbol a followed by the metasymbol * (star) indicates that a sequence of zero or more occurrences of a is allowable. Grouping - a group of symbols may be surrounded by the metasymbols ( and ) (parentheses). As an example of a regular expression, consider 1 ( 1 | 0 )* 0 This generates the set of strings, each of which has a leading 1, is followed by any number of 0’s or 1’s, and is terminated with a 0 - that is, the set { 10, 100, 110, 1000 ... } If a semantic interpretation is required, the reader will recognize this as the set of strings representing non-zero even numbers in a binary representation, Formally, regular expressions may be defined inductively as follows: A regular expression denotes a regular set of strings. Ø is a regular expression denoting the empty set. is a regular expression denoting the set that contains only the empty string. is a regular expression denoting a set containing only the string . If A and B are regular expressions, then ( A ) and A | B and A · B and A* are also regular expressions. Thus, for example, if and are strings generated by regular expressions, and · are also generated by a regular expression. The reader should take note of the following points: As in arithmetic, where multiplication and division take precedence over addition and subtraction, there is a precedence ordering between these operators. Parentheses take precedence over repetition, which takes precedence over concatenation, which in turn takes precedence over alternation. Thus, for example, the following two regular expressions are equivalent his | hers and h ( i | er ) s and both define the set of strings { his , hers }. If the metasymbols are themselves allowed to be members of the alphabet, the convention is to enclose them in quotes when they appear as simple symbols within the regular expression. For example, comments in Pascal may be described by the regular expression "(" "*" c* "*" ")" where c A Some other shorthand is commonly found. For example, the positive closure symbol + is sometimes used, so that a+ is an alternative representation for a a*. A question mark is sometimes used to denote an optional instance of a, so that a? denotes a | . Finally, brackets and hyphens are often used in place of parentheses and bars, so that [a-eBC] denotes (a | b | c | d | e | B | C). Regular expressions have a variety of algebraic properties, among which we can draw attention to A | B = B | A (commutativity for alternation) A | ( B | C ) = ( A | B ) | C (associativity for alternation) A | A = A (absorption for alternation) A · ( B · C ) = ( A · B ) · C (associativity for concatenation) A ( B | C ) = A B | A C (left distributivity) ( A | B ) C = A C | B C (right distributivity) A = A = A (identity for concatenation) A * A * = A * (absorption for closure) Regular expressions are of practical interest in programming language translation because they can be used to specify the structure of the tokens (like identifiers, literal constants, and comments) whose recognition is the prerogative of the scanner (lexical analyser) phase of a compiler. For example, the set of integer literals in many programming languages is described by the regular expression (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)+ or, more verbosely, by (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) · (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)* or, more concisely, by [0-9]+ and the set of identifiers by a similar regular expression (a | b | c | ... | Z) · (0 | 1 | ... | 9 | a | ... | Z)* or, more concisely, by [a-zA-Z][a-zA-Z0-9]* Regular expressions are also powerful enough to describe complete simple assembler languages of the forms illustrated in the last chapter, although the complete expression is rather tedious to write down, and so is left as an exercise for the zealous reader.

Post a Comment

0 Comments