Parser-Example: HTML-Browser

Introduction

This particular HTML browser project has been discontinued. Major development took place in 1996; the last minor bugfix version, dated 2002-03-18, is available for download. It has been succeeded by a new ongoing project: Synx.

A simple HTML Browser realized as a Java Applet

Scanning and Parsing

To evaluate a complex language, the technique of scanners and parsers can be used. Before a text written in a (formal or informal) language can be interpreted, it must be broken into separate tokens. This lexical process is done by the scanner, which must know which tokens to recognize. The resulting TokenSequence is then checked and the tokens are grouped semantically. This grammatical process is done by the parser, which must know the rules of the language's grammar. Once all tokens have been scanned and grouped according to the grammatical rules, the interpretation of the statements (or sentences) can start, depending on the specific task.

The Tokens

Tokens consist of

a grammatical type which they belong to, like TAG
a symbol that they match, like Assurance
a token string which they raise when they occur, like BODY

Tokens, if declared literally in a file, are specified as:

type|symbol|token|
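
A declaration line of this kind can be read with very little code. The following is a minimal sketch, assuming the fields are separated by `|` as in the token declaration file for the HTML Browser further down. The class name `TokenDecl` and its `parse` method are hypothetical and are not taken from the Orbital library.

```java
// Hypothetical holder for one declared token (not the Orbital library's class).
final class TokenDecl {
    final String type;    // grammatical type, e.g. TAG
    final String symbol;  // text or regular expression to match, e.g. <BODY>
    final String token;   // token string raised on a match, e.g. BODY

    TokenDecl(String type, String symbol, String token) {
        this.type = type;
        this.symbol = symbol;
        this.token = token;
    }

    // Parses one line of the assumed form type|symbol|token|
    static TokenDecl parse(String line) {
        // limit -1 keeps trailing empty fields, so a missing field is detected
        String[] fields = line.split("\\|", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected type|symbol|token|: " + line);
        }
        return new TokenDecl(fields[0], fields[1], fields[2]);
    }
}
```

For example, `TokenDecl.parse("TAG|<BODY>|BODY|")` yields a declaration with type `TAG`, symbol `<BODY>` and token `BODY`.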

Two kinds of tokens exist:

definite tokens, which explicitly name the String that they match, like Assurance
indefinite or regular tokens, which match all Strings that satisfy a certain Regular Expression, like [A-Z][0-9]*
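
The difference between the two kinds can be illustrated with a small matcher. This is only a sketch: the class `SymbolMatcher` and its factory methods are hypothetical names, and `java.util.regex` is used here in place of whatever matching machinery the original implementation contains.

```java
import java.util.regex.Pattern;

// Hypothetical matcher that distinguishes definite from regular tokens.
final class SymbolMatcher {
    private final String literal;  // set for definite tokens, otherwise null
    private final Pattern regex;   // set for regular tokens, otherwise null

    private SymbolMatcher(String literal, Pattern regex) {
        this.literal = literal;
        this.regex = regex;
    }

    // A definite token matches exactly one String.
    static SymbolMatcher definite(String symbol) {
        return new SymbolMatcher(symbol, null);
    }

    // A regular token matches every String that satisfies the expression.
    static SymbolMatcher regular(String expression) {
        return new SymbolMatcher(null, Pattern.compile(expression));
    }

    // True if the whole text is matched by this symbol.
    boolean matches(String text) {
        return literal != null ? literal.equals(text) : regex.matcher(text).matches();
    }
}
```

With this sketch, `SymbolMatcher.definite("Assurance")` accepts only the String `Assurance`, while `SymbolMatcher.regular("[A-Z][0-9]*")` accepts `A`, `A42`, `Z7` and so on.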

The Scanner

The scanner is a base class that consecutively scans tokens from an input Reader or another scanner. The token declarations must be provided, either directly as an array or indirectly in the token declaration syntax through a (File-)Reader.
This implementation scans definite tokens explicitly declared in the lexical token declaration file (like Assurance) and regular tokens specified as a Regular Expression (like [A-Z][0-9]*).
Reading from an input Reader, the scanner first tries to match the longest definite token possible. If no token matches or no alternativeToken() can be found for the current symbolPart, then all Regular Expression specifications are run through a nondeterministic automaton, which matches a regular token if possible.
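
The matching order described above (longest definite token first, regular tokens as a fallback) can be sketched as follows. This is not the original scanner: `SketchScanner` is a hypothetical class, `java.util.regex` stands in for the nondeterministic automaton, and `alternativeToken()` is not modelled. Two conventions are assumptions inferred from the declaration file further down: a symbol wrapped in parentheses is treated as a Regular Expression, and a token string of `=` is taken to mean "use the matched text".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scanner sketch; each declaration is {type, symbol, token}.
final class SketchScanner {
    private final List<String[]> definite = new ArrayList<>();
    private final List<String[]> regular = new ArrayList<>();

    SketchScanner(String[][] declarations) {
        for (String[] d : declarations) {
            // Assumption: parenthesized symbols are Regular Expressions.
            boolean isRegex = d[1].startsWith("(") && d[1].endsWith(")");
            (isRegex ? regular : definite).add(d);
        }
    }

    // Scans the whole input and returns "TYPE:token" strings, omitting SKIP tokens.
    List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String[] best = null;
            String text = null;
            int bestLen = 0;
            // 1. Try the longest definite token that starts at pos.
            for (String[] d : definite) {
                if (d[1].length() > bestLen && input.startsWith(d[1], pos)) {
                    best = d;
                    text = d[1];
                    bestLen = d[1].length();
                }
            }
            // 2. Only if no definite token matched, try the regular tokens.
            if (best == null) {
                for (String[] d : regular) {
                    // Compiled on each use for brevity; a real scanner would cache this.
                    Matcher m = Pattern.compile(d[1]).matcher(input).region(pos, input.length());
                    if (m.lookingAt() && m.end() - pos > bestLen) {
                        best = d;
                        text = m.group();
                        bestLen = m.end() - pos;
                    }
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no token matches at position " + pos);
            }
            if (!best[0].equals("SKIP")) {
                // Assumption: "=" means the token string is the matched text itself.
                out.add(best[0] + ":" + (best[2].equals("=") ? text : best[2]));
            }
            pos += bestLen;
        }
        return out;
    }
}
```

Given declarations for `<B>`, `</B>`, `&auml;`, WORD and SKIP as in the declaration file, scanning `<B>sp&auml;t now</B>` with this sketch produces a TAG, the pieces of the two words, and the closing ETAG, with the blank dropped as a SKIP token.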

The Parser

A parser is capable of parsing a TokenSequence and returning the Symbols that the tokens result in. For HTML the parser is rather simple: it concatenates the tokens of type WORD and CIRCUM. Additionally, it recursively starts another HTML parser when a token of type TAG is found, and finishes when the matching ETAG (if one is required) is found.
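
The recursive structure of such a parser can be sketched in a few lines. The classes `Node` and `SketchParser` below are hypothetical and only illustrate the idea; tokens are represented as plain {type, token} pairs. How the original parser keeps neighbouring words apart is not described here, so the sketch simply joins adjacent WORD and CIRCUM tokens into one text run and ignores STAG and SCHAR tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical parse-tree node: a tag name with mixed children.
final class Node {
    final String tag;
    final List<Object> children = new ArrayList<>();  // String text or nested Node

    Node(String tag) { this.tag = tag; }

    @Override public String toString() { return tag + children; }
}

// Hypothetical recursive parser over {type, token} pairs.
final class SketchParser {
    // Convenience entry point: parses all tokens below an artificial ROOT node.
    static Node parse(String[][] tokens) {
        return parse(Arrays.asList(tokens).iterator(), "ROOT");
    }

    static Node parse(Iterator<String[]> tokens, String tag) {
        Node node = new Node(tag);
        StringBuilder text = new StringBuilder();
        while (tokens.hasNext()) {
            String[] t = tokens.next();
            String type = t[0];
            if (type.equals("WORD") || type.equals("CIRCUM")) {
                text.append(t[1]);  // concatenate into the current text run
                continue;
            }
            flush(node, text);
            if (type.equals("TAG")) {
                // A start tag launches a nested parser for the enclosed content.
                node.children.add(parse(tokens, t[1]));
            } else if (type.equals("ETAG") && t[1].equals(tag)) {
                return node;        // the matching end tag finishes this parser
            }
            // Other token types are ignored in this sketch.
        }
        flush(node, text);
        return node;
    }

    // Moves any pending text run into the node's children.
    private static void flush(Node node, StringBuilder text) {
        if (text.length() > 0) {
            node.children.add(text.toString());
            text.setLength(0);
        }
    }
}
```

Because the same iterator is passed down, the nested parser consumes exactly the tokens up to its matching ETAG, and the outer parser continues right after it.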

The Interpreter - Browser

For this simple HTML Browser, the Interpreter only prints the symbol words that were parsed and reacts to the enclosing tags. For every tag known to this Browser, a different graphical style is applied before the text is displayed. I know that this is not all a Browser does when displaying a page, but for a simple demonstration of a parser's possibilities it is enough, I think.

Of course, this HTML Browser respects line breaks and automatically wraps the text onto the next line if necessary. The Browser still does not follow links (which, by the way, is actually the most important facility a Browser should offer).
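
Automatic wrapping of this kind is usually done greedily, word by word. The following sketch shows the idea under a simplifying assumption: widths are counted in characters, whereas a graphical browser would measure the rendered width of each word in the current font. The class `LineWrapper` is a hypothetical name and is not part of the original sources.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical greedy line wrapper measuring width in characters.
final class LineWrapper {
    // Puts words on the current line until the next word would exceed maxWidth.
    static List<String> wrap(String[] words, int maxWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : words) {
            // Start a new line if the word plus a separating blank does not fit.
            if (line.length() > 0 && line.length() + 1 + word.length() > maxWidth) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);  // a word longer than maxWidth gets a line of its own
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }
}
```

Wrapping the words `a`, `simple`, `HTML`, `browser` at a width of 11 characters gives the three lines `a simple`, `HTML` and `browser`.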

The token declaration file

For the HTML Browser, the lexical token declaration file is:

TAG|<HTML>|HTML|
ETAG|</HTML>|HTML|
TAG|<HEAD>|HEAD|
ETAG|</HEAD>|HEAD|
TAG|<BODY>|BODY|
ETAG|</BODY>|BODY|
TAG|<H1>|H1|
ETAG|</H1>|H1|
TAG|<H2>|H2|
ETAG|</H2>|H2|
TAG|<H3>|H3|
ETAG|</H3>|H3|
TAG|<B>|B|
ETAG|</B>|B|
TAG|<I>|I|
ETAG|</I>|I|
STAG|<BR>|BR|
STAG|<P>|P|
STAG|</P>|/P|
WORD|([a-zA-Z]+)|=|
SCHAR|([_.,;:!])|=|
CIRCUM|&auml;|ä|
CIRCUM|&ouml;|ö|
CIRCUM|&uuml;|ü|
CIRCUM|&szlig;|ß|
CIRCUM|&eacute;|é|
SKIP|([ ]+)|=|
SKIP|
| |

For this HTML Browser example, the types are

TAG - represents a beginning tag like <A>.
ETAG - represents an ending tag like </A>.
STAG - represents single tags that don't need a matching end tag like <BR>.
WORD - a regular expression that represents any natural language word made up of the letters a-z and A-Z.
SCHAR - represents a special character like full stops and exclamation marks.
CIRCUM - represents the HTML circumscriptions of Unicode characters, like &eacute; for é.
SKIP - a type internal to the scanner which marks tokens as irrelevant, so that they are skipped.

Java HTML Browser

Requires the Java 2 Platform or the Java Collections Framework. The application displays the parsed HTML page in a separate window.

Download

If you want to test the simple HTML browser or scanner & parser:

download the sources which are part of the examples in the Orbital library 1.0 documentation
Note that our simple browser project is discontinued in favor of Synx, so the simple browser is no longer contained in Orbital library release 1.1, but only in 1.0.

The Future

See introduction.