Introduction
This particular HTML browser project has been discontinued. Major development took place in 1996; the last minor bugfix version, dated 2002-03-18, is available for download. It has been replaced by a new ongoing project: Synx.
A simple HTML Browser realized as a Java Applet
Scanning and Parsing
To evaluate a complex language, the technique of scanners and parsers can be used. Before a text written in a (formal or informal) language can be interpreted, it must be broken into separate tokens. This lexical process is done by the scanner, which must know which tokens to recognize. The resulting TokenSequence is then checked and the tokens are semantically grouped. This grammatical process is done by the parser, which must know the rules of the language's grammar. Once all tokens have been scanned and grouped according to the grammatical rules, interpretation of the statements (or sentences) can start, depending on the specific task.
The Tokens
Tokens consist of
- a grammatical type which they belong to, like TAG
- a symbol that they match, like Assurance
- a token string which they raise when they occur, like BODY
Each token is declared in the form
type|symbol|token|
There are two kinds of tokens:
- definite tokens, explicitly naming a String that they match, like Assurance
- indefinite or regular tokens, matching all Strings that fulfil a certain Regular Expression, like [A-Z][0-9]*
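As a rough illustration of the token structure described above, here is a minimal Java sketch of a single declaration. The class TokenDeclaration and its methods are hypothetical names of mine and not the Orbital library API. The sketch assumes, as the declaration file further down suggests, that a symbol written in parentheses is a Regular Expression, and it does not handle a Regular Expression that itself contains the | separator.

```java
import java.util.regex.Pattern;

/** One declaration of the form type|symbol|token| (hypothetical class, not the Orbital API). */
final class TokenDeclaration {
    final String type;     // grammatical type, e.g. TAG
    final String symbol;   // literal text or Regular Expression to match
    final String token;    // token string raised when the symbol occurs
    final boolean regular; // true for indefinite (regular) tokens

    TokenDeclaration(String type, String symbol, String token) {
        this.type = type;
        this.symbol = symbol;
        this.token = token;
        // Assumption: a symbol wrapped in parentheses is a Regular Expression.
        this.regular = symbol.length() >= 2 && symbol.startsWith("(") && symbol.endsWith(")");
    }

    /** Parses a single declaration such as "TAG|<BODY>|BODY|". */
    static TokenDeclaration parse(String line) {
        String[] fields = line.split("\\|", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected type|symbol|token| but got: " + line);
        }
        return new TokenDeclaration(fields[0], fields[1], fields[2]);
    }

    /** Returns true if the whole text is an occurrence of this token. */
    boolean matches(String text) {
        return regular ? Pattern.matches(symbol, text) : symbol.equals(text);
    }
}
```

A definite token only matches its exact String, while a regular token matches every String accepted by its Regular Expression.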
The Scanner
The scanner is a base class that consecutively scans tokens from an input Reader or scanner. The token declarations must be provided, either directly as an array or indirectly, using the token declaration syntax, through a (File-)Reader.
This implementation scans definite tokens explicitly declared in the lexical token declaration file (like Assurance) and regular tokens specified as a Regular Expression (like [A-Z][0-9]*).
Reading from an input Reader, the scanner first tries to match the longest definite token possible. If no token matches, or no alternativeToken() can be found for the current symbolPart, all Regular Expression specifications are run through a nondeterministic automaton that matches Regular Expressions, if possible.
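The matching order described above (longest definite token first, Regular Expressions only as a fallback) can be sketched roughly as follows. This is not the original scanner: SketchScanner, declareDefinite and declareRegular are hypothetical names, the sketch works on a String instead of a Reader, and it uses java.util.regex in place of the project's own nondeterministic automaton.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scanner sketch; names are not taken from the Orbital library. */
final class SketchScanner {
    private final Map<String, String> definite = new LinkedHashMap<String, String>();  // symbol -> type
    private final Map<Pattern, String> regular = new LinkedHashMap<Pattern, String>(); // pattern -> type

    void declareDefinite(String type, String symbol) { definite.put(symbol, type); }
    void declareRegular(String type, String regex) { regular.put(Pattern.compile(regex), type); }

    /** Scans the input into "TYPE:text" entries, dropping tokens of type SKIP. */
    List<String> scan(String input) {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < input.length()) {
            String type = null;
            int length = 0;
            // 1. Try the longest definite token that starts at pos.
            for (Map.Entry<String, String> e : definite.entrySet()) {
                String symbol = e.getKey();
                if (symbol.length() > length && input.startsWith(symbol, pos)) {
                    type = e.getValue();
                    length = symbol.length();
                }
            }
            // 2. Only if no definite token matched, fall back to the regular tokens.
            if (type == null) {
                for (Map.Entry<Pattern, String> e : regular.entrySet()) {
                    Matcher m = e.getKey().matcher(input).region(pos, input.length());
                    if (m.lookingAt() && m.end() - pos > length) {
                        type = e.getValue();
                        length = m.end() - pos;
                    }
                }
            }
            if (type == null) {
                throw new IllegalArgumentException("no token matches at position " + pos);
            }
            if (!type.equals("SKIP")) {
                tokens.add(type + ":" + input.substring(pos, pos + length));
            }
            pos += length;
        }
        return tokens;
    }
}
```

Requiring every match to be at least one character long keeps the loop from stalling on a Regular Expression that accepts the empty String.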
The Parser
A parser is capable of parsing a TokenSequence and returning the Symbols that the tokens result in. For HTML, the parser is rather simple: it concatenates the tokens of type WORD and CIRCUM. Additionally, it recursively starts another HTML parser when a token of type TAG is found, and finishes when the matching ETAG (if necessary) is found.
The Interpreter - Browser
For this simple HTML Browser, the Interpreter only prints out the set of symbol words parsed and reacts to the enclosing Tags. For every Tag known by this Browser, a different graphical style is applied before the text is displayed. I know that this is not all a Browser does when displaying a page, but for a simple demonstration of a parser's possibilities it is enough, I think.
Of course, this HTML Browser respects line breaks and automatically wraps the text onto the next line if necessary. The Browser still does not follow links (which, by the way, is actually the most important facility a Browser should offer).
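To make the recursive parser and the style handling of the Interpreter more concrete, here is a small text-only sketch. All names (SketchHtmlParser, Token, Node, parse, render) are hypothetical and not the original implementation. Two assumptions are mine: the token sequence carries a SPACE token as a word separator (the source does not say how word boundaries survive once SKIP tokens are dropped), and a "graphical style" is reduced to a style name in front of each word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the recursive parser and a text-only interpreter. */
final class SketchHtmlParser {
    static final class Token {
        final String type; // WORD, CIRCUM, SCHAR, TAG, ETAG, STAG or the assumed SPACE
        final String text; // token string, e.g. "B" for <B> or "Hello" for a word
        Token(String type, String text) { this.type = type; this.text = text; }
    }

    static final class Node {
        final String tag;  // tag name of an element, null for a leaf or the root
        final String word; // text of a leaf, null for an element
        final List<Node> children = new ArrayList<Node>();
        Node(String tag, String word) { this.tag = tag; this.word = word; }
    }

    /** Parses until the ETAG matching tag is consumed, or until the input ends. */
    static Node parse(Iterator<Token> tokens, String tag) {
        Node element = new Node(tag, null);
        StringBuilder word = new StringBuilder();
        while (tokens.hasNext()) {
            Token t = tokens.next();
            if (t.type.equals("WORD") || t.type.equals("CIRCUM")) {
                word.append(t.text); // concatenate the parts of one word
                continue;
            }
            flush(word, element);    // any other token ends the current word
            if (t.type.equals("TAG")) {
                element.children.add(parse(tokens, t.text)); // recursive sub-parser
            } else if (t.type.equals("ETAG") && t.text.equals(tag)) {
                return element;      // matching end tag found
            } else if (t.type.equals("SCHAR")) {
                element.children.add(new Node(null, t.text));
            } // SPACE, STAG and stray ETAG tokens are ignored in this sketch
        }
        flush(word, element);
        return element;
    }

    private static void flush(StringBuilder word, Node element) {
        if (word.length() > 0) {
            element.children.add(new Node(null, word.toString()));
            word.setLength(0);
        }
    }

    /** Interpreter: collects every word, prefixed with the innermost known style tag. */
    static void render(Node node, String style, List<String> out) {
        if (node.word != null) {
            out.add(style + ":" + node.word);
            return;
        }
        List<String> known = Arrays.asList("H1", "H2", "H3", "B", "I");
        String inner = known.contains(node.tag) ? node.tag : style;
        for (Node child : node.children) {
            render(child, inner, out);
        }
    }
}
```

Calling parse(tokens.iterator(), null) builds the tree for a whole page, and render(root, "PLAIN", out) then walks it with "PLAIN" as the default style. A real display would select a font instead of a prefix and would also take care of wrapping lines.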
The token declaration file
For the HTML Browser, the lexical token declaration file is:
    TAG|<HTML>|HTML|
    ETAG|</HTML>|HTML|
    TAG|<HEAD>|HEAD|
    ETAG|</HEAD>|HEAD|
    TAG|<BODY>|BODY|
    ETAG|</BODY>|BODY|
    TAG|<H1>|H1|
    ETAG|</H1>|H1|
    TAG|<H2>|H2|
    ETAG|</H2>|H2|
    TAG|<H3>|H3|
    ETAG|</H3>|H3|
    TAG|<B>|B|
    ETAG|</B>|B|
    TAG|<I>|I|
    ETAG|</I>|I|
    STAG|<BR>|BR|
    STAG|<P>|P|
    STAG|</P>|/P|
    WORD|([a-zA-Z]+)|=|
    SCHAR|([_.,;:!])|=|
    CIRCUM|&auml;|ä|
    CIRCUM|&ouml;|ö|
    CIRCUM|&uuml;|ü|
    CIRCUM|&szlig;|ß|
    CIRCUM|&eacute;|é|
    SKIP|([ ]+)|=|
    SKIP| | |
For this HTML Browser example, the types are
- TAG - represents a beginning tag like <A>.
- ETAG - represents an ending tag like </A>.
- STAG - represents single tags that don't need a matching end tag like <BR>.
- WORD - regular expression that represents any natural language word (made of the plain letters a-z and A-Z).
- SCHAR - represents a special character such as a full stop or an exclamation mark.
- CIRCUM - represents the HTML Circumscription chars for Unicode characters, like &eacute; for é.
- SKIP - a type internal to the scanner; all tokens of this type are regarded as insignificant and are skipped.
Java HTML Browser
Requires the Java 2 Platform or the Java Collections Framework. The application displays the parsed HTML page in a separate window.
Download
If you want to test the simple HTML browser or scanner & parser:
- download the sources which are part of the examples in the Orbital library 1.0 documentation
- Note that our simple browser project is discontinued in favor of Synx, so the simple browser is no longer contained in Orbital library release 1.1, but only in 1.0.