My Programs: Experimental Web Browser

Years ago, I started to design and code my own graphical HTML browser. However, once graphical browsers became universally established, this at first appeared to be no longer necessary. Notwithstanding, since I cannot know what else mainstream browsers may be doing inside my computer, it may well become expedient to resurrect this project.

In particular, my HTML tag filter-cum-function caller and character entity substituter could be gainfully employed by automatic document search and classification tools. They could also be adapted to search for and expedite additional non-standard tags in special-purpose bespoke Web applications.

Design Criteria

It must be able to display diagrams and photographs in the text stream. Therefore it must be graphics-based. Decision: make it a graphics-based program. It will flow text so as to fit the current width of the window.

However, it must provide for the <listing> and <code> tags which display text 'as is'. This may mean that text will have to be right-chopped. In this case it must provide for horizontal scrolling. This is also necessary for wide images when they will not fit completely within the current window width.

Design Questions

Do I (1) make it translate the entire HTML file into a graphics map before display and then scroll the map? This will require a limit on galley width. Or, if I use the maximum galley width encountered in the file, it may mean using up absolutely vast amounts of memory to accommodate it.

Or do I (2) map only what is currently to be displayed? Then I can make the image as wide as the widest line within the field of display. The biggest problem here is scrolling backwards. How do I determine where the beginning of the previous line is in the file? Here, display speed could be a problem. However, it will require far less disk space to accommodate files.

Since I may need to store lots of HTML and GIF files on the disk, it is sensible to adopt choice (2). The program will therefore need the following.

HTML File Pointers

It must have two file pointers FP1 and FP2 which mark the beginning and end of the section of the HTML file stream currently being displayed. These pointers will not necessarily always be the same distance apart for a given window size. This is because they will have to traverse non-displayable parts of the file, namely the HTML tags and their attributes. The positions of these two file pointers must therefore be computed separately and independently every time one of them is moved.

When a given HTML file is first displayed, FP1 remains at the start of the file while FP2 is advanced. As FP2 advances, the HTML tags it encounters are expedited and the text is displayed. This continues until the window is full.

With <listing> and <code> tags, text flow may be interrupted as follows. As the window's right margin is reached at the end of each line of text, display is suppressed, but scanning of the text is continued until the full virtual galley width has been reached.

Scrolling

To move forwards in the file (i.e. down the page) FP2 is advanced until a complete new displayable line has been abstracted from the HTML file. FP1 is then advanced to the beginning of what was the second line of the display, which is now the first. This is made easier if a file pointer FP3 is always used to keep the start position of the second row of the current display. Then all I have to do is copy FP3 into FP1 and, while re-displaying the window, put the start of the new second line into FP3.

To move backwards is more difficult. First scan backwards from FP1 until enough text or graphics image has been re-abstracted to produce the previous complete line. The old value of FP1 is put in FP3, and the start of the re-abstracted line becomes the new FP1. FP2 is then moved back to the beginning of the previous line. It is therefore a good idea also to keep the next-to-last line pointer in FP4 for this purpose.

Another approach would be to scan the whole HTML file before it is displayed to find the positions of the beginning of each displayable line within the HTML file. These would have to be re-computed if the window were re-sized. However, a re-sizing event does not occur anywhere nearly as often as a scrolling event. Therefore, this could save a lot of time and complication for the scrolling operation. An array of file pointers FP[] could be defined into which the places where the displayable lines start in the HTML file are put during a pre-display scan of the whole file.

HTML files tend to be short as text files go. They are usually no more than a page of A4 — about 64 occupied lines. An array FP[1024] would comfortably accommodate files of 16 pages — about a chapter. However, this could be a dynamic array which is allocated at the time the file is scanned. I think this is the best approach.
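A minimal sketch of such a pre-display scan is given below. It is not real layout code: it assumes that the whole HTML file is already in memory as a string, that every character has the same width, and that a line simply wraps after cols displayed characters. The name BuildLineIndex and its parameters are hypothetical. It does show the two essential points, namely that tags occupy no space on the line and that the array grows as the file is scanned.

```c
#include <stdlib.h>

/* Hypothetical helper: record the byte offset at which each displayable
   line starts. Tags are skipped; text wraps after 'cols' displayed
   characters (a fixed-pitch font is assumed for simplicity). */
long *BuildLineIndex(const char *html, int cols, size_t *count) {
  size_t cap = 64, n = 0;          /* capacity and number of entries */
  long *fp = malloc(cap * sizeof *fp);
  int col = 0, intag = 0;          /* current column, inside-tag flag */
  long i;

  if(fp == NULL) return NULL;
  for(i = 0; html[i] != '\0'; i++) {
    char c = html[i];
    if(c == '<') { intag = 1; continue; }      /* tags take no columns */
    if(intag) { if(c == '>') intag = 0; continue; }
    if(col == 0) {                 /* first displayed char of a line */
      if(n == cap) {               /* grow the array when it is full */
        long *tmp = realloc(fp, (cap *= 2) * sizeof *fp);
        if(tmp == NULL) { free(fp); return NULL; }
        fp = tmp;
      }
      fp[n++] = i;                 /* note where this line starts */
    }
    if(++col == cols) col = 0;     /* line is full: wrap to the next */
  }
  *count = n;
  return fp;                       /* caller must free() the array */
}
```

For the string "<b>abcdef</b>ghij" and a width of 4 columns, the scan records three line starts, at byte offsets 3, 7 and 15.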

Scrolling now becomes much simpler. The program simply moves up or down the array as required. It then abstracts the new line from the HTML file. It then copies the screen contents (pixel-by-pixel) up or down by the number of pixels equivalent to 1 line of text. Finally, it lays in the new line. So to progress one line along the HTML file (i.e. down the page), the process is as shown below.

Line pointer array when scrolling down a page.

And to progress one line backwards along the HTML file (i.e. up the page), the process is as shown below.

Line pointer array when scrolling up a page.

There is also a need to be able to advance one whole window full at a time. This is done as follows:

Line pointer array when scrolling up or down a whole window.

The number of lines per window-full, n, varies with the current window height and with the point-size of the current font. It must therefore be re-calculated whenever either of these is changed (which isn't very often). The pointers p1 and p2 are pointers to file-pointers. The base pointer p is also a pointer to a file-pointer and points to the start of the dynamic array of file pointers. To advance along the HTML file by one window-full, proceed as follows.

p1 += n; p2 += n;   //increment the two pointers by n lines.

Then, starting at point *p1 (i.e. from the byte position within the HTML file stream held in the array element pointed to by p1) in the HTML file, display the interpreted contents of the file. To go back, proceed as before except decrement p1 and p2, viz.

p1 -= n; p2 -= n;   //decrement the two pointers by n lines.
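The bare increments above would carry p1 and p2 off either end of the array near the start or end of the file. The sketch below shows one way the move might be clamped. It assumes that file positions are held as long values, that p2 marks one element past the last displayed line, and that total is the number of lines recorded in the array. The name MoveWindow and these conventions are assumptions, not part of the original design.

```c
#include <stddef.h>

/* Hypothetical helper: move the window by 'delta' lines (+n or -n for a
   whole window-full) while keeping both pointers inside the array.
   base  = start of the array of line-start file positions
   total = number of displayable lines recorded in the array
   n     = number of lines that fit in the window */
void MoveWindow(long *base, size_t total, size_t n,
                long **p1, long **p2, long delta) {
  long top  = (long)(*p1 - base) + delta;           /* proposed first line */
  long last = (total > n) ? (long)(total - n) : 0;  /* highest first line */
  if(top < 0)    top = 0;          /* do not run off the start */
  if(top > last) top = last;       /* do not run off the end */
  *p1 = base + top;
  *p2 = *p1 + ((total < n) ? total : n);  /* one past last shown line */
}
```

With 10 recorded lines and a 4-line window starting at line 0, two forward moves of 4 leave p1 at line 4 and then at line 6 (not 8), so that the last window-full is always a complete one.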

Hyper-Link Mapping

The positions of hyper-text within the window have to be registered. This is best done by an array of co-ordinates. The resolution required is that of a pixel. Remember that hyper-links can be hot-spots on geographic maps. A pair of 16-bit co-ordinates will therefore suffice.

What is the best way of expressing the positions of hyper-text within a window? The first consideration is that the test to see if the mouse pointer is within a hot-spot must be kept as fast and as simple as possible. It has to be done every time the mouse moves. The hot-spots must therefore be expressed in a form as close to mouse co-ordinates as possible. At the very least they should be in window-relative pixels. For each item of hyper-text you need to store the diagonal co-ordinates of the pixel box containing it, together with the URL with which it is associated.

Data structures for hyperlinks.

Hyper-Link Data Structure

For each hyper-link on the screen, a data structure is needed to accommodate the above details. A suitable data structure is as follows.

struct hyperlink {    //define a hyperlink data structure
  int y1;             //Y-co-ordinate of top left of hot-box
  int y2;             //Y-co-ordinate of bottom right of hot-box
  int x1;             //X-co-ordinate of top left of hot-box
  int x2;             //X-co-ordinate of bottom right of hot-box
  FILE *url;          //file pointer to stream containing its URL
}
  *HL[256];           /*declare an array of pointers to 
                        hyperlink structures */

A 256-element array of pointers to structures of this type is declared. For the time being it will be presumed to be global (i.e. declared outside all functions). I think that an upper limit of 256 hyperlinks per screen is more than adequate. Memory for each structure will be allocated using malloc() each time a hyper-link is encountered when a new window full of text/graphics is displayed.

HyperCheck Function

Next is required a function to check whether or not the mouse is currently over hyper-text. This function must receive the current window co-ordinates of the mouse and the current number of hyper-links displayed on the screen (i.e. the number of hyper-links to be tested).

FILE *HyperCheck  //check whether or not mouse is over hyper-text
(
   int x,          //pixel x-co-ordinate of mouse within window
   int y,          //pixel y-co-ordinate of mouse within window
   int k           //number of hyper-links currently on screen
) {
   struct hyperlink **p;    //pointer to an element of HL[]
   FILE *f = NULL;          //pointer to the hyper-link URL

   for(p = HL; k > 0; k--, p++) {

      if(y > (*p)->y1 && y < (*p)->y2 &&
         x > (*p)->x1 && x < (*p)->x2) {
         f = (*p)->url;     //set hyper-link pointer to URL
         break;             //stop at the first hot-box hit
      }
   }
   if(f != NULL && cursor == arrow) {
      [change mouse cursor from an arrow to a hand];
   }
   else if(f == NULL && cursor == hand) {
      [change mouse cursor from hand to arrow];
   }
   return(f);               //return pointer to hyper-link URL
}

The hyper-text pointed to by the mouse, if any, has a corresponding URL stored within the HTML file. The function returns a file pointer which points to the text of the URL within the HTML file. If the mouse is not currently pointing at any hyper-text it returns a NULL file pointer.
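The hot-box test itself can be tried in isolation with the following sketch. It is a simplified stand-in, not the function above: the structure name hotbox and the function name HitTest are hypothetical, and the URL is held as an ordinary string rather than as a file pointer so that the example is self-contained. As in HyperCheck, the comparisons are strict, so a point lying exactly on the edge of a box does not count as a hit.

```c
#include <stddef.h>

struct hotbox {            /* hypothetical stand-in for struct hyperlink */
  int x1, y1;              /* top left corner, window-relative pixels */
  int x2, y2;              /* bottom right corner */
  const char *url;         /* URL text (a string here, not a FILE *) */
};

/* Return the URL of the first hot-box containing (x, y), or NULL
   if the point lies outside all k boxes. */
const char *HitTest(struct hotbox **hl, int k, int x, int y) {
  int i;
  for(i = 0; i < k; i++)
    if(x > hl[i]->x1 && x < hl[i]->x2 &&
       y > hl[i]->y1 && y < hl[i]->y2)
      return hl[i]->url;
  return NULL;
}
```

Because the boxes are already in window-relative pixels, each test costs at most four integer comparisons, which is cheap enough to repeat on every mouse movement.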

HyperLink Function

A function is now needed which is called automatically whenever the left mouse button is pressed. This function must check whether or not the mouse pointer is over hyper-text, and if so, read the corresponding URL from the HTML file and expedite the loading of the new file...

HTML Tag Filter

A filter function is needed to filter out all the HTML tags as they are encountered within the text stream of an HTML file. Then a set of functions is required to interpret and expedite each tag type. First the filter.


int TagFilter(             //FILTER HTML TAGS FROM TEXT STREAM
  int c                    //next character from HTML file
) {
  static char tag[256],    //array in which to capture a tag
  *t;                      //pointer to next free char in tag[]
  static int tc,           //count of number of chars in tag
  o,                       //offset of params from start of tag
  e;                       //0 = start tag, 1 = end tag
  if(c == '<') {           //start delimiter for an HTML tag
    t = tag;               //set t to start of tag array
    tc = 1;                //set that we are in an HTML tag
    o = e = 0;             //no params and not an end-tag so far
    c = 0;                 //the delimiter itself is not text
  }
  else if(tc > 0) {        //if we are currently in a tag
    if(c == '>') {         //end-delimiter for an HTML tag
      *t = '\0';           //put null at end of tag string
      tc = 0;              //set that we are in normal text
      TagInt(tag, o, e);   //appropriate tag interpreter
    }
    else if(tc == 1 && c == '/')  //a slash straight after the <
      e = 1;               //marks an end-tag; it is not stored
    else if(tc < 256) {    //tag still within length limit
      if(c == ' ' && o == 0) {    //first space ends the tag name
        c = '\0';          //make it a null
        o = tc;            //params start after this char
      }
      else if(o == 0 && c >= 'A' && c <= 'Z')  //capital in name
        c += 32;           //convert it to lower case
      *t++ = c;            //add char to tag buffer
      tc++;
    }
    else                   //if the tag string is too long
      tc = 0;              //so ignore the whole tag
    c = 0;                 //return a null char if in a tag
  }
  return(c);               //return given character or a null
}

As a character is received from the HTML file, it is passed to the above function. If it is part of the normal text, the above function simply returns it unchanged. However, if it is part of an HTML tag, it captures the character and returns a null.
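The calling pattern can be sketched as follows. To keep the example self-contained it uses a cut-down stand-in for the filter, here called FilterChar, which merely discards tags instead of capturing and interpreting them. FilterChar and FilterText are hypothetical names. The point is the driver loop, which passes every character through the filter and keeps only what comes back non-null.

```c
#include <stddef.h>

/* Cut-down stand-in for TagFilter(): returns the character unchanged
   if it is normal text, or 0 if it belongs to a tag. Tags are simply
   discarded here rather than captured and interpreted. */
int FilterChar(int c) {
  static int intag;              /* 1 while between '<' and '>' */
  if(c == '<') { intag = 1; return 0; }
  if(intag) { if(c == '>') intag = 0; return 0; }
  return c;
}

/* Hypothetical driver: pass every character of 'html' through the
   filter and copy whatever comes back non-null into 'out', which
   holds 'size' bytes. Returns the number of characters kept. */
size_t FilterText(const char *html, char *out, size_t size) {
  size_t n = 0;
  int c;
  if(size == 0) return 0;
  while(*html != '\0') {
    c = FilterChar((unsigned char)*html++);
    if(c != 0 && n + 1 < size)   /* leave room for the final null */
      out[n++] = (char)c;
  }
  out[n] = '\0';
  return n;
}
```

Given the input "<b>Hi</b> there", the driver leaves the 8 characters "Hi there" in the output buffer.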

Match Check

First we need a function to compare a tag name captured from the HTML file with a tag name in an array of valid tag names, to see whether the captured name is the same, higher or lower in alphabetical order.


int MatchCheck(
  char *p,          //ptr to tag name captured from HTML file
  char *q           //ptr to tag name in array of valid tag names
) {
  int c,            //character within captured tag name
      d,            //character within tag name in array
      flag = 0;     //indicates whether same, higher or lower
  do {              //for each char of the shorter tag
    if(( c = *p++) != (d = *q++)) {   //if chars are not the same
      if(c > d)     //if char in captured tag > char in array tag
        flag = +1;  //set captured tag is higher than array tag
      else          //otherwise
        flag = -1;  //set captured tag is lower than array tag
      break;        //go and return the flag value
    }
  } while(c != 0 && d != 0);  //end loop if either char = null
  return(flag);
}

The do-loop tests at the end rather than at the beginning. This allows the null at the end of the shorter of the two tags to be compared with the corresponding character in the other tag before the loop termination test is done. Thus, if the tags are the same, e.g.

  base0        captured tag     0 = terminating null character
  base0        tag in array

then the last things compared are the two terminating nulls which are also the same. If the tags are of different lengths but the short tag is the same as the first part of the long one, e.g.

  base0        captured tag     0 = terminating null character
  basefont0    tag in array

then the terminating null of the shorter one is compared with the corresponding non-null character of the longer one. The inequality caught by the first if-statement causes the loop to be broken with the flag set to -1 or +1 as appropriate. If the tags are altogether different, e.g.

  base0        captured tag     0 = terminating null character
  body0        tag in array

then the first if-statement captures the first pair of unequal characters (in this case the a and the o) and the loop is again broken with the flag set to -1 or +1 as appropriate.
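The three cases can be checked with a copy of the function. The version below is the same logic in condensed form, with const added to the parameters so that string literals can be passed to it. That change is mine and is only for the convenience of the example.

```c
/* Condensed copy of MatchCheck(): returns 0 if the two names are the
   same, +1 if the captured name p is higher in the alphabet than the
   array name q, and -1 if it is lower. */
int MatchCheck(const char *p, const char *q) {
  int c, d, flag = 0;
  do {
    if((c = *p++) != (d = *q++)) {  /* first pair of unequal chars */
      flag = (c > d) ? +1 : -1;
      break;
    }
  } while(c != 0 && d != 0);        /* stop after comparing a null */
  return flag;
}
```

MatchCheck("base", "base") returns 0. MatchCheck("base", "basefont") returns -1 because the terminating null is lower than the letter f. MatchCheck("base", "body") also returns -1 because a is lower than o.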

Tag Search

The next task I have to do is search for the given tag within the list of valid HTML tags held in the tags[] array. As HTML is extended, this list will grow. It is therefore expedient to design this search to be as fast as possible. The search technique I shall adopt is therefore the binary slice search as illustrated below.

Data structure for tag search.

The binary slice search starts half way up the alphabetically sorted tag names list. It compares the tag from the HTML file (base) with the middle tag in the array. It does not match. The array element is too high. That is, it is too far along the alphabet (i.e. too far up the array because 'a' is at the bottom and 'z' at the top). It therefore tries the element half way down the lower half of the array. It does not match this one either. It therefore divides the distance in half again. This time it hits the element containing the basefont tag name. This is still too high. It divides the skip distance in two again and tries that element. Eureka! It hits the element containing the tag name base. It notes its number within the tags array as the means of validating and identifying the HTML tag.

The direction of the skip to the next element (i.e. whether it is up or down the array) is determined by whether the tag name from the HTML file is too low or too high compared with the current array element. Furthermore, when the skip distance 'j' is halved, the result is truncated. This is best done simply by right-shifting the amount by one bit in its register. Provided the first skip is no more than half the starting index, this keeps the index within the range of the array while the skip distance is being halved. When the skip distance gets down to 1, it is kept there. It is not divided and truncated to 0, otherwise the index would never move to the next element.
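The skipping scheme can be sketched for a sorted table of any size as shown below. This is an illustration under stated assumptions rather than the function developed in this article: the name SkipSearch is hypothetical, the library function strcmp() does the comparison, and a pass counter is added so that the search gives up on a name which is not in the table instead of stepping back and forth between two neighbouring elements.

```c
#include <string.h>

/* Sketch of the halving-skip search over a sorted table of n strings
   (lowest at element 0). Returns the index of 'key', or -1 if absent. */
int SkipSearch(const char *const *tab, int n, const char *key) {
  int k = n / 2;                 /* start near the middle */
  int j = (k > 1) ? k / 2 : 1;   /* first skip distance */
  int passes = n;                /* cap, so an absent key cannot loop */
  while(k >= 0 && k < n && passes-- > 0) {
    int y = strcmp(key, tab[k]);
    if(y == 0) return k;         /* found */
    k += (y > 0) ? j : -j;       /* key is higher: move up the table */
    if(j > 1) j >>= 1;           /* halve the skip, never below 1 */
  }
  return -1;
}
```

With the eight-element table { "a", "b", "base", "basefont", "body", "br", "code", "p" }, a search for "p" visits elements 4, 6 and 7 in turn. A search for "bas", which is absent, ends with -1 once the pass counter runs out.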

I must now build this binary search into a function which can be called by the filter function. It must, on presentation of a pointer to the HTML tag name captured from the HTML file, return the index number of the given tag name. If it cannot find the tag in its array of valid HTML tags, it must indicate this by returning -1. This allows the array of HTML tag name pointers, tags[] to be private to this function.

My rendering of the tag search function is as follows. The pointer 't' points to a null-terminated string which contains the captured tag's name only, not its arguments if any. There are 71 tag names. Arrays in C begin at element zero. The elements of tags[] are therefore 0 to 70. The search starts next to the mid-point, at element 36. This is the value to which k is initialised. The initial skip distance j is half of this, which is 18. These initial values must be updated if tag names are added to the array as HTML is extended.


int TagSearch(char *t, int e) {   //t points to captured tag name
  static char *tags[] = {         //all valid HTML tag names
    "a", "address", "applet", "area", "b", "base", "basefont",
    "bgsound", "big", "blink", "blockquote", "body", "br",
    "caption", "center", "cite", "code", "comment", "dd", "dfn",
    "dir", "div", "dl", "dt", "em", "font", "form", "frame",
    "frameset", "head", "hn", "hr", "html", "i", "img", "input",
    "isindex", "kbd", "li", "link", "listing", "map", "marquee",
    "menu", "meta", "nextid", "nobr", "noframes", "ol", "p",
    "param", "plaintext", "pre", "samp", "select", "small",
    "strike", "strong", "sub", "sup", "table", "td", "textarea",
    "th", "title", "tr", "tt", "ul", "var", "wbr", "xmp"
  };
  int k = 36,               //start point of search within array
      j = 18,               //first index shift amount
      n = 71,               //more passes than any search needs
      x = -1,               //returned index number of found tag
      y;                    //result returned by MatchCheck()
  while(k >= 0 && k < 71 && n-- > 0) { //while in range, not stuck
    if((y = MatchCheck(t, *(tags + k))) == 0) {
      x = k;                //set index number of the found tag
      break;                //break out of the while loop
    } else {                //if this array element did not match
      if(y > 0)             //if given tag was higher in alphabet
        k += j;             //shift up the array
      else
        k -= j;             //otherwise shift down the array
      if(j > 1)             //if the shift increment > 1 then
        j >>= 1;            //halve it, otherwise leave it = 1
    }
  }
  return(x);                //return index number of given tag
}                           //this is -1 if tag was not found

The while loop ends the search if the index k runs off either end of the array or if the pass count n is used up. The pass count matters because a name which is absent, but which would fall between two adjacent elements, would otherwise leave k stepping back and forth between them for ever. If the given tag matches that pointed to by the kth element of tags[], MatchCheck() returns 0. In this case, the while loop is broken to return the element number of the validated tag. If MatchCheck() returns a positive value, it means the given tag name is further up the alphabet than the kth element of tags[]. Since the array is sorted with 'a' at element 0, I therefore shift k up the array by an amount j. If MatchCheck() returns a negative value, I do the opposite. I then halve j by right-shifting it by one bit position ready for the next pass of the while loop.

Combined Function

MatchCheck() is called repeatedly within the while loop of TagSearch(). For the sake of speed, it is therefore expedient to see whether or not MatchCheck() can be incorporated into TagSearch() and thereby reduce the overall amount of code needed. And as can be seen below, indeed it can. In fact, in the event of a mis-match, the index k is now shifted inside the central do loop just before it is broken.


int TagSearch(char *t, int e) {  //points to name of captured tag
  static char *tags[] = {        //all valid HTML tag names
    "a", "address", "applet", "area", "b", "base", "basefont",
    "bgsound", "big", "blink", "blockquote", "body", "br",
    "caption", "center", "cite", "code", "comment", "dd", "dfn",
    "dir", "div", "dl", "dt", "em", "font", "form", "frame",
    "frameset", "head", "hn", "hr", "html", "i", "img", "input",
    "isindex", "kbd", "li", "link", "listing", "map", "marquee",
    "menu", "meta", "nextid", "nobr", "noframes", "ol", "p",
    "param", "plaintext", "pre", "samp", "select", "small",
    "strike", "strong", "sub", "sup", "table", "td", "textarea",
    "th", "title", "tr", "tt", "ul", "var", "wbr", "xmp"
  };
  int k = 36,                //start point of search within array
      j = 18,                //first index shift amount
      n = 71,                //more passes than any search needs
      x = -1;                //returned index number of found tag
  while(k >= 0 && k < 71 && n-- > 0) {  //while in range, not stuck
    int c,                   //character within captured tag name
        d,                   //character within tag name in array
        f = 1;               //indicates whether a match or not
    char *p = t,             //ptr to a char in the captured tag
         *q = *(tags + k);   //ptr to a char in the array tag
    do {                     //for each char of the shorter tag
      if((c = *p++) != (d = *q++)) {    //if chars not same
        if(c > d)            //if char in tag > char in array tag
          k += j;            //abort test, shift up the array
        else
          k -= j;            //otherwise shift down the array
        if(j > 1)            //if the shift increment > 1 then
          j >>= 1;           //halve it, otherwise leave it = 1
        f = 0;               //set flag to indicate a mis-match
        break;               //break out of the do loop
      }
    } while(c != 0 && d != 0);   //end loop if either char = null
    if(f) {                  //if whole tag matched
      x = k;                 //set index number of the found tag
      break;                 //break out of the while loop
    }
  }
  return(x);                 //return index number of given tag
}                            //x = -1 if the tag was not found


Tag Call

We must now build a function TagCall() to call a special function to deal with the particular captured tag once it has been found.

For this we need an array *funs[] which is an array of pointers-to-functions. Each of the functions thus pointed to takes a 'char' pointer and two integers as input parameters and provides no return value. The values with which the array is initialised are the names of the functions which perform the appropriate processing in response to each respective HTML tag name.


void TagCall(
  char *p,      //points to the captured tag string
  int k,        //index number of the captured tag
  int o,        //offset of params from start of tag
  int e         //0 = it's a start tag, 1 = end tag
) {
  static void (*funs[])(char * , int, int) = {
    Fa, Faddress, Fapplet, Farea, Fb, Fbase, Fbasefont,
    Fbgsound, Fbig, Fblink, Fblockquote, Fbody, Fbr,
    Fcaption, Fcenter, Fcite, Fcode, Fcomment, Fdd, Fdfn,
    Fdir, Fdiv, Fdl, Fdt, Fem, Ffont, Fform, Fframe,
    Fframeset, Fhead, Fhn, Fhr, Fhtml, Fi, Fimg, Finput,
    Fisindex, Fkbd, Fli, Flink, Flisting, Fmap, Fmarquee,
    Fmenu, Fmeta, Fnextid, Fnobr, Fnoframes, Fol, Fp,
    Fparam, Fplaintext, Fpre, Fsamp, Fselect, Fsmall,
    Fstrike, Fstrong, Fsub, Fsup, Ftable, Ftd, Ftextarea,
    Fth, Ftitle, Ftr, Ftt, Ful, Fvar, Fwbr, Fxmp
  };
  (*(funs + k))(p, o, e);     //call appropriate tag function
}

The rather outlandish function call following the array calls the function whose address is the content of the kth element of the array funs[]. TagCall() would have to be called from TagFilter() immediately after TagSearch() had been called.
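The mechanism is perhaps easier to see in a cut-down form with only three handlers. In the sketch below the handler bodies, the two counters and the name Dispatch are hypothetical and exist only to make the effect of each call visible. The range check on k is an added precaution which TagCall() does not have.

```c
int depth;     /* hypothetical state: nesting depth of <b> */
int breaks;    /* hypothetical state: number of <br> tags seen */

/* Three stand-in tag handlers sharing the same signature. */
static void Fb(char *tag, int o, int e) {
  (void)tag; (void)o;
  depth += e ? -1 : 1;         /* start tag opens, end tag closes */
}
static void Fbr(char *tag, int o, int e) {
  (void)tag; (void)o; (void)e;
  breaks++;
}
static void Fp(char *tag, int o, int e) {
  (void)tag; (void)o; (void)e; /* does nothing in this sketch */
}

/* Call the handler for tag number k: 0 = b, 1 = br, 2 = p. */
void Dispatch(char *tag, int k, int o, int e) {
  static void (*funs[])(char *, int, int) = { Fb, Fbr, Fp };
  if(k >= 0 && k < (int)(sizeof funs / sizeof funs[0]))
    funs[k](tag, o, e);        /* same as (*(funs + k))(tag, o, e) */
}
```

The form funs[k](tag, o, e) is equivalent to the longer (*(funs + k))(tag, o, e). C allows a pointer-to-function to be called directly, without an explicit dereference.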

To speed things up, let us see if we can incorporate TagCall() into TagSearch() and thus simplify the whole process. All we need to do is to add the initialised array *funs[] to TagSearch(), then replace the statement x = k; with the fancy function call above. Note that x is no longer needed to hold the index number. In the version below it carries the tag function's return value instead.

The augmented version of TagSearch(), which has been renamed more appropriately TagInt (Tag Interpreter), is shown below.


int TagInt(    //THE TAG INTERPRETER FUNCTION
  char *t,     //pointer to start of captured tag string
  int o,       //offset of start of parameters from start of tag
  int e        //0 indicates a start tag, 1 indicates an end tag
) {
  static char *tags[] = {  //array of all 71 valid HTML tag names
    "a", "address", "applet", "area", "b", "base", "basefont",
    "bgsound", "big", "blink", "blockquote", "body", "br",
    "caption", "center", "cite", "code", "comment", "dd", "dfn",
    "dir", "div", "dl", "dt", "em", "font", "form", "frame",
    "frameset", "head", "hn", "hr", "html", "i", "img", "input",
    "isindex", "kbd", "li", "link", "listing", "map", "marquee",
    "menu", "meta", "nextid", "nobr", "noframes", "ol", "p",
    "param", "plaintext", "pre", "samp", "select", "small",
    "strike", "strong", "sub", "sup", "table", "td", "textarea",
    "th", "title", "tr", "tt", "ul", "var", "wbr", "xmp"
  };
  static int (*funs[])(char *, int, int) = {
    Fa, Faddress, Fapplet, Farea, Fb, Fbase, Fbasefont,
    Fbgsound, Fbig, Fblink, Fblockquote, Fbody, Fbr,
    Fcaption, Fcenter, Fcite, Fcode, Fcomment, Fdd, Fdfn,
    Fdir, Fdiv, Fdl, Fdt, Fem, Ffont, Fform, Fframe,
    Fframeset, Fhead, Fhn, Fhr, Fhtml, Fi, Fimg, Finput,
    Fisindex, Fkbd, Fli, Flink, Flisting, Fmap, Fmarquee,
    Fmenu, Fmeta, Fnextid, Fnobr, Fnoframes, Fol, Fp,
    Fparam, Fplaintext, Fpre, Fsamp, Fselect, Fsmall,
    Fstrike, Fstrong, Fsub, Fsup, Ftable, Ftd, Ftextarea,
    Fth, Ftitle, Ftr, Ftt, Ful, Fvar, Fwbr, Fxmp
  };
  int k = 36,               //Start search half way up array
      j = 18,               //first index shift amount
      n = 71,               //more passes than any search needs
      x = 0;                //default return from tag function
  while(k >= 0 && k < 71 && n-- > 0) { //while in range, not stuck
    int c,                  //character within captured tag name
        d,                  //character within tag name in array
        f = 1;              //indicates whether a match or not
    char *p = t,            //ptr to a char in the captured tag
         *q = *(tags + k);  //ptr to a char in the array tag
    do {                    //for each char of the shorter tag
      if(( c = *p++) != (d = *q++)) {       //if chars not same
        if(c > d)           //if char in tag > char in array tag
          k += j;           //abort test, shift up the array
        else
          k -= j;           //otherwise shift down the array
        if(j > 1)           //if the shift increment > 1 then
          j >>= 1;          //halve it, otherwise leave it = 1
        f = 0;              //set flag to indicate a mis-match
        break;              //break out of the do loop
      }
    } while(c != 0 && d != 0);    //end loop if either char null
    if(f) {                       //if whole tag matched
      x = (*(funs + k))(t, o, e); //call tag function
      break;                      //break out of outer while loop
    }
  }
  return(x);                //pass back the tag function's return
}


Character Entity Capture and Substitution

Some of the older links in the Internet can only deal with 7-bit characters. HTML has a set of 'long-hand' character entities to represent characters beyond the ASCII value 127. Modern browsers substitute and display the actual 8-bit characters represented by these character entities.

Substituting ASCII Value For Character Entity

In the following C function an array is declared which contains pointers to character strings. Each string pointed to contains one of the valid HTML character entities. A second array is declared. This is an array of small integers. Each element of this second array contains the 8-bit 'ASCII' (ISO 8859-1) value of the character entity pointed to by the corresponding element of the first array.

The function receives a pointer s to a string containing a given character entity. It then searches the first array for this given character entity. It does this using a binary slice search method. On finding the given character entity in the array it notes its position within the array, that is, its index number. It then looks in the corresponding element of the second array to find the given character entity's 8-bit value. This it then returns to the function which called it.

int GetEnt(char *s) {
  const char *CE[] = {       //valid HTML character entities
    "AElig",  "Aacute", "Acirc",  "Agrave", "Aring",  "Atilde",
    "Auml",   "Ccedil", "ETH",    "Eacute", "Ecirc",  "Egrave",
    "Euml",   "Iacute", "Icirc",  "Igrave", "Iuml",   "Ntilde",
    "Oacute", "Ocirc",  "Ograve", "Oslash", "Otilde", "Ouml",
    "THORN",  "Uacute", "Ucirc",  "Ugrave", "Uuml",   "Yacute",
    "aacute", "acirc",  "acute",  "aelig",  "agrave", "amp",
    "aring",  "atilde", "auml",   "brvbar", "ccedil", "cedil",
    "cent",   "copy",   "curren", "deg",    "divide", "eacute",
    "ecirc",  "egrave", "eth",    "euml",   "frac12", "frac14",
    "frac34", "gt",     "iacute", "icirc",  "iexcl",  "igrave",
    "iquest", "iuml",   "laquo",  "lt",     "macr",   "micro",
    "middot", "nbsp",   "not",    "ntilde", "oacute", "ocirc",
    "ograve", "ordf",   "ordm",   "oslash", "otilde", "ouml",
    "para",   "plusmn", "pound",  "quot",   "raquo",  "reg",
    "sect",   "shy",    "sup1",   "sup2",   "sup3",   "szlig",
    "thorn",  "times",  "uacute", "ucirc",  "ugrave", "uml",
    "uuml",   "yacute", "yen",    "yuml"
  };
  const unsigned char ce[] = {
    198, 193, 194, 192, 197, 195, 196, 199, 208, 201, 202, 200,
    203, 205, 206, 204, 207, 209, 211, 212, 210, 216, 213, 214,
    222, 218, 219, 217, 220, 221, 225, 226, 180, 230, 224,  38,
    229, 227, 228, 166, 231, 184, 162, 169, 164, 176, 247, 233,
    234, 232, 240, 235, 189, 188, 190,  62, 237, 238, 161, 236,
    191, 239, 171,  60, 175, 181, 183, 160, 172, 241, 243, 244,
    242, 170, 186, 248, 245, 246, 182, 177, 163,  34, 187, 174,
    167, 173, 185, 178, 179, 223, 254, 215, 250, 251, 249, 168,
    252, 253, 165, 255
  };
  int k = 50,       //half-way point to start search within array
      j = 25,       //first index shift amount
      n = 100,      //more passes than any search needs
      x = 0;        //default value if the entity is not found

  while(k >= 0 && k < 100 && n-- > 0) {  //while in range, not stuck
    int c,                    //char within captured char entity
        d,                    //char in char ent string in array
        f = 1;                //indicates whether a match or not
    const char *p = s,        //ptr to a char in captured entity
               *q = *(CE + k);  //ptr to a char in the array string
    do {                      //for each char of shorter string
      if((c = *p++) != (d = *q++)) {    //if chars not the same
        if(c > d)             //if char in entity > char in array
          k += j;             //abort test and shift up the array
        else
          k -= j;             //otherwise shift down the array
        if(j > 1)             //if the shift increment > 1 then
          j >>= 1;            //halve it, otherwise leave it = 1
        f = 0;                //set flag to indicate a mis-match
        break;                //break out of the do-loop
      }
    } while(c != 0 && d != 0);//end loop if either char is a null
    if(f) {                   //if there was a match
      x = *(ce + k);          //get char's ISO 'ASCII' value
      break;                  //break out of the main while loop
    }
  }
  return(x);                  //return the entity's ASCII value
}

Capturing an HTML Character Entity

Before an extended character's ASCII value can be substituted for a given HTML character entity, the character entity itself must first be captured from the text stream of the incoming HTML document file. The following function is a filter. It captures each HTML character entity as it arrives. It then calls GetEnt(s) above to get the entity's 8-bit ASCII value. This it then passes to its outgoing text stream.

Character entities also have a numeric equivalent. This is typed as an ampersand and a # followed by up to 3 digits followed by a semicolon. An example would be &#182;. This is the same as the named entity &para; which creates the ¶ (paragraph) sign. These numeric entities are worked out by the following function without calling GetEnt(s).

int CharEntConv(int c) {  //next character from input text stream
  static char CE[8],      //array for captured character entity
              *p;         //ptr to captured char entity string
  static int  cc;         //count of No of chars so far in entity

  if(c == '&') {          //start delimiter for HTML char entity
    p = CE;               //set p to start of char entity array
    cc = 1;               //set that we are within a char entity
    c = 0;                //the delimiter itself is not text
  }
  else if(cc > 0) {       //if we are currently in a char entity
    if(c == ';') {        //end-delimiter for an HTML char entity
      int a;              //current digit of a numeric entity
      *p = '\0';          //put null at end of entity string
      cc = 0;             //set that we are in normal text
      p = CE;             //set pointer to start of char entity

    /* If the first character of the entity is a # then it
       is in numeric format, so assemble the following digit
       characters into a numeric ASCII 'quantity', c.
       Otherwise the character entity is an acronym, so use
       GetEnt() to get its ASCII value instead. */

      if(*p++ == '#') {                //If first char is a #
        c = 0;                         //initialise ASCII value
        while((a = *p++) != '\0')      //until null terminator
          c = c * 10 + (a - '0');      //multiply by 10, add digit
      } else c = GetEnt(CE);           //First char not a #
    }
    else {           //we are in the midst of a character entity
      if(cc < 8) {   //character entity still within length limit
        *p++ = c;    //add char to character entity buffer
        cc++;        //count the character just stored
      }
      else           //if the character entity string is too long
        cc = 0;      //ignore the whole character entity
      c = 0;         //return a null character
    }
  }                  //else return the ISO character
  return(c);         //equivalent of the char entity
}
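
The numeric branch can be checked on its own with a small helper such as the one sketched below. The name NumEnt is hypothetical. Unlike the filter above, it rejects any text which is not a # followed only by decimal digits. This is an extra check added for the example.

```c
/* Hypothetical helper: convert the text of a numeric entity, without
   the leading '&' and trailing ';' (e.g. "#182"), to its character
   code. Returns 0 if the text is not '#' followed only by digits. */
int NumEnt(const char *s) {
  int v = 0;
  if(*s++ != '#' || *s == '\0')  /* must start with # and have digits */
    return 0;
  for(; *s != '\0'; s++) {
    if(*s < '0' || *s > '9')     /* reject anything but a digit */
      return 0;
    v = v * 10 + (*s - '0');     /* shift one decimal place, add digit */
  }
  return v;
}
```

NumEnt("#182") returns 182, the same value that GetEnt() holds for the named entity para.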

© 2000 Robert John Morton