Design Notes for 800.c

This document discusses several aspects of the 800.c program that was part of assignment 8.

The reader is assumed to be familiar with this assignment, so there's not a lot of discussion of the general structure of the program here; this document concentrates on several interesting parts of the code.


Contents


Converting from Letters to Numbers

The focus of this section is the part of the algorithm that converts the strings that the user types in (which may consist of any combination of letters and numbers) into strings that consist entirely of numbers (which can be searched for, via binary search, in the array of 800 numbers) according to the mapping used by touch-tone telephones.


A description of the problem

For the sake of argument, I'll imagine that the way that we've chosen to write this piece of the code is described by the following design:

	For each character in the string buf
		if the character is "alphabetic", then
			convert the character to uppercase,
			find the corresponding "digit",
			and replace the character with this digit.

This will all take place in one loop, except that the algorithm for finding the corresponding digit will be implemented in a separate function. For example, the code might look like:

	/*
	 * Given integer i, and string buf:
	 */

	for (i = 0; i < strlen (buf); i++) {
		if (isalpha (buf [i])) {

			/*
			 * Note-- on a few (buggy) systems, it is an
			 * error to convert a letter to uppercase
			 * without checking first to make sure that it
			 * is lowercase, because the conversion
			 * assumes that the character is a letter and
			 * is lowercase...  and if it isn't then the
			 * results are somewhat unpredictable.  This
			 * code assumes that the version of toupper
			 * we're using is robust.
			 */

			buf [i] = toupper (buf [i]);

			buf [i] = uppercase_letter_to_digit (buf [i]);
		}
	}

All that remains is to implement the function uppercase_letter_to_digit. This can be done in a number of ways. I'll discuss several that I think are appropriate, and then a few that I think are not appropriate.


A table lookup

The first algorithm we'll explore is a table lookup. I personally find this approach to this problem to be the most elegant- short, simple, and easy to get right.

In order to do table lookup, we must first have a table. We'll construct an array that has letters (or something related to letters) as its index, and the corresponding numbers (as ASCII digits) as its values. Letters that do not correspond to any numbers on the telephone keypad will not be converted.

The first task is to figure out how to compute the index into the array for each letter. Luckily, the ASCII character set has the (extremely useful) property that the codes that represent uppercase letters are arranged alphabetically and contiguously. This means, for example, that if we know the ASCII code for 'A', then we can compute the ASCII code for 'B' by simply adding 1 to the code for 'A', because 'B' immediately follows 'A' in alphabetical order. Similarly, the code for 'D' will be 3 more than the code for 'A', and so on, up until 'Z', which will be 25 more than the code for 'A'. (Similar properties hold for the lowercase letters and for digits, which is also extremely useful- for example, we used this property of digits in bignum_const from assignment 5).

This gives us a way to easily translate the letters 'A' through 'Z' to the numbers 0 through 25-- simply subtract the code for 'A'. For example, if c is a character representing some uppercase letter, then the following C expression will set n to a number from 0 to 25, representing the position in the alphabet of c:

	n = c - 'A';

Now all that remains is to construct a table, such that the entry in position zero (for 'A') corresponds to the code for 'A' on the telephone keypad, and so forth. This table is the following (assuming that Z and Q are not mapped):

letter		A B C D E F G H I J K  L  M  N
index		0 1 2 3 4 5 6 7 8 9 10 11 12 13
character	2 2 2 3 3 3 4 4 4 5 5  5  6  6

letter		O  P  Q  R  S  T  U  V  W  X  Y  Z
index		14 15 16 17 18 19 20 21 22 23 24 25
character	6  7  Q  7  7  8  8  8  9  9  9  Z

Since all of the elements of this table are simply characters, the table itself can be a string. In C, we can implement the table as follows:

	char	*table	= "2223334445556667Q77888999Z";

Now we can write our uppercase_letter_to_digit function:

char		uppercase_letter_to_digit (char c)
{
	char	*table	= "2223334445556667Q77888999Z";

	return (table [c - 'A']);
}


A word about error checking

Note that the previous version of our function does no error checking whatsoever. Due to the way that this function is used in our program, we are very sure that the input variable c is valid uppercase letter. However, it is nearly always better to be safe than sorry, so it would be better style to add some error checks to this code. This section discusses error checking briefly, in relation to this particular implement; it is not discussed in detail for any of the other approaches, although it is mentioned in passing.

An important decision that must be made, when doing any sort of error checking, is what to do when an error is detected.

One way to make this function robust to make it capable of handling any possible input, at least in some sense. For this function, this is easy, and only requires a small redefinition of what the function is supposed to do, to define what it does when handed a non-uppercase letter. For example, if c is not uppercase, then perhaps we should just leave it alone:

char		uppercase_letter_to_digit (char c)
{
	char	*table	= "2223334445556667Q77888999Z";

	if (isupper (c)) {
		return (table [c - 'A']);
	}
	else {
		return (c);
	}
}

We might want our function to be even more crabby than that; for example, in this program, if upper_letter_to_digit is called with a non-uppercase letter, then it means that something has gone wrong somewhere else, and we might as well find out about it now, before any damage is done. The following function uses assert to enforce this; the program will exit (with a possibly useful warning message) if the assumptions of this function are not met:

char		uppercase_letter_to_digit (char c)
{
	char	*table	= "2223334445556667Q77888999Z";

	assert (isupper (c));

	return (table [c - 'A']);
}


A table search

While table lookup (as described previously) may be the most elegant algorithm, table search is probably the most intuitive and natural- because it is how people (at least, everyone I know) actually does this conversion between letters and numbers. When confronted with a telephone number that is given in the form of some cute word or slogan, I find myself staring at the telephone keypad, searching for the digit that corresponds to each letter, as listed on the keys. (If I had to do this more than once every few weeks, I suppose I'd commit the string used in the previous algorithm to memory- but I don't, so I haven't.) It is not hard to express this search as an algorithm, and then implement it.

Once again, we construct a table, but this time the contents of the table resembles (in some sense) the telephone keypad itself:

	Digit  Letters
	-----   -----
	  0
	  1
	  2	A B C
	  3	D E F
	  4	G H I
	  5	J K L
	  6	M N O
	  7	P R S
	  8	T U V
	  9	W X Y

We can express this in C as an array of strings, where the index of each string is digit, and the strings contain all of the letters corresponding to that digit's key:

	char	*table [] = {
			"", "",
			"ABC", "DEF", "GHI", "JKL",
			"MNO", "PRS", "TUV", "WXY"
		};

To make the table lookup a little easier, we'll add a NULL entry to the end of the table as a sentinel value.

We can check each of these table entries in turn to see if the character we're looking for resides there by using the strchr function. Once we know the index, we'll convert this index into once of the characters '0' through '9' via the same sort of transformation that we used previously (in this case, the opposite of what we used earlier in the semester in bigadd_const). This gives the following implementation:

char		uppercase_letter_to_digit (char c)
{
	char	*table [] = {
			"", "",
			"ABC", "DEF", "GHI", "JKL",
			"MNO", "PRS", "TUV", "WXY",
			NULL
		};
	int	i;

	for (i = 0; table [i] != NULL; i++) {
		if (strchr (table [i], c)) {
			return ('0' + i);
		}
	}

	/*
	 * If we've failed to find any match, then just return c
	 * itself.
	 */

	return (c);
}

An equation

A popular algorithm (at least in the assignments that I saw) involved coming up with an equation that described the relationship between uppercase letters and digits on the telephone keypad. This is boarderline bad style, however- unless the relationship is quite well commented, it's not at all clear what is going on in the code, and was a large source of magic constants (which are always a no-no).

This approach also has the disadvantage that it is hard to modify. Of course, for the 800 application, this is not likely to be much of an issue- telephone keypads aren't likely to change any time soon. However, if you wanted to reuse some of this algorithm (or at least the design behind it) for some other sort of keypad, you'd have to come up with an entirely new relation, and there may not always be such a convenient one. With the table-oriented approaches, dealing with a new keypad or character set is simply a matter of changing the table; nothing else has to change in the slightest. In fact, you could write a utility function that did this translation and took the table itself as a parameter, and so you could use the same function for any number of different mappings.

However, since it seems so popular, it's worth discussing the "equation" approach.

In general, the relation is simple: each digit, starting at '2', gets the next three letters (starting with 'A'). A glitch occurs at '7', however, due to skipping over 'Q', and this wrecks the simplicity of the relationship. 'Z' is a different kind of nuisance- we have to remember that it doesn't correspond to any digit.

If it wasn't for 'Q' and 'Z', we'd just have the following relationship:

	(position of digit) == ((position of letter) / 3) + 2

In this relationship, "position of" represents the position of a digit in the sequence 0..9 or the position of a letter in the sequence A..Z. Again taking advantage of the way the ASCII codes are arranged, we can write this as:

	digit - '0' == ((letter - 'A') / 3) + 2

So if we have the letter, we can find the digit by doing a little algebra:

	digit = '0' + ((letter - 'A') / 3) + 2;

If this was there was to the problem, this would be a fine solution: short, simple, and easy to get right. However, now we must deal with 'Q' and 'Z', in order to fit the reality of the telephone keypad.

Handling 'Z' is relatively simple, because it hangs off the end; it is an exception to the relationship, but doesn't break the relationship:

	/*
	 * If the letter is a 'Z', then leave it alone.
	 */

	if (letter == 'Z') {
		return ('Z');
	}
	else {
		return ('0' + ((letter - 'A') / 3) + 2);
	}

Dealing with 'Q' is hairier. One approach is to break the relationship into two relationships: one for letters before 'Q', and another for those after. The relation for letters after 'Q' is similar to the other, except that the letters are "off by one". There are a number of ways that this can be expressed; here is one that seems relatively clean:

	/*
	 * If the letter is a 'Z' or 'Q', then leave it alone.
	 */

	if (letter == 'Z' || letter == 'Q') {
		return (letter);
	}
	else if (letter < 'Q') {
		return ('0' + ((letter - 'A') / 3) + 2);
	}
	else {
		return ('0' + (((letter - 1) - 'A') / 3) + 2);
	}

We'd probably write this a little differently in our function, however, in order to make it a little more robust:

char		uppercase_letter_to_digit (char c)
{
	/*
	 * If c is a 'Z' or 'Q', then leave it alone.
	 */

	if (c == 'Q' || c == 'Z') {
		return (c);
	}
	else if (c >= 'A' && c < 'Q') {
		return ('0' + ((c - 'A') / 3) + 2);
	}
	else if (c > 'Q' && c < 'Z') {
		return ('0' + (((c - 1) - 'A') / 3) + 2);
	}
	else {
		return (c);
	}
}

One thing to note about this function is that the first condition (which deals with 'Q' and 'Z') is actually completely redundant; any instances of 'Q' and 'Z' would otherwise be handled identically by the else statement at the end. I prefer, for stylistic reasons, to structure the code this way, so that all letters are handled explicitly (and the else is just for all other non-letter characters). It also helps to clarify (at least to me) exactly why some of the relational operators in the conditionals are < or > instead of <= or >= (or vice versa)- it's because these two letters are special.

It's also true that some of the other conditions are mildly redundant: for example, in the second condition, it is not strictly necessary to write letter < 'Q'; we could have written this as letter <= 'Q' instead and gotten away with it. Writing it this way (being careful with each condition) does allow a nice property, however- we can reorder these conditions arbitrarily and still get exactly the same results. This is a very nice property for code to have, since it makes it easier to modify and extend in the future, as well as easier to understand right now.

This code, of course, is a mess of magic numbers, and in this form cannot avoid style deductions (or peer ridicule). A few macros help out here quite a bit:

#define	LETTERS_PER_DIGIT	3
#define	DIGITS_OFFSET		2
#define	DIGIT_POS_TO_ASCII(d)	('0' + (d))
#define	ASCII_TO_LETTER_POS(l)	((l) - 'A')

But the result is still a little muddled, and 'Q' and 'Z' are still magic all by themselves, and this is hard to avoid.


Several bad approaches


The Binary Search

At the core of the 800 application is the binary search routine that actually does the searching. A binary search function for an array of integers was provided as part of the assignment; it was necessary to modify this function to work with strings as part of this assignment.

Some people modified their binary search functions to do quite a bit more than just a binary search through an array of strings, however- some designs delegated the work of translating the letters to digits, trimming the input string, etc to this function. One implementation also implemented the user interface inside the binary search function! For quick-and-dirty programming, this is not too terrible, as long as it gets the right answer. As a design, however, it violates an important principle: each function should do one thing, and do it well.

In this case, the one thing that the binary search function needs to do is search in a sorted array of strings for a string that has a given prefix. Such a function is of general use. A function that does binary search and character conversion and who knows what all else is entirely specific to the 800 application, and cannot be reused for other programs without a great deal of work.


Please report and errors or inconsistencies in this document to the author.

Dan Ellard - ellard@deas.harvard.edu