Next: Binary Trees Up: I/O of Data Structures Previous: Arrays

Strings and String Fields

For this case study, we'll investigate how we can read and write records containing information represented by strings. In a general sense, strings are simply one-dimensional arrays, and can be handled as such, using the methods that we have already seen. Typically, however, there is useful information (such as information about the representation of the string) encoded in the string itself, and therefore we will consider strings as a separate case.

Delimited Strings

Imagine that we have a set of records that we wish to store in a file. Each record consists of a fixed number of fields, and each field consists of a string. Since some (or all) of the fields are of varying length, however, each record itself is represented by a string. How can we tell where each record and each field in the record begins and ends?

There are a number of approaches to solving this problem. The most common one, at least in the UNIX world, is to have each line in the file represent a single record, and have the boundaries between each field in the record denoted by a special string (usually just a single character like ":") called a field delimiter (or just a delimiter). For example, here's what a small file that represents a phone book might look like:

	Ellard:Dan:496-6246
	Ellard:Penny:643-9644

Writing records to a file in this format is simple; for each record, just do something along the lines of:

        fprintf (fout, "%s:%s:%s\n", lastname, firstname, telephone);

Figure 6.3: Code to split a string into fields.

Figure 6.3 is an example of an implementation of some code to read in a file of this form. You should note the following things about this code:

It allows "comments" to be intermixed with the data. A comment line is any line that has a '#' as its first character.
The logic for figuring out where each field begins and ends could be done with a single call to sscanf, and in fact the entire process of reading in each line and chopping it up into fields could be accomplished with a single call to fscanf. It might be interesting for you to figure out what the format string you would need to pass in to fscanf would be in order to accomplish this. Alternatively, finding the start and end of each field individually can be accomplished via repeated calls to scanf (or fscanf).

Given that I could have used fscanf (or some member of the scanf family, or strtok), it may seem odd that I did not. (In general, doing things the easy way is always my choice.) The reason is that these functions, while handy for many simple purposes, are hard to extend. In the next section, we'll extend these ideas significantly.

Escaped Strings

In the code in figure 6.3, there is no way to represent a field that contains the delimiting character itself. For example, if you ever try to enter a last name that has a ':' in it, our phone book routines will fail. In some applications, this is not a problem; if there exists a delimiter character that can be chosen that will not, due to some property of the data, ever appear in the data, then this approach is sufficient.

In the more general case, however, it might not be possible to choose a delimiter character at all, or at the very least it may not be possible to choose one that will work for all applications. Therefore, in this section we will discuss how delimited strings can be implemented in a manner which allows all of the characters in the character set to appear in any field.

The usual way to do this is to escape the field delimiting character by adding a special character or string (the escape character) before each character that needs to be "protected". You have already seen this technique in practice; printf uses the % character to escape the meaning of other characters.

Unfortunately, often it is unacceptable to use exactly the same scheme as printf (which uses % as the escape character, and escapes % itself with %%), since we may well want to allow two adjacent instances of the delimiter character to signify an empty field. For example, we might want ``::'' to signify an empty field, rather than a single ``:'' (but of course, we still want a way to represent a single ``:''). What can be done is to use another character or string as the escape character, instead of using the delimiter as the escape.

The character that is most typically used in UNIX is backslash (\). So, if we ever want to represent the name of some eccentric named "foo:bar" in our system, his name could be stored as "foo\:bar".

Of course, this immediately leads to another problem (that might not be immediately obvious). What if we then have to deal with another eccentric whose last name ends in "\"? The answer, again, is that we need to escape the escape character itself! Luckily, we can use the escape character to escape itself (unlike the delimiter character, which cannot escape itself), so there the saga ends.

In our example code, we actually have two delimiting characters: the ``:'', and the newline. If we want to allow the user to embed newlines in their strings, then we can still accomplish this with the same method of escaping the newline- but we'd need to modify the code somewhat, because it uses fgets everywhere, and fgets treats newlines specially. However, despite the fact that it would require a little more coding to get working, algorithmically we now know exactly what we would need to do.

Annotated Strings

In many cases, it may be necessary to escape sequences of multiple characters. These substrings are annotations of the text. (In different contexts, different terms are used to describe this concept, but I'm going to use this single term to try to be consistent).

For example, in HTML the formatting commands are embedded directly in the text of the web pages, as annotations. In HTML, annotations generally begin with a "<" and end with a ">". In C, you can think of comments as annotations that begin with /* and end with */ (although of course this fails to consider what happens when these sequences occur inside C strings). In these cases, the start and ending delimiters of annotations are different, but in some applications the same character is used to delimit both the start and end of an annotation (as double quotes are used to represent both the start and end of strings, for example).

Annotated strings are slightly more complicated to deal with than escaped strings, because there is (in the general case) no way to know ahead of time how long each annotation is. However, a state machine can be used to process the string and keep track of what sections represent normal text and what sections represent any annotations.

Note that annotated strings are often escaped as well: otherwise, there would be no way to represent the annotation string itself in the text. For example, in HTML, the escape character is "\&".

Note that although HTML is a convenient (and ubiquitous) example of annotated text, it is also not a very good example, and certainly not a clean design. Instead of simply escaping the delimiters, each has its own special escape (for example, the string "<" is the escaped form of "<"). In addition, despite the fact that the escape character can be used to escape itself, the HTML escape character is escaped differently, as "&".

Next: Binary Trees Up: I/O of Data Structures Previous: Arrays

Dan Ellard
Mon Jul 21 22:30:59 EDT 1997