Next: Strings and String Fields Up: I/O of Data Structures Previous: An Introduction

Arrays

The first case study that we'll investigate is how to write and read arrays. We will begin with fixed-size one-dimensional arrays of integers, and then generalize our methods until we are able to deal with arrays of unknown dimension, extent, and type.

Arrays of Fixed Size

For the one-dimensional case, the solution is very easy- in C, one-dimensional arrays are stored as a contiguous block of memory. The standard I/O library makes it easy to read and write such blocks (i.e. using fread and fwrite). The only complication can occur when the arrays are of unknown extent or type, but these situations are discussed in later sections.

The two-dimensional case is more interesting, however, and is worth considering for a moment. If we were always reading and writing two-dimensional arrays of the same size (for example, NUM_ROWS by NUM_COLS, where NUM_ROWS and NUM_COLS are fixed numbers that everyone has agreed to), then reading and writing arrays would be a simple matter, as illustrated in figure 6.1.

Figure 6.1: Code to read and write 2-dimensional arrays of fixed size.

Arrays of Unknown Extent

For many purposes, arrays of a fixed size are sufficient (at least until someone changes NUM_ROWS or NUM_COLS). However, in many cases, it is impossible (or impractical) to know ahead of time what the extents of the array actually are. Therefore, something we need to store (in addition to just the data in the array) is the size of the array itself.

One way to do this is to store the number of rows and columns of the array as the first two integers in the file, followed by the data, row by row.

A sketch of an implementation using the array2d_t structure from the Pointers and Malloc handout is shown in figure 6.2. This code omits all of the error checking that is generally necessary when using I/O, but gives the basic algorithm.

Figure 6.2: Code to read and write 2-dimensional arrays of unknown size. Note the complete lack of error checking.

The first two integers, which represent the number of rows and columns, form a header for this serialized array; they tell the reader the information that they need to know in order to read the array. More information can be added to the header, if necessary; for example, an integer representing the type of each data element (char, int, or any other C type) could be used if the type of the array was unknown ahead of time.

Similarly, information about the format of the data elements themselves (size and byte-order) can be stored in the header, so that programs on machines with differently sized integers or different binary representations for integers can read the data correctly.

Arrays of Unknown Dimension

The algorithms from the previous sections are jury-rigged to work with two-dimensional arrays. They can be extended to work with arrays of higher dimension, using similar techniques, but there is another problem yet to be solved- how can we deal with an array that has an unknown number of dimensions?

One problem that would immediately arise is how to represent such an object in C in the first place- all the arrays we have seen so far have had a static number of dimensions, known at compile-time, and this is generally the way C programs are written. It is possible to write code that does otherwise, but I will not pursue this topic in this chapter. Instead of focusing on the C tricks necessary to use such arrays in C, I will focus on the methods used to read and write these arrays.

If the number of dimensions is known to be small, then a fixed-size header can be used, similar to the header used in the previous section to handle arrays of unknown extent. This header can list the extent of each dimension, with a sentinel value of zero to indicate the highest dimension, if it was less than the limit.

This technique works well if the number of dimensions can be bounded. However, this is not always the case. In this case, the approach is to use a dynamic header- the list of dimensions and their extents is itself a dynamic 1-dimensional array, represented in the manner shown in the previous section. In this representation, the input file contains:

Any information about the byte-order, word-size, etc of the represented data.
An integer specifying the number of dimensions.
An array of integers representing the extent in each dimension.
The data itself.

Arrays of Unknown Type

We have made a pair of simplifying assumptions in our discussion up to this point: first, we have assumed that both the reader and the writer implicitly know what the type of the array is (i.e. int, double, etc), and second that both the reader and writer use the same representation for these types.

This is not always the case. For example, different machines use different byte orders or different numbers of bytes to represent multi-byte objects, and different machines may use different binary representations to represent bytes (including bytes of different sizes).

In order to successfully read data stored in the native binary representation of the writer, the reader must know what that format is (and how a translation into its own native format can be performed). Alternatively, all writers (no matter what their native representation) can write in a canonical format that all readers (no matter what their native representation) can read.

If writers always write in their own native format (and the readers perform whatever translation is necessary), then a marker must be added to the data that indicates exactly what representation was used to write it, so that the reader can choose the appropriate reading routines. This marker must be chosen in such a way (and written in a canonical, universally agreed-upon manner) so that all readers can unambiguously determine the way that the file was written.

Data Definition Languages

We began this section with an example where the reader and the writer implicitly agreed on the dimension, size, and type of the arrays, as well as the underlying representation of that type. By the final example in the previous section, all of the implicit assumptions had been removed, except one- the writer and reader have implicitly agreed that what is being written and read are arrays.

In some applications, even this assumption is invalid- the reader may have no idea what kind of thing the writer has left in the file, and may have to learn the type and format of the data from the data itself. This problem can be solved by adding still more information to the header, so that the data includes a description of its format, explaining how it should be interpreted by any readers. This description, however, must be in a form that is implicitly shared by the readers and writers- if the writer doesn't describe the data in a way that the reader can understand, then the data is useless to the reader.

Next: Strings and String Fields Up: I/O of Data Structures Previous: An Introduction

Dan Ellard
Mon Jul 21 22:30:59 EDT 1997