
Advanced C 1992


Part II • Managing Data in C

are read, the program does its comparisons (taking into consideration possible end-of-file conditions), and writes the correct record. When the program reaches the end of both input files, it closes all the files and ends. It is a simple program that works quickly.

When writing a merge function, you must consider that one file may be (and usually is) shorter than the other. The merge program must be sure that the longer file’s records are written to the output.

Purging

One often needed (and hard to find) program is a purge program, which is used to delete duplicates (sometimes called de-dup) from a file. You might want to delete duplicates, for example, from a mailing list or a word list.

The PURGFILE.C program in Listing 10.3 performs two functions: it merges two files and it purges duplicates. Part of the program works like MERGFILE (Listing 10.2). Unlike MERGFILE, however, PURGFILE does not write duplicates to the output file.

Listing 10.3. PURGFILE.C.

/* PURGFILE, written 1992 by Peter D. Hipson
 * This program merges and purges in one step. If your
 * PC has memory models, you must compile with the
 * LARGE model.
 */

#include <stdlib.h>   // For standard functions
#include <stdio.h>    // Make includes first part of file
#include <string.h>   // For string functions
#include <process.h>  // For exit(), etc
#include <malloc.h>   // For malloc(), calloc(), realloc(), free()
#include <search.h>   // For qsort()...

int main(int argc, char *argv[], char *envp[]);
int compare(char **arg1, char **arg2);

#define BIGEST_LINE 512 /* The largest readable line */
#define NEED_RECORD 1   /* A record is needed from the file */


Data Management: Sorts, Lists, and Indexes

#define END_OF_FILE 2   /* This file is finished */
#define ALL_OK      3   /* No record needed, not EOF */

/* Although these variables are defined as external, they could
 * be defined inside the function or allocated dynamically,
 * depending on the program's needs and available memory.
 */

char szInput[BIGEST_LINE];
char szInput1[BIGEST_LINE];
char szInput2[BIGEST_LINE];

int main(
    int   argc,
    char  *argv[],
    char  *envp[]
    )
{
    FILE *InFile1;
    FILE *InFile2;
    FILE *OutFile;

    char szProgram[30];

    /* Strings for _splitpath(), which parses a file name */
    char szDrive[_MAX_DRIVE];
    char szDir[_MAX_DIR];
    char szFname[_MAX_FNAME];
    char szExt[_MAX_EXT];

    int i;
    int j;
    int nCompare = 0;
    int nFileOneStatus = NEED_RECORD;
    int nFileTwoStatus = NEED_RECORD;

    /* Use fprintf(stderr...) to force prompts and error messages to be
     * displayed on the user's screen regardless of whether the output
     * has been redirected.
     */

    _splitpath(argv[0], szDrive, szDir, szFname, szExt);

    strncpy(szProgram, szFname, sizeof(szProgram) - 1);
    szProgram[sizeof(szProgram) - 1] = '\0'; /* strncpy() may not terminate */

    if (argc <= 3)
    {
        fprintf(stderr,
            "\n"
            "%s -\n"
            "\n"
            "Peter's PURGEFILE: Merges two sorted files, \n"
            "purging all duplicate lines!\n"
            "\n"
            "inputfile1 and inputfile2 can be the same file,\n"
            "if you want to de-dup only one file.\n"
            "\n"
            "syntax: \n"
            "\n"
            "%s inputfile1 inputfile2 outputfile \n"
            "\n",
            szProgram,
            szProgram);

        return(16);
    }

    InFile1 = fopen(argv[1], "rt");
    InFile2 = fopen(argv[2], "rt");
    OutFile = fopen(argv[3], "wt");

    /* Stop if any file failed to open; fgets() and fputs() need valid files */
    if (InFile1 == NULL || InFile2 == NULL || OutFile == NULL)
    {
        fprintf(stderr, "%s: unable to open the input or output files\n",
            szProgram);
        return(16);
    }

    while (nFileOneStatus != END_OF_FILE ||
           nFileTwoStatus != END_OF_FILE)


    {
        while (nFileOneStatus == NEED_RECORD ||
               nFileTwoStatus == NEED_RECORD)
        {
            switch(nFileOneStatus)
            {
                case NEED_RECORD: /* Read a record */
                    if (fgets(szInput, sizeof(szInput), InFile1) == NULL)
                    {
                        nFileOneStatus = END_OF_FILE;
                    }
                    else
                    {
                        if (strcmp(szInput, szInput1) != 0)
                        {
                            strcpy(szInput1, szInput);
                            nFileOneStatus = ALL_OK;
                        }
                    }
                    break;

                case ALL_OK: /* Nothing needed */
                    break;

                case END_OF_FILE: /* Can't do anything */
                    break;
            }

            switch(nFileTwoStatus)
            {
                case NEED_RECORD: /* Read a record */
                    if (fgets(szInput, sizeof(szInput), InFile2) == NULL)
                    {
                        nFileTwoStatus = END_OF_FILE;
                    }
                    else
                    {
                        if (strcmp(szInput, szInput2) != 0)
                        {


                            strcpy(szInput2, szInput);
                            nFileTwoStatus = ALL_OK;
                        }
                    }
                    break;

                case ALL_OK: /* Nothing needed */
                    break;

                case END_OF_FILE: /* Can't do anything */
                    break;
            }
        }

        if (nFileOneStatus == END_OF_FILE)
        {
            if (nFileTwoStatus != END_OF_FILE)
            {
                fputs(szInput2, OutFile);
                nFileTwoStatus = NEED_RECORD;
            }
        }
        else
        {
            if (nFileTwoStatus == END_OF_FILE)
            {
                if (nFileOneStatus != END_OF_FILE)
                {
                    fputs(szInput1, OutFile);
                    nFileOneStatus = NEED_RECORD;
                }
            }
            else
            {
                nCompare = strcmp(szInput1, szInput2);

                if (nCompare < 0)
                {   /* File one is written */
                    fputs(szInput1, OutFile);
                    nFileOneStatus = NEED_RECORD;


                }
                else
                {
                    if (nCompare > 0)
                    {   /* File two is written */
                        fputs(szInput2, OutFile);
                        nFileTwoStatus = NEED_RECORD;
                    }
                    else
                    {   /* They are the same; write one and discard the other. */
                        fputs(szInput1, OutFile);
                        nFileOneStatus = NEED_RECORD;
                        nFileTwoStatus = NEED_RECORD;
                    }
                }
            }
        }
    }

    fclose(InFile1);
    fclose(InFile2);
    fclose(OutFile);

    return (0);
}

Purging duplicate records from a single file is not difficult. First the program reads a line. Then the program discards the line if it is the same as the previous line, or saves the line if it is different from the previous line. PURGFILE performs a merge and a purge at the same time, however, making the program a bit more complex.
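The single-file case described above can be sketched in a few lines. This is a minimal illustration, not part of PURGFILE: the function name purge_stream() and the MAX_LINE limit are assumptions made for the example, and the input is assumed to be sorted already so that duplicates are adjacent.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 512  /* assumed limit, same size as the listing's BIGEST_LINE */

/* Copy a sorted text stream to out, skipping any line that is the
 * same as the line before it. Returns the number of lines written. */
long purge_stream(FILE *in, FILE *out)
{
    char current[MAX_LINE];
    char previous[MAX_LINE] = "";
    int  have_previous = 0;   /* no line has been saved yet */
    long written = 0;

    while (fgets(current, sizeof(current), in) != NULL)
    {
        if (!have_previous || strcmp(current, previous) != 0)
        {
            /* Different from the previous line: save it */
            fputs(current, out);
            strcpy(previous, current);
            have_previous = 1;
            ++written;
        }
        /* Otherwise the line is a duplicate and is discarded */
    }
    return written;
}
```

Because each line is compared with the last line saved, a record repeated three or more times is still written only once.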

To use PURGFILE to purge a single file, you simply specify the same name twice or specify NUL: as the second filename. (A second input filename must be specified because the program expects three arguments, the third of which is the output filename.)

The flowchart in Figure 10.2 shows how the PURGFILE program works. The program does not use advanced techniques, so this section looks only at the flowchart, rather than each line of code.


Figure 10.2. The flowchart for PURGFILE.C.


As you can see in Figure 10.2, the program begins by opening the two input files and the output file. If there are no errors in the file-open stage, the program reads a record from each file (assuming that the program should read a record and that the program has not reached the end of the file).

After the records are read, the program makes its comparisons (taking into consideration possible end-of-file conditions), then writes the correct record. When the program has the same record from both files, it discards the second file’s record, sets the flag indicating that it needs a new record from the second file, and saves the first file’s record.

When the program reaches the end of both input files, it closes all the files and ends. It is a simple program that works quickly.

When you write a purge function, remember that a record might be repeated many times. When your program finds a duplicate and therefore reads a new record, it still must test to be sure that it has read a unique record. The program might be reading a third duplicate, for example, that must also be discarded.
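The following sketch shows one way to guard against repeated duplicates when merging. It is not the technique used in PURGFILE, which tracks a status flag for each file; instead, it works on two sorted arrays of strings already in memory and compares each candidate with the last record written. The function name merge_purge() and its parameters are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Merge two sorted arrays of strings into out, storing each distinct
 * string once. out must have room for na + nb pointers.
 * Returns the number of strings stored in out. */
size_t merge_purge(const char *a[], size_t na,
                   const char *b[], size_t nb,
                   const char *out[])
{
    size_t i = 0, j = 0, n = 0;

    while (i < na || j < nb)
    {
        const char *next;

        /* Take from a when b is exhausted or a's record is not larger */
        if (j >= nb || (i < na && strcmp(a[i], b[j]) <= 0))
            next = a[i++];
        else
            next = b[j++];

        /* Discard the record if it repeats the last one written; this
         * also catches third and later copies from either input. */
        if (n == 0 || strcmp(out[n - 1], next) != 0)
            out[n++] = next;
    }
    return n;
}
```

Merging {"a", "a", "c"} with {"a", "b", "c", "c"} this way stores three strings: "a", "b", and "c".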

Sorting, Merging, and Purging All in One

Usually, a single utility offers sort, merge, and purge functions. This type of utility accepts one or two input filenames, sorts the files, purges the duplicates, and produces a single output file.

A variation of a sort program is a sort that works on a file of any size. The process to create the ultimate sort follows:

1. Read the file, stopping at the end of the file or when there is no more free memory.

2. Sort this part of the file. Write the result of the sort to a temporary work file.

3. If the program has reached the end of the file and there are no more records to read in, the program renames step 2’s work file to the output file’s name and ends the program.

4. Again read the file, stopping when there is no more free memory or when the end of the file is reached.

5. Sort this part of the file. Write the result of the sort to a second temporary work file.

6. Merge the file created in step 2 with the file from step 5. Delete both of the files created by steps 2 and 5, and rename this new file using the name from step 2.

7. Go to step 3.
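The steps above can be sketched as follows. This is a simplified illustration under several assumptions: the names external_sort(), sorted_run(), and merge_runs() are invented for the example; a max_lines parameter stands in for "no more free memory"; every line is assumed to end with a newline and fit in MAX_LINE_LEN characters; and the work files come from tmpfile(), so the final rename in step 3 becomes a copy to the output stream.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LEN 512  /* assumed longest line, newline included */

static int compare_lines(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Steps 1-2 and 4-5: read up to max_lines lines, sort them, and write
 * them to a temporary work file. Returns 1 if a work file was written,
 * 0 if the input was already exhausted, and -1 on error. */
static int sorted_run(FILE *in, size_t max_lines, FILE **run)
{
    char (*lines)[MAX_LINE_LEN];
    size_t n = 0, i;
    int status = 0;

    *run = NULL;
    if (max_lines == 0 ||
        (lines = malloc(max_lines * sizeof *lines)) == NULL)
        return -1;
    while (n < max_lines && fgets(lines[n], MAX_LINE_LEN, in) != NULL)
        ++n;
    if (n > 0)
    {
        qsort(lines, n, sizeof *lines, compare_lines);
        if ((*run = tmpfile()) == NULL)
            status = -1;
        else
        {
            for (i = 0; i < n; ++i)
                fputs(lines[i], *run);
            rewind(*run);
            status = 1;
        }
    }
    free(lines);
    return status;
}

/* Step 6: merge two sorted work files into a new one (NULL on error). */
static FILE *merge_runs(FILE *a, FILE *b)
{
    char la[MAX_LINE_LEN], lb[MAX_LINE_LEN];
    FILE *out = tmpfile();
    int have_a, have_b;

    if (out == NULL)
        return NULL;
    have_a = fgets(la, sizeof la, a) != NULL;
    have_b = fgets(lb, sizeof lb, b) != NULL;
    while (have_a || have_b)
    {
        if (!have_b || (have_a && strcmp(la, lb) <= 0))
        {
            fputs(la, out);
            have_a = fgets(la, sizeof la, a) != NULL;
        }
        else
        {
            fputs(lb, out);
            have_b = fgets(lb, sizeof lb, b) != NULL;
        }
    }
    rewind(out);
    return out;
}

/* Sort the lines of in to out, holding at most max_lines lines in
 * memory at a time. Returns 0 on success, -1 on error. */
int external_sort(FILE *in, FILE *out, size_t max_lines)
{
    FILE *work, *next, *merged;
    char line[MAX_LINE_LEN];
    int status = sorted_run(in, max_lines, &work);            /* steps 1-2 */

    while (status == 1 &&
           (status = sorted_run(in, max_lines, &next)) == 1)  /* steps 4-5 */
    {
        merged = merge_runs(work, next);                      /* step 6 */
        fclose(work);            /* tmpfile() files vanish when closed */
        fclose(next);
        work = merged;
        if (work == NULL)
            status = -1;
    }                                                         /* step 7 */
    if (work != NULL)
    {
        if (status == 0)         /* step 3: the work file is the result */
            while (fgets(line, sizeof line, work) != NULL)
                fputs(line, out);
        fclose(work);
    }
    return status == 0 ? 0 : -1;
}
```

A real utility would size each pass by the memory actually available and would rename a named work file rather than copy it, as the steps describe.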

Linked Lists

A linked list is a group of data objects in which each object has a pointer to the next object in the group. Everything that you do with linked lists can be performed in memory or as part of a disk file.

Sometimes, sorting the data externally to the program (using the DOS SORT program) is not enough. When a user is entering data, it is never acceptable to stop the program, exit the program, run a sort, create a sorted file, then start the program again.

We have become accustomed to having the computer do the work for us, and rightly so. A program should not require the user to do anything that the program can perform without the user’s intervention.

There are alternatives when data must be sorted. For example, when the user enters an item, the program can pause and use the qsort() function to insert the new item into the current database. If the database is large, however, the pause could be so long that you could go get lunch! Even a simple insert at the beginning of a list can be time consuming—every record in the database must be moved. The size and number of these records can be the critical factor.
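To see why inserting into data stored contiguously is costly, consider this small sketch. It uses an array of int values in place of database records, and the function name insert_sorted() is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Insert value into the sorted array a[0..n-1], which must have room
 * for one more element. Returns the index where value was stored. */
size_t insert_sorted(int a[], size_t n, int value)
{
    size_t pos = 0;

    /* Find the first element larger than value */
    while (pos < n && a[pos] <= value)
        ++pos;

    /* Every element from pos onward moves up one slot. Inserting at
     * the front (pos == 0) moves all n elements. */
    memmove(&a[pos + 1], &a[pos], (n - pos) * sizeof a[0]);
    a[pos] = value;
    return pos;
}
```

With large records, or many of them, the memmove() call dominates the cost of each insert.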

Many programs must present the user’s data in a sorted format. Because speed is critical, sorting each time the data is displayed usually is unacceptable—the data must be stored in sorted order.

Many programs work to keep as much of the user’s current data as possible in memory. Searching a large quantity of data in memory should be not only quick, but instantaneous! If the data is not well organized, the search must be linear (record after record). On average, the program must look at half the records to find a matching record, assuming that the records are stored randomly.
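A linear search can be sketched as shown here. The function name linear_search() and the pExamined counter are assumptions added for illustration; the counter reports how many records the search had to look at.

```c
#include <stddef.h>
#include <string.h>

/* Search records[0..n-1] in order for key. Returns the index of the
 * first match, or -1 if there is none. *pExamined receives the number
 * of records that were looked at. */
long linear_search(const char *records[], size_t n,
                   const char *key, size_t *pExamined)
{
    size_t i;

    for (i = 0; i < n; ++i)
    {
        if (strcmp(records[i], key) == 0)
        {
            *pExamined = i + 1;
            return (long)i;
        }
    }
    *pExamined = n;   /* a failed search examines every record */
    return -1;
}
```

A match in the first position costs one comparison and a match in the last position costs n, which is where the average of about half the records comes from.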


In general, a linear search of a block of data or sorting after a data item has been added or edited is too slow and therefore inadequate.

The program’s data must be organized better than the order in which it was entered. One way to organize is to use a linked list. In a linked list, you start with a pointer that points to, or identifies, the first member of the list. Each member (except the last) has a pointer that points to the next member in the list. The last member’s pointer is a NULL pointer to indicate the end of the list. Often there is a separate pointer to the last member in the list—this enables you to add to the end of the list. A single linked list is shown in Figure 10.3.

Figure 10.3. A single linked list.
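In C, a list like this might be declared as shown in the following sketch. The type names Node and List, the szName field, and the list_append() function are all hypothetical names chosen for the example.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Node
{
    char         szName[32];  /* the member's data (hypothetical field) */
    struct Node *pNext;       /* next member, or NULL for the last member */
} Node;

typedef struct
{
    Node *pFirst;  /* first member, or NULL if the list is empty */
    Node *pLast;   /* last member, kept so that appending is quick */
} List;

/* Add a new member to the end of the list without walking the list.
 * Returns 1 on success, 0 if memory could not be allocated. */
int list_append(List *pList, const char *pszName)
{
    Node *pNew = malloc(sizeof *pNew);

    if (pNew == NULL)
        return 0;
    strncpy(pNew->szName, pszName, sizeof pNew->szName - 1);
    pNew->szName[sizeof pNew->szName - 1] = '\0';
    pNew->pNext = NULL;              /* the new member ends the list */

    if (pList->pLast == NULL)
        pList->pFirst = pNew;        /* the list was empty */
    else
        pList->pLast->pNext = pNew;  /* link the old last member to it */
    pList->pLast = pNew;
    return 1;
}
```

An empty list starts with both pointers set to NULL, for example List list = { NULL, NULL };.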

When you add a new member to a linked list, the program simply follows the list until it finds the member that will precede the new member and inserts the new member at that point. When the program must display sorted data to the user, it uses the linked list pointers to find the necessary data. Because the links are already sorted, the program’s performance is fast.
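An ordered insert of this kind could look like the following sketch. The Item type, its szKey field, and the insert_in_order() function are hypothetical names used only for this example; the list is kept in ascending strcmp() order.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Item
{
    char         szKey[32];  /* the value the list is ordered by */
    struct Item *pNext;      /* next member, or NULL at the end */
} Item;

/* Insert a new member into a list kept in ascending order.
 * ppFirst is the address of the pointer to the first member.
 * Returns the new member, or NULL if memory could not be allocated. */
Item *insert_in_order(Item **ppFirst, const char *pszKey)
{
    Item **ppLink = ppFirst;
    Item  *pNew = malloc(sizeof *pNew);

    if (pNew == NULL)
        return NULL;
    strncpy(pNew->szKey, pszKey, sizeof pNew->szKey - 1);
    pNew->szKey[sizeof pNew->szKey - 1] = '\0';

    /* Follow the links until the next member sorts after the new key */
    while (*ppLink != NULL && strcmp((*ppLink)->szKey, pNew->szKey) <= 0)
        ppLink = &(*ppLink)->pNext;

    /* Only two pointers change; no existing member is moved */
    pNew->pNext = *ppLink;
    *ppLink = pNew;
    return pNew;
}
```

Finding the insertion point still means walking the list, but once it is found no records have to be copied or shifted.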

Using Dynamic Memory

Often you must rely on dynamic memory allocation (memory allocated using one of the memory allocation functions) because you cannot tell how much data the user will provide. When allocating memory, the program must track each block of
