
10.1 Generating the Summaries

 

For each variable (or each designated variable) in the specified data file, checkvar creates a variable summary file, also called a list file, that contains histogram information showing the variable's distribution in the data file. You can optionally specify an output file in which a summary of processing activity is saved.

Variable summaries (list files) can be helpful for performing quality control checks of data. For example, you could run checkvar on an ASCII file, convert the file to binary, and then run checkvar on the binary file. The output from checkvar should be the same for both the ASCII and binary files. You can also use variable summaries to look at the data distribution in a data set before extracting data.

The checkvar command has the following form:

checkvar input_file [-f format_file] [-if input_format_file] [-of output_format_file]
    [-ft "title"] [-ift "title"] [-oft "title"] [-b local_buffer_size] [-c count]
    [-v var_file] [-q query_file] [-p precision] [-m maxbins] [-md missing_data_flag]
    [-mm] [-o processing_summary]

The checkvar program needs only an input format description; output format descriptions are ignored. Because checkvar ignores output formats, it performs no conversion even if conversion variables are included in the input or output formats.

For descriptions of the standard arguments (the first eleven arguments above), see Section 8.6.
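As an illustration of the command form, the following Python sketch assembles a checkvar argument list from a few of the options in the synopsis. The helper name checkvar_args and the file names are hypothetical, chosen only for this example; the flags themselves are those shown above.

```python
def checkvar_args(input_file, format_file=None, precision=None,
                  maxbins=None, missing_data_flag=None,
                  minmax_only=False, processing_summary=None):
    """Build a checkvar argument list (hypothetical helper, not part of checkvar)."""
    args = ["checkvar", input_file]
    if format_file is not None:
        args += ["-f", format_file]
    if precision is not None:
        args += ["-p", str(precision)]
    if maxbins is not None:
        args += ["-m", str(maxbins)]
    if missing_data_flag is not None:
        args += ["-md", str(missing_data_flag)]
    if minmax_only:
        args.append("-mm")
    if processing_summary is not None:
        args += ["-o", processing_summary]
    return args


# Hypothetical file names, for illustration only.
args = checkvar_args("latlon.dat", format_file="latlon.fmt",
                     precision=1, maxbins=50,
                     processing_summary="latlon.sum")
print(" ".join(args))
```

A list built this way could be passed to subprocess.run on a system where checkvar is installed.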

-p precision
Option flag followed by the number of decimal places. The number is the power of 10 by which data values are multiplied before binning: a value of 0 bins on ones, 1 on tenths, and so on. This option adjusts the resolution of the checkvar output. The default is 0; the maximum is 5.

NOTE: If you use the -p option on the command line, the precision set in the relevant format file is overridden. The precision in the format file serves as the default.
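The effect of the precision value can be pictured with a small Python sketch. It is not checkvar's implementation: the function name bin_counts is hypothetical, and flooring each scaled value to the lower edge of its bin is an assumption, since the rounding rule is not stated here.

```python
import math
from collections import Counter


def bin_counts(values, precision=0):
    """Count values per bin after scaling by 10**precision.

    Assumption: each scaled value is floored to the lower edge of its bin.
    """
    if not 0 <= precision <= 5:
        raise ValueError("precision must be between 0 and 5")
    scale = 10 ** precision
    counts = Counter()
    for v in values:
        # Multiply by the power of 10, drop the remainder, scale back.
        counts[math.floor(v * scale) / scale] += 1
    return counts


data = [12.34, 12.38, 12.71, 13.05]
coarse = bin_counts(data, precision=0)  # bins on ones
fine = bin_counts(data, precision=1)    # bins on tenths
```

With precision 0 the first three values share one bin; with precision 1 they separate into the 12.3 and 12.7 bins.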

-m maxbins
Option flag followed by the approximate maximum number of bins desired in checkvar output. The checkvar program keeps track of the number of bins filled as the data is processed. The smaller the number of bins, the faster checkvar runs. By keeping the number of bins small, you can check the gross aspects of data distribution rather than the details. The number of bins is adjusted dynamically as checkvar runs, depending on the distribution of data in the input file. If the number of filled bins exceeds 1.5 * maxbins, the width of the bins is doubled to keep the total number near the desired maximum. The default is 100 bins; the minimum is 6, and the value must be less than 10,000.

NOTE: The precision (-p) and maxbins (-m) options have no effect on character variables.
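The dynamic adjustment described for -m can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name histogram, the starting bin width of 1, and the way neighboring bins are merged after a doubling are all hypothetical; only the 1.5 * maxbins threshold, the doubling, and the limits on maxbins come from the description above.

```python
import math
from collections import Counter


def histogram(values, maxbins=100, width=1):
    """Bin values, doubling the bin width when too many bins are filled."""
    if not 6 <= maxbins < 10000:
        raise ValueError("maxbins must be at least 6 and less than 10,000")
    bins = Counter()  # maps the lower edge of each filled bin to its count
    for v in values:
        bins[math.floor(v / width) * width] += 1
        while len(bins) > 1.5 * maxbins:
            # Too many filled bins: double the width and merge the counts.
            width *= 2
            merged = Counter()
            for start, count in bins.items():
                merged[math.floor(start / width) * width] += count
            bins = merged
    return bins, width


bins, width = histogram(range(1000), maxbins=100)
```

For the 1000 consecutive integers in this example, the width is doubled three times, so the run ends with a bin width of 8 and 125 filled bins.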

-md missing_data_flag
Option flag followed by a flag value that checkvar should ignore across all variables in creating histogram data. Missing data flags are used in a data file to indicate missing or meaningless data. If you want checkvar to ignore more than one value, use the query (-q) option in conjunction with the variable file (-v) option.

-mm
Option flag indicating that only the maximum and minimum values of variables are calculated and displayed in the processing summary. Variable summary files are not created.

-o processing_summary
Option flag followed by the name of the file in which summary information displayed during processing is stored.
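To show why a missing data flag matters for the kind of minimum and maximum calculation that -mm reports, here is a minimal Python sketch. The function name min_max and the sample values (including -999 as the flag) are hypothetical, and comparing values to the flag by simple equality is an assumption about how the flag is matched.

```python
def min_max(values, missing_data_flag=None):
    """Return (minimum, maximum), skipping values equal to the flag."""
    kept = [v for v in values if v != missing_data_flag]
    if not kept:
        return None  # nothing left once flagged values are ignored
    return min(kept), max(kept)


# Hypothetical readings in which -999 marks missing data.
readings = [4.2, -999, 7.9, 5.1, -999]
with_flag = min_max(readings, missing_data_flag=-999)
without_flag = min_max(readings)
```

Without the flag, the placeholder value is reported as the minimum, which misrepresents the range of the real data.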

Tom Sgouros and James Gallagher, 2006-02-12
