File Formats

By Bill Allen

This page reports what the author has learned about some file formats used in the author's own programming. It covers only what has been worked with directly and makes no attempt to be complete. Parsing DDF files as part of USGS SDTS transfers is covered at length on the "SDTS Notes" page.


GZ - gzip

Python has native support for creating and opening .zip and .gz files, but I couldn't find a straightforward explanation in the Python docs or anywhere online about how to use that support for opening gzip files. So here is an answer specifically for pulling a .tar file out of a USGS SDTS .tar.gz file. The original .gz file is not disturbed. The minimum code needed to do the job in Python is...

fn = 'c:/yourdems/30_2_1_1009294.tar.gz'
import gzip
f = gzip.open(fn)            #open read-binary file f with name fn
g = open(fn[:-3],'wb')       #open new write-binary file without .gz suffix
while 1:
    chunk = f.read(1024)     #read a piece of the .gz file
    if not chunk:            #end of .gz file
        break
    g.write(chunk)           #write a piece of the .tar file
f.close()
g.close()

It doesn't get much simpler than that. Note that no attempt is made in this example to read the name of the stored file; the name is taken to be the same as the original xxxx.tar.gz less the ".gz" suffix. It is presumed that the .gz file contains only one file, a TAR (that's what "tarballs" are for: gzipping a single TAR which holds multiple files). Nothing is done with the TAR's original file date/time attribute or any embedded comments. Also, the gzip file's flags and extra flags are not examined. Those functions are not documented in Python (examine /Python/Lib/gzip.py for clues), and some are not supported. Very important to note is that this quick-and-dirty approach does no error checking of its own. The code neither compares the stored file length nor verifies the CRC32 checksum; it relies entirely on whatever the gzip module does internally, and any error that module raises goes uncaught.
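For a slightly more defensive variant, here is a minimal sketch in present-day Python 3. The function name ungzip() and the decision to wrap failures in a RuntimeError are my own assumptions for illustration, not part of PullSDTS. It assumes that the gzip module raises OSError, EOFError, or zlib.error when the data is damaged or truncated.

```python
import gzip
import shutil
import zlib

def ungzip(gz_path, out_path=None):
    """Decompress gz_path and return the path of the file written.

    If out_path is not given, the name is the original less ".gz",
    the same assumption made in the example above.
    """
    if out_path is None:
        if not gz_path.lower().endswith('.gz'):
            raise ValueError('expected a .gz file name: %r' % gz_path)
        out_path = gz_path[:-3]
    try:
        # Both files are closed automatically, even if an error occurs.
        with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)   # copies in chunks, like the loop above
    except (OSError, EOFError, zlib.error) as err:
        # Hypothetical policy: report any read/decompress failure as one error type.
        raise RuntimeError('could not ungzip %s: %s' % (gz_path, err))
    return out_path
```

Calling ungzip('c:/yourdems/30_2_1_1009294.tar.gz') would then write the .tar next to the original, which is left untouched.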

Carrying out the same ungzip task in other programming languages should be just as easy with an open-source library such as zlib, the library Python itself uses. It is available for many languages and is free and unrestricted for use by all, including commercial software developers (see zlib license).

The Python code for the author's PullSDTS utility demonstrates ungzipping for opening .tar.gz files, and others are welcome to borrow what they can use from it.

The gzip home page has a list of ready-to-use utilities. See also "Working with TAR/GZ" on the "MicroDEM for Artists" page.


RIB - RenderMan Interface Bytestream

A RIB file consists of ASCII RenderMan language commands and is used for 1) transporting a complete scene and frame description to a RenderMan-compliant renderer, or 2) storing resources. It is not commonly used for moving objects between 3D applications.

RenderMan at its inception over a decade ago was hailed as "the 3D PostScript," and the comparison is still apt. One is a "scene description" language and the other a "page description" language, and both literally program their target devices (a 3D renderer or a 2D printer); the user won't see a .rib or .ps file if commands are sent directly to the device. However, "encapsulated" PostScript (.ai and .eps) files are widely employed for graphics interchange, while the RenderMan equivalent, the RIB entity file, has little support. (It shares the .rib extension and is described in Appendix D.2 of the RenderMan Spec version 3.2.)

Resources


Some basic "gotchas"


TAR

TAR (Tape ARchive) files go back to the earliest minicomputers and the Unix operating system in the 1970s. A TAR file at its most basic is a simple concatenation of a group of files in an uncompressed and unchanged form. Compression is completely external to TAR, usually done with gzip.

TARs consist of 512-byte blocks. Each constituent file is preceded by a 512-byte header with the file's name and some other info, all in ASCII. Padding fills out any block that doesn't need all 512 bytes--header blocks, the empty part of a constituent file's last block, and sometimes one or more totally empty blocks.
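The block arithmetic can be sketched in a few lines of Python. The helper names padded_size() and padding_bytes() are made up for this illustration.

```python
BLOCK = 512  # TAR block size in bytes

def padded_size(file_size):
    """Bytes a constituent file occupies in the TAR, rounded up to whole blocks."""
    return (file_size + BLOCK - 1) // BLOCK * BLOCK

def padding_bytes(file_size):
    """Filler bytes that follow the file's real data in its last block."""
    return padded_size(file_size) - file_size

print(padded_size(1881))    # → 2048 (four blocks)
print(padding_bytes(1881))  # → 167
print(padding_bytes(1024))  # → 0 (exact multiple, no filler needed)
```

Add 512 for the header and you have the distance from one file's header to the next.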

There is a small catch in working with TAR headers: The file length and date/time are given in ASCII characters, but these are octal (a base-8 numbering system) rather than everyday decimal base-10 values. The date may not be too important, but you do need to know the file length to find where to stop reading from a file's last 512-byte block in the TAR file, otherwise you will get null values, garbage, or the real trickster--valid-looking repeat info that is just filler and must be ignored.

Example: Looking at the TAR header for the first file, 1168CATD.DDF, in the old 30m SDTS DEM .tar for the Ticaboo Mesa quad in southeastern Utah, one observes two 12-byte values for file size and date/time: 00000003531[null]06457215167[null]. Assigning the first string ("3531") to the variable fs and then using int(fs,8) tells us that the file size is 1,881 bytes. Notice that "8" fills the optional base parameter of Python's int() convert-to-integer function. To get the file date/time, import Python's time module, assign the next string ("6457215167") to variable dt, and call time.ctime(int(dt,8)). On a Windows or Unix machine set to Greenwich Mean Time (GMT, aka UTC) this returns "Wed Jan 14 20:05:11 1998"; on a machine set to any other time zone the string will differ, because time.ctime() reports local time. Compensation must be made for the local time zone, maybe for daylight saving time, and, on the Mac or Amiga, the difference between system epochs must also be factored in; see "Time for Python" on the "Python Notes" page.
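The same conversion as a short Python 3 sketch. The variable header_fields is only a stand-in for the 24 bytes quoted in the example; here time.gmtime() is used instead of time.ctime() so that the printed string is in UTC regardless of the machine's time zone.

```python
import time

# The two 12-byte fields from the example, each 11 octal digits plus a null.
header_fields = b'00000003531\x0006457215167\x00'

fs = header_fields[0:11].decode('ascii')    # file size, octal text
dt = header_fields[12:23].decode('ascii')   # date/time, octal text

size = int(fs, 8)    # base 8 turns the octal text into an integer
mtime = int(dt, 8)   # seconds since the Unix epoch

print(size)                                # → 1881
print(mtime)                               # → 884808311
print(time.asctime(time.gmtime(mtime)))    # → Wed Jan 14 20:05:11 1998
```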

To apply a time to the extracted file with Python requires using the os module's utime(path&name,(accessed,modified)) function, where the accessed and modified times must be of type float. This can be done whether the file is open or closed. So here, shown in simpler fashion than in the PullSDTS code, is how you can apply a file's original date/time to it as you extract it from a TAR, where "header" is a 512-byte TAR header:

import os, time
fn = os.path.join(yourFolder,theExtractedFile)
fn = os.path.abspath(fn)
modTime = int(header[136:147],8)          #read the octal epoch time as an integer
modTime = yourTimeAdjustFunction(modTime) #correct for local time & epoch
currTime = time.time()                    #get the current system time as a float
os.utime(fn,(currTime,float(modTime)))    #assign time attribute to the file

Unfortunately, utime() will cause an error on Mac Python 2.0 with OS9.

Unix time: The octal time/date stamp that the TAR header carries for each DDF file uses Unix epoch time, which is simply the number of seconds since 0 hour on 1 January 1970 UTC, the year that Unix first went into use as the DEC PDP-11 minicomputer's OS. (Yes, Unix time is headed for its own all-out-of-octals Y2K-style crisis in 2038, so start caching your survival supplies again!)
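To see where the 2038 date comes from, here is a small sketch that assumes the usual explanation: a seconds counter stored as a signed 32-bit integer, whose largest value is 2**31 - 1. The names EPOCH and last_second are mine.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)      # 0 hour, 1 January 1970 (UTC)
max_seconds = 2**31 - 1           # largest signed 32-bit value

last_second = EPOCH + timedelta(seconds=max_seconds)
print(last_second)                # → 2038-01-19 03:14:07

# The same value written in octal, as it would appear in a TAR header field.
print(oct(max_seconds)[2:])       # → 17777777777
```

That octal value is 11 digits long, so it still fits in the 11-digit date/time field of a TAR header.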

What identifies a TAR header block? A TAR begins with a header that precedes the first constituent file. And a header precedes each additional constituent file in the TAR.

Here is an algorithm for extracting files from a TAR, in Python opened with f = open(tarFilePath,'rb'):
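What follows is only a minimal Python 3 sketch of such an extraction loop, built from the header layout described on this page; it is not the PullSDTS code. The function name extract_tar() is hypothetical. It assumes the standard header offsets (name in bytes 0-99, size in bytes 124-135, date/time in bytes 136-147, type flag in byte 156), assumes an all-null block marks the end of the archive, and skips anything that is not a regular file.

```python
import os

BLOCK = 512  # TAR block size in bytes

def extract_tar(tar_path, out_dir):
    """Write each regular file in the TAR to out_dir.

    Returns a list of (name, size, mtime) tuples; mtime is the raw
    Unix epoch value from the header, left unadjusted.
    """
    extracted = []
    with open(tar_path, 'rb') as f:
        while True:
            header = f.read(BLOCK)
            # A short read or a totally empty block means no more files.
            if len(header) < BLOCK or header == b'\0' * BLOCK:
                break
            name = header[0:100].split(b'\0', 1)[0].decode('ascii')
            size = int(header[124:136].strip(b' \0') or b'0', 8)   # octal text
            mtime = int(header[136:148].strip(b' \0') or b'0', 8)  # octal text
            typeflag = header[156:157]
            padding = (BLOCK - size % BLOCK) % BLOCK
            if typeflag not in (b'0', b'\0'):
                f.seek(size + padding, 1)   # not a regular file: skip its data
                continue
            data = f.read(size)             # stop at the real file length
            f.seek(padding, 1)              # step over the filler in the last block
            out_path = os.path.join(out_dir, os.path.basename(name))
            with open(out_path, 'wb') as g:
                g.write(data)
            extracted.append((name, size, mtime))
    return extracted
```

Reading the whole file with one f.read(size) is fine for SDTS-sized files; for very large members a chunked copy would be kinder to memory. The returned mtime can be handed to os.utime() after whatever time adjustment is needed.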

Credit: When I was searching for help in understanding TAR files, this was the best of the few useful pages found, even though it apparently wasn't created for general consumption: www.datafocus.com/docs/man4/tar.4.asp


More Links

These resources are in addition to those listed above under specific file formats. This short list shares some key links found in searches related to the author's own programming efforts, and is not intended to be complete.


File Formats Page News

29 May 2001: RIB file info expanded into its own section.
4 May 2001: This page first posted.


Revised: 4 Jan 02 rev 0
http://www.3dartist.com/WP/formats/index.html
© Copyright 2001-02 Columbine, Inc. - All Rights Reserved
Any mentioned trademarks are the property of their respective owners.