
Contact: To respond to this post, send me an email.

To the extent possible under law, Brendan Long has waived all copyright and related or neighboring rights to this work.
#Lzip vz xz zip#
Note: Sorry about the inconsistent capitalization of ZIP and tar, but PKWARE calls the format ZIP and tar is consistently listed in lower-case, even at the start of sentences.

Just like gzip and bzip2, xz and lzma can only compress single files (or data streams) as input.
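As a rough illustration of the single-stream point, here is a minimal Python sketch using only the standard library's `lzma` and `tarfile` modules. The file names and contents are made up for the example. The compressor only ever sees bytes, so storing several files means bundling them with tar first.

```python
import io
import lzma
import tarfile

# xz/lzma works on one byte stream: bytes in, bytes out, no file names.
stream = b"hello, xz\n" * 1000
packed = lzma.compress(stream)
assert lzma.decompress(packed) == stream

# To store several files, bundle them with tar and compress the bundle.
# These member names and contents are hypothetical.
members = {"a.txt": b"first file\n", "b.txt": b"second file\n"}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tf:
    for name, data in members.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# Reading it back: tar supplies the file names, xz only supplies compression.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:xz") as tf:
    names = tf.getnames()
    restored = {n: tf.extractfile(n).read() for n in names}
```

Here `names` and `restored` come from the tar layer. The xz layer on its own has no notion of separate files.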
#Lzip vz xz update#
Because of some optional padding in the file format, you can also update one piece of an XLSX file without having to rewrite the entire ZIP archive.

So going back to Excel, it seems like the reason they chose a more complicated file format is that it lets them get the best of both worlds: they deduplicate manually, but as a result of using ZIP, you can access a single sheet of an Excel workbook without having to read the entire (possibly large) file.
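To make the "deduplicate manually, then read one piece" idea concrete, here is a small Python sketch. It is not the real XLSX layout: actual workbooks store XML parts, while the member names (`sharedStrings.json`, `sheet1.json`) and the JSON encoding below are assumptions made up for the example.

```python
import io
import json
import zipfile


def intern_strings(sheets):
    """Replace each cell string with an index into one shared table."""
    table, index, encoded = [], {}, {}
    for name, cells in sheets.items():
        row = []
        for text in cells:
            if text not in index:
                index[text] = len(table)
                table.append(text)
            row.append(index[text])
        encoded[name] = row
    return table, encoded


sheets = {"sheet1": ["yes", "no", "yes"], "sheet2": ["no", "maybe", "yes"]}
table, encoded = intern_strings(sheets)
# table   == ["yes", "no", "maybe"]
# encoded == {"sheet1": [0, 1, 0], "sheet2": [1, 2, 0]}

# Each piece becomes its own ZIP member, compressed independently.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sharedStrings.json", json.dumps(table))
    for name, cells in encoded.items():
        zf.writestr(f"{name}.json", json.dumps(cells))

# Reading sheet2 only touches the shared table and that one member;
# sheet1 is never decompressed.
with zipfile.ZipFile(buf) as zf:
    strings = json.loads(zf.read("sharedStrings.json"))
    cells = json.loads(zf.read("sheet2.json"))
sheet2 = [strings[i] for i in cells]
```

The shared table does by hand what a whole-archive compressor would otherwise do automatically, which is what keeps random access to each sheet possible.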

I've been working on an OCaml library to read XLSX files, and something I thought was odd is that all strings in an Excel workbook are listed in a "shared strings" file and then referenced by index. This seemed strange to me, since I would expect the compression algorithm to do this kind of work for you, but thinking about it made me better understand why that's necessary, and also what the advantages and disadvantages of the ZIP and tar + compression formats are.

The main difference between the two formats is that in ZIP, compression is built-in and happens independently for every file in the archive, but for tar, compression is an extra step that compresses the entire archive. The advantage of ZIP is that you have random access to the files in the ZIP, without having to decompress the whole thing, but as a side effect, files don't share their compression dictionaries. On the other hand, tar files can get automatic deduplication because gzip and xz see the entire tar file as one continuous file. Unfortunately, being one continuous compressed file means that if you want to read the last file in a tar.gz, you have to read and decompress the whole thing.
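The trade-off between per-file and whole-archive compression is easy to see in a small experiment. The Python sketch below is only an illustration under some assumptions: five identical files of random bytes, each small enough to fit inside gzip's 32 KB window, with made-up names. It uses the standard `zipfile` and `tarfile` modules.

```python
import io
import os
import tarfile
import zipfile

# Random bytes do not compress on their own, so any savings must come
# from noticing that the files are identical to each other.
payload = os.urandom(20_000)
files = {f"copy{i}.bin": payload for i in range(5)}

# ZIP: each member is deflated independently, so duplicates are not shared.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# tar.gz: gzip sees one continuous stream and can refer back to earlier copies.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

zip_size = len(zip_buf.getvalue())
targz_size = len(tar_buf.getvalue())
print(f"zip: {zip_size} bytes, tar.gz: {targz_size} bytes")

# ZIP's side of the trade: the last member can be read directly through
# the central directory, without decompressing the four before it.
with zipfile.ZipFile(zip_buf) as zf:
    last = zf.read("copy4.bin")
```

With these inputs the ZIP should come out at roughly five times the payload size, while the tar.gz stays close to one copy. The gap shrinks if the duplicates are farther apart than the compressor's window, which is one reason xz, with its much larger dictionary, tends to deduplicate better on big archives.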
