lookimetrics.blogg.se


Contact: To respond to this post, send me an email. To the extent possible under law, Brendan Long has waived all copyright and related or neighboring rights to this work.


Note: Sorry about the capitalization inconsistency between ZIP and tar's names, but PKWARE calls the format ZIP and tar is consistently listed in lower-case, even at the start of sentences.

Just like gzip and bzip, xz and lzma can only compress single files (or data streams) as input.


Because of some optional padding in the file format, you can also update one piece of an XLSX file without having to rewrite the entire ZIP archive.

So going back to Excel, it seems like the reason they chose a more complicated file format is that it lets them get the best of both worlds: they deduplicate manually, but because the container is a ZIP, you can access a single sheet of an Excel workbook without having to read the entire (possibly large) file.
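Here is a minimal sketch of that random access using Python's standard `zipfile` module. The member names mimic the layout of an XLSX file, but the contents are placeholders and the helper names (`build_workbook_like_zip`, `read_one_member`) are made up for illustration.

```python
import io
import zipfile

def build_workbook_like_zip(members):
    """Build an in-memory ZIP with one independently compressed member per entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()

def read_one_member(zip_bytes, name):
    """Look the member up in the ZIP's central directory and decompress only it."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(name)

# Placeholder payloads, not real worksheet XML.
members = {
    "xl/worksheets/sheet1.xml": b"<sheet>first</sheet>" * 1000,
    "xl/worksheets/sheet2.xml": b"<sheet>second</sheet>" * 1000,
}
zip_bytes = build_workbook_like_zip(members)
sheet2 = read_one_member(zip_bytes, "xl/worksheets/sheet2.xml")
print(len(sheet2), "bytes read from sheet2 without decompressing sheet1")
```

With a tar.gz there is no equivalent shortcut: reaching the second member means decompressing everything in front of it.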

ZIP seems to do about the same as gzip on compression, and given its superior random access, it seems strictly better than tar + gzip.

On the other hand, xz does notice the duplication and completely eliminates it: a tar file with three copies of our file compresses to almost exactly the same size as the file by itself.

gzip's dictionary size is just too small to deduplicate this size of file, and given that the file isn't particularly large, I suspect tar + gzip is very rarely going to significantly outperform ZIP.
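The dictionary-size effect can be seen without building any archives. This is a rough sketch, assuming random bytes as a stand-in for real data and using `zlib` (which implements DEFLATE, the algorithm gzip uses) and `lzma` from the Python standard library. The block sizes and helper names are arbitrary choices for illustration.

```python
import lzma
import random
import zlib

def random_block(size, seed):
    """Deterministic, essentially incompressible bytes (a stand-in for real data)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))

def deflate(data):
    # zlib implements DEFLATE, the same algorithm gzip uses.
    return zlib.compress(data, 9)

def xz(data):
    # Default preset, written as an .xz stream.
    return lzma.compress(data)

def growth(compress, block):
    """How much bigger compress(block + block) is than compress(block)."""
    return len(compress(block + block)) / len(compress(block))

near = random_block(8 * 1024, seed=1)    # the repeat starts 8 KB back
far = random_block(128 * 1024, seed=2)   # the repeat starts 128 KB back

print(f"DEFLATE, 8 KB block doubled:   {growth(deflate, near):.2f}x")
print(f"DEFLATE, 128 KB block doubled: {growth(deflate, far):.2f}x")
print(f"xz, 128 KB block doubled:      {growth(xz, far):.2f}x")
```

A growth factor near 1 means the second copy was deduplicated almost for free, and a factor near 2 means the compressor never noticed it. DEFLATE should only manage the former when the repeat falls inside its 32 KB window.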

One caveat of tar's better compression is that it depends on the compression algorithm and the size of the files. For example, gzip can't find duplicates more than 32 KB apart, but xz can (max is something like 768 MB apart). Depending on how big your archives are, it may be useful to manually sort similar files to be close together (for example, group all text files together).

I created several archives using /usr/share/dict/words (a list of ~470,000 English words). In each archive, I put 1-3 copies of the file, and then for tar I applied several kinds of compression. For ZIP and gz, I used the -9 option to increase compression, but it didn't seem to have any effect with xz.
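A test along these lines can be approximated with Python's standard `tarfile` and `zipfile` modules. This is only a sketch and not the exact commands used for this post. It assumes a synthetic word list (`fake_word_list`, a made-up helper) in place of /usr/share/dict/words so that it runs anywhere, so the absolute sizes will not match the real file.

```python
import io
import random
import tarfile
import zipfile

def fake_word_list(count, seed=0):
    """Stand-in for /usr/share/dict/words: pseudo-random lowercase 'words'."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    words = ("".join(rng.choice(letters) for _ in range(rng.randint(3, 10)))
             for _ in range(count))
    return "\n".join(words).encode("ascii")

def tar_size(payload, copies, mode):
    """Size of a compressed tar ('w:gz' or 'w:xz') holding `copies` copies."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as archive:
        for i in range(copies):
            info = tarfile.TarInfo(name=f"words{i}.txt")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return len(buffer.getvalue())

def zip_size(payload, copies):
    """Size of a ZIP (DEFLATE, level 9) holding `copies` copies."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as archive:
        for i in range(copies):
            archive.writestr(f"words{i}.txt", payload)
    return len(buffer.getvalue())

payload = fake_word_list(50_000)  # a few hundred KB, far larger than 32 KB
sizes = {
    copies: {
        "zip": zip_size(payload, copies),
        "tar.gz": tar_size(payload, copies, "w:gz"),
        "tar.xz": tar_size(payload, copies, "w:xz"),
    }
    for copies in (1, 2, 3)
}
for copies, row in sizes.items():
    print(copies, row)
```

The thing to look at is how each column grows as copies are added: ZIP and tar.gz should grow roughly linearly, while tar.xz should stay close to the single-copy size.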

I've been working on an OCaml library to read XLSX files, and something I thought was odd is that all strings in an Excel workbook are listed in a "shared strings" file and then referenced by index. This seemed strange to me, since I would expect the compression algorithm to do this kind of work for you, but thinking about it made me better understand why that's necessary, and also what the advantages and disadvantages of the ZIP and tar + compression formats are.

The main difference between the two formats is that in ZIP, compression is built-in and happens independently for every file in the archive, but for tar, compression is an extra step that compresses the entire archive. The advantage of ZIP is that you have random access to the files in the ZIP without having to decompress the whole thing, but as a side effect, files don't share their compression dictionaries. On the other hand, tar files can get automatic deduplication because gzip and xz see the entire tar file as one continuous file. Unfortunately, being one continuous compressed file means that if you want to read the last file in a tar.gz, you have to read and decompress the whole thing.
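The "shared strings" idea is essentially manual deduplication: store each distinct string once and have cells refer to it by position. A minimal sketch of the concept follows. The function names are hypothetical and this does not reproduce the actual XML that XLSX uses.

```python
def build_shared_strings(cells):
    """Replace each string cell with an index into a deduplicated table."""
    table = []      # unique strings, in first-seen order
    index_of = {}   # string -> its position in `table`
    refs = []       # one index per original cell
    for value in cells:
        if value not in index_of:
            index_of[value] = len(table)
            table.append(value)
        refs.append(index_of[value])
    return table, refs

def resolve(table, refs):
    """Turn the index references back into the original strings."""
    return [table[i] for i in refs]

cells = ["apple", "banana", "apple", "apple", "cherry", "banana"]
table, refs = build_shared_strings(cells)
print(table)  # ['apple', 'banana', 'cherry']
print(refs)   # [0, 1, 0, 0, 2, 1]
```

Because the table lives in its own file inside the ZIP, every sheet can reference the same strings even though each sheet is compressed independently.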