Java text compression  
Author Message
Chris





PostPosted: 2007-11-19 3:45:00 Top

java-programmer, Java text compression What's the fastest way to compress/decompress text?

We're doing that over really large datasets in our app. We're currently
converting char arrays to byte arrays using our own UTF-8 conversion
code, and then compressing the bytes using java.util.zip. The code is
pretty old.

I don't like this two-step process, and the profiler shows that this is
a bottleneck in our app.

Is anyone aware of any code that compresses chars directly? Perhaps a
third-party library that does it faster?

In our particular situation, decompression speed is a lot more important
than compression speed.
 
Joshua Cranmer





PostPosted: 2007-11-19 4:14:00 Top

java-programmer >> Java text compression Chris wrote:
> What's the fastest way to compress/decompress text?

I know of a compression method that can compress in constant time: it
replaces the entire text with a single `0' (hey, it's lossy!). To what
degree of compression do you need?

> We're doing that over really large datasets in our app. We're currently
> converting char arrays to byte arrays using our own UTF-8 conversion
> code, and then compressing the bytes using java.util.zip. The code is
> pretty old.
>
> I don't like this two-step process, and the profiler shows that this is
> a bottleneck in our app.

Questions:
1. From whence are the char arrays coming?
2. Where are the byte arrays going?
3. Which part of the process is the bottleneck? The UTF-8 conversion, or
the compression?

> Is anyone aware of any code that compresses chars directly? Perhaps a
> third-party library that does it faster?
>
> In our particular situation, decompression speed is a lot more important
> than compression speed.


--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
 
Eric Sosman





PostPosted: 2007-11-19 4:16:00 Top

java-programmer >> Java text compression Chris wrote:
> What's the fastest way to compress/decompress text?

If you're really interested in "the fastest way" to the
exclusion of all other concerns, then don't compress at all.
Bingo! Problem solved!

You might be happier with a compression scheme that did
a little better at reducing the size of the data, but now you
can't get a sensible answer until you describe the trade-offs
you're willing to make. For example, if you were offered a
compression scheme that ran ten percent faster than your current
method but emitted fifteen percent more data, would you take it
or reject it?

> We're doing that over really large datasets in our app. We're currently
> converting char arrays to byte arrays using our own UTF-8 conversion
> code, and then compressing the bytes using java.util.zip. The code is
> pretty old.
>
> I don't like this two-step process, and the profiler shows that this is
> a bottleneck in our app.
>
> Is anyone aware of any code that compresses chars directly? Perhaps a
> third-party library that does it faster?

How badly do you need your own idiosyncratic UTF-8 conversion?
If you can use standard methods, consider wrapping the compressed
streams, somewhat like

Writer w = new OutputStreamWriter(
new GZIPOutputStream(...));
w.write("Hello, world!");

BufferedReader r = new BufferedReader(
new InputStreamReader(
new GZIPInputStream(...)));
String s = r.readLine();

You'll have to make your own assessment of the speeds and the
degree of compression.

> In our particular situation, decompression speed is a lot more important
> than compression speed.

"You'll have to make your own assessment ..."

--
Eric Sosman
email***@***.com
 
 
Roedy Green





PostPosted: 2007-11-19 4:42:00 Top

java-programmer >> Java text compression On Sun, 18 Nov 2007 13:44:44 -0600, Chris <email***@***.com>
wrote, quoted or indirectly quoted someone who said :

>What's the fastest way to compress/decompress text?

You can try GZIP. See http://mindprod.com/applet/fileio.html
for sample code. It is simpler. It might be faster.

Note that zip has a compression number where you can trade off speed
for compression.

If you have a limited vocabulary, you might try just looking up words
and converting to 16-bit index numbers.

Each number is a word + a trailing space.

see http://mindprod.com/project/supercompressor.html
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 
 
Arne Vajh鴍





PostPosted: 2007-11-19 5:25:00 Top

java-programmer >> Java text compression Roedy Green wrote:
> On Sun, 18 Nov 2007 13:44:44 -0600, Chris <email***@***.com>
>> What's the fastest way to compress/decompress text?

Not quoted:
#We're currently converting char arrays to byte arrays using our own
#UTF-8 conversion code, and then compressing the bytes using
#java.util.zip.

> You can try GZIP. See http://mindprod.com/applet/fileio.html
> for sample code. It is simpler. It might be faster.

ZIP and GZIP uses the same basic algorithm - the difference
is just in the packaging where zip supports multiple files and
store file meta data while gzip does not.

Arne
 
 
Robert Klemme





PostPosted: 2007-11-19 5:37:00 Top

java-programmer >> Java text compression On 18.11.2007 21:16, Eric Sosman wrote:
> Chris wrote:
>> What's the fastest way to compress/decompress text?
>
> If you're really interested in "the fastest way" to the
> exclusion of all other concerns, then don't compress at all.
> Bingo! Problem solved!
>
> You might be happier with a compression scheme that did
> a little better at reducing the size of the data, but now you
> can't get a sensible answer until you describe the trade-offs
> you're willing to make. For example, if you were offered a
> compression scheme that ran ten percent faster than your current
> method but emitted fifteen percent more data, would you take it
> or reject it?

Bonus question for OP: what is the size of data sets and how are they
used? Especially, where are they stored?

>> We're doing that over really large datasets in our app. We're
>> currently converting char arrays to byte arrays using our own UTF-8
>> conversion code, and then compressing the bytes using java.util.zip.
>> The code is pretty old.
>>
>> I don't like this two-step process, and the profiler shows that this
>> is a bottleneck in our app.
>>
>> Is anyone aware of any code that compresses chars directly? Perhaps a
>> third-party library that does it faster?
>
> How badly do you need your own idiosyncratic UTF-8 conversion?
> If you can use standard methods, consider wrapping the compressed
> streams, somewhat like
>
> Writer w = new OutputStreamWriter(
> new GZIPOutputStream(...));

Minor detail: The encoding is missing, so in this case it would rather be

new OutputStreamWriter(new GZIPOutputstream(...), "UTF-8")

> w.write("Hello, world!");
>
> BufferedReader r = new BufferedReader(
> new InputStreamReader(
> new GZIPInputStream(...)));
> String s = r.readLine();
>
> You'll have to make your own assessment of the speeds and the
> degree of compression.

This is the solution I was going to suggest as well. If the custom
encoding yields the same results as Java's built in UTF-8 then I would
immediately switch to this approach of stacked streams. If the custom
encoding yields slightly different results I am sure you can plug in the
custom encoding into the standard Java io and nio classes with a little
effort.

If data is to be stored in a BLOB via JDBC you can even extend this
approach to directly stream into the database.

>> In our particular situation, decompression speed is a lot more
>> important than compression speed.
>
> "You'll have to make your own assessment ..."

Decompression is generally much faster than compression. I believe
there is not much difference in decompression speed when decompressing a
GZIP stream that was compressed with highest and lowest level. If you
dig a bit into compression theory then it's pretty obvious that finding
a small compressed representation is significantly harder than
converting a compressed data set back.

Kind regards

robert
 
 
Chris





PostPosted: 2007-11-19 6:20:00 Top

java-programmer >> Java text compression Joshua Cranmer wrote:
> Chris wrote:
>> What's the fastest way to compress/decompress text?
>
> I know of a compression method that can compress in constant time: it
> replaces the entire text with a single `0' (hey, it's lossy!). To what
> degree of compression do you need?

As with all compression apps, as much as possible. It's hard to pick a
number until you know the available tradeoffs. We already know the
tradeoffs with java.util.zip, so obviously we're looking for something
faster.

> Questions:
> 1. From whence are the char arrays coming?

Memory. They get streamed from an external source, compressed, and
written to disk as they come in.

> 2. Where are the byte arrays going?

Also memory, although ultimately to the UI.

> 3. Which part of the process is the bottleneck? The UTF-8 conversion, or
> the compression?

Both, though the UTF-8 is the bigger burden.
 
 
Chris





PostPosted: 2007-11-19 6:28:00 Top

java-programmer >> Java text compression Eric Sosman wrote:
> Chris wrote:
>> What's the fastest way to compress/decompress text?
>
> If you're really interested in "the fastest way" to the
> exclusion of all other concerns, then don't compress at all.
> Bingo! Problem solved!

Um, actually no. When you're dealing with really large datasets it's
generally faster to compress than not to compress. The reason is that
compressed data requires less disk I/O.

Not always, of course; really processor-intensive compression schemes
can make you processor-bound. But in general, that's the case.

Yup, we've benchmarked it.


> You might be happier with a compression scheme that did
> a little better at reducing the size of the data, but now you
> can't get a sensible answer until you describe the trade-offs
> you're willing to make. For example, if you were offered a
> compression scheme that ran ten percent faster than your current
> method but emitted fifteen percent more data, would you take it
> or reject it?

Yes, but I don't think that knowing that is essential to answering the
question. The question is really about finding better tradeoffs than we
can get with char conversion + zip conversion.
 
 
Chris





PostPosted: 2007-11-19 6:35:00 Top

java-programmer >> Java text compression > Bonus question for OP: what is the size of data sets and how are they
> used? Especially, where are they stored?

Multi-terabyte sized, split across multiple machines. On a single
machine, generally not more than a few hundred Gb. One or two disks per
machine, SATA, no RAID.

At compression time, the data is streamed from an external source,
transformed in memory, and written to disk.

At decompression time, the app seeks to the particular block of text of
interest and decompresses it. Seek time dominates decompression time,
*except* when we do heavy caching, in which case the decompression
becomes the bottleneck. Storing the decompressed text in memory takes up
too much space. Has to be cached in compressed form.

 
 
Eric Sosman





PostPosted: 2007-11-19 7:30:00 Top

java-programmer >> Java text compression Chris wrote:
>> Bonus question for OP: what is the size of data sets and how are they
>> used? Especially, where are they stored?
>
> Multi-terabyte sized, split across multiple machines. On a single
> machine, generally not more than a few hundred Gb. One or two disks per
> machine, SATA, no RAID.
>
> At compression time, the data is streamed from an external source,
> transformed in memory, and written to disk.
>
> At decompression time, the app seeks to the particular block of text of
> interest and decompresses it. Seek time dominates decompression time,
> *except* when we do heavy caching, in which case the decompression
> becomes the bottleneck. Storing the decompressed text in memory takes up
> too much space. Has to be cached in compressed form.

Cutting the text into blocks usually makes the compression
less effective. Most compression schemes nowadays (and I believe
DEFLATE is among them) are adaptive, meaning that they adjust to
the characteristics of the data stream as they process it. Thus,
they compress relatively poorly at first, then improve as they
learn more about the statistical profile of the data.

Implications: (1) You'll get better compression if you can
keep the blocks "fairly long." (1a) You might make the blocks
"long" by concatenating multiple sub-blocks, at the expense of
needing to decompress from the start of a block even if you only
need the sub-block at its end. (2) If the blocks simply must be
small, you probably shouldn't waste effort on BEST_COMPRESSION.

Some interesting experiments are in order.

--
Eric Sosman
email***@***.com
 
 
Roedy Green





PostPosted: 2007-11-19 16:51:00 Top

java-programmer >> Java text compression On Sun, 18 Nov 2007 19:54:10 -0800, email***@***.com (Mark Rafn) wrote,
quoted or indirectly quoted someone who said :

>This will at least get you a baseline, and it'll be fairly easy to use other
>DeflaterOutputStream implementations.

Google points out:

http://teatrove.sourceforge.net/javadoc/com/go/trove/util/DeflaterOutputStream.html

--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 
 
rossum





PostPosted: 2007-11-19 18:00:00 Top

java-programmer >> Java text compression On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner
<gneuner2/@/comcast.net> wrote:

>This sounds like it might be DNA sequences.
If it is DNA sequences, then you might be able to use something like
Base64 for initial coding:

A -> 00
C -> 01
G -> 10
T -> 11

AAA -> 000000
...
ACG -> 000110
...
TTT -> 111111

Those six bit codes are a straight match for Base64 coding. That will
compress three bytes "AAC" into one "B" (000001 -> B in Base64).

rossum

 
 
Andrew Thompson





PostPosted: 2007-11-19 19:14:00 Top

java-programmer >> Java text compression rossum wrote:
>>This sounds like it might be DNA sequences.
>If it is DNA sequences, ..

Sounds more like a heap of speculation over a 'cagey' problem
specification.

To clarify the 'cagey' aspect, perhaps the OP can indeed express
what the ulitmate point of this exercise is, rather than doling out
'precious little tid-bits' that supposedly represent the actual problem.

(Ultimately, the OP's own posts identify that they are entirely
unaware of the subtleties of the problems they face, and until
they fill in more detail, it really constitutes an 'interesting but
inefficient use of bandwidth' to speculate further.)

--
Andrew Thompson
http://www.physci.org/

Message posted via JavaKB.com
http://www.javakb.com/Uwe/Forums.aspx/java-general/200711/1

 
 
Roedy Green





PostPosted: 2007-11-19 19:41:00 Top

java-programmer >> Java text compression On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner
<gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone
who said :

>>> Multi-terabyte sized, split across multiple machines. On a single
>>> machine, generally not more than a few hundred Gb. One or two disks per
>>> machine, SATA, no RAID.
>
>This sounds like it might be DNA sequences.

If they are, convert each letter to 3 bits, then pack them 21 to a
long, wasting the sign bit.
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 
 
Eric Sosman





PostPosted: 2007-11-19 22:45:00 Top

java-programmer >> Java text compression George Neuner wrote:
> On Sun, 18 Nov 2007 18:30:19 -0500, Eric Sosman
> <email***@***.com> wrote:
>
>> Chris wrote:
>>>> Bonus question for OP: what is the size of data sets and how are they
>>>> used? Especially, where are they stored?
>>> Multi-terabyte sized, split across multiple machines. On a single
>>> machine, generally not more than a few hundred Gb. One or two disks per
>>> machine, SATA, no RAID.
>
> This sounds like it might be DNA sequences.
> [...]
>
> Most LZ based implementations (including DEFLATE) limit codes to
> 16-bits (I've heard of 32-bit LZ, but I've never seen it).
> Compression studies have shown that, on average, the 16-bit code
> dictionary will be filled after processing 200KB of input.
>
> If the remainder of the input is characteristically similar to the
> part already encoded, the full dictionary will compress the rest of
> the input pretty well. But most input varies as it goes, sometimes
> rapidly and drastically, so it does make sense to segment the input to
> take advantage of the variation.
> [...]
> If we are talking about DNA sequences, then I would probably go for
> 256KB - once the base nucleotides and amino acid sequences are in the
> dictionary (and you can guarantee this by preloading them),
> compression is typically very good (80+%), so it makes sense to not
> worry about it and just pick a convenient sized buffer to work with.

If you're right, the alphabet is so small that I'd question
the need for the UTF-8 conversion. DEFLATE will quickly learn
that every other byte is a zero, and will compress them very well.

--
Eric Sosman
email***@***.com
 
 
rossum





PostPosted: 2007-11-20 0:38:00 Top

java-programmer >> Java text compression On Mon, 19 Nov 2007 11:41:29 GMT, Roedy Green
<email***@***.com> wrote:

>On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner
><gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone
>who said :
>
>>>> Multi-terabyte sized, split across multiple machines. On a single
>>>> machine, generally not more than a few hundred Gb. One or two disks per
>>>> machine, SATA, no RAID.
>>
>>This sounds like it might be DNA sequences.
>
>If they are, convert each letter to 3 bits,
Two bits, there are only four letters.

rossum

>then pack them 21 to a
>long, wasting the sign bit.

 
 
Chris





PostPosted: 2007-11-20 11:55:00 Top

java-programmer >> Java text compression Andrew Thompson wrote:
> rossum wrote:
>>> This sounds like it might be DNA sequences.
>> If it is DNA sequences, ..
>
> Sounds more like a heap of speculation over a 'cagey' problem
> specification.
>
> To clarify the 'cagey' aspect, perhaps the OP can indeed express
> what the ulitmate point of this exercise is, rather than doling out
> 'precious little tid-bits' that supposedly represent the actual problem.
>
> (Ultimately, the OP's own posts identify that they are entirely
> unaware of the subtleties of the problems they face, and until
> they fill in more detail, it really constitutes an 'interesting but
> inefficient use of bandwidth' to speculate further.)
>

I started to write a long, irritable response to this, but I just don't
have time. Maybe I'll start a new thread later on twerps whose snotty,
fatuous responses waste bandwidth.

I'll say only this:

1. The problem is fully specified. Text compression is a known problem,
and I just asked if anyone knew of a Java library that had better
tradeoffs than UTF-8 + zip.

2. Text means text, not DNA. Written words.

3. I'm fully aware of the subtleties of this particular problem space,
and there is nothing in any of the posts so far that I haven't
considered and already benchmarked. (Including block sizes. For this
app, ~10k is optimal).

4. You don't need to know about the environment or the rest of the app,
because I'm not asking for a damn consultant.

5. Asking "how much compression I want" is just stupid.

In short, a narrow question is an invitation for a narrow answer, not an
invitation for you to tell me how you think I should write this app.
 
 
Eric Sosman





PostPosted: 2007-11-20 12:15:00 Top

java-programmer >> Java text compression Chris wrote:
> [...]
> 1. The problem is fully specified. Text compression is a known problem,
> and I just asked if anyone knew of a Java library that had better
> tradeoffs than UTF-8 + zip.
>
> 2. Text means text, not DNA. Written words.

Elsethread you've explained that the compressed stream
gets read back into a companion program and decompressed there;
this suggests that it doesn't need to be exchanged with "foreign"
programs. In which case, I ask again: Does UTF-8 encoding buy
you enough additional compression to justify its expense? How
bad would things be if you just handed 16-bit chars to the
compressor with no "intelligence" whatsoever?

> 5. Asking "how much compression I want" is just stupid.

Well, you asked about compression speed. Other things
being equal, faster compressors compress less well and "looser"
compressors compress faster, so the question of "how much" must
eventually arise when you weigh alternatives.

--
Eric Sosman
email***@***.com
 
 
Chris





PostPosted: 2007-11-20 12:51:00 Top

java-programmer >> Java text compression Eric Sosman wrote:
> Chris wrote:
>> [...]
>> 1. The problem is fully specified. Text compression is a known
>> problem, and I just asked if anyone knew of a Java library that had
>> better tradeoffs than UTF-8 + zip.
>>
>> 2. Text means text, not DNA. Written words.
>
> Elsethread you've explained that the compressed stream
> gets read back into a companion program and decompressed there;
> this suggests that it doesn't need to be exchanged with "foreign"
> programs. In which case, I ask again: Does UTF-8 encoding buy
> you enough additional compression to justify its expense? How
> bad would things be if you just handed 16-bit chars to the
> compressor with no "intelligence" whatsoever?
>

I'd like to try that. Unfortunately, java.util.zip.Deflater accepts only
byte arrays, not char arrays. I suppose it might be faster to copy the
chars to 2-byte sequences and compress, rather than run the UTF-8
compressor. An extra step, but worth a try.


>> 5. Asking "how much compression I want" is just stupid.
>
> Well, you asked about compression speed. Other things
> being equal, faster compressors compress less well and "looser"
> compressors compress faster, so the question of "how much" must
> eventually arise when you weigh alternatives.
>

Of course. It just reminded of walking into a store and having the clerk
ask "how much do you want to pay?" The right answer is, "show me the
merchandise and I'll figure out what the tradeoffs are on my own".
 
 
Andrew Thompson





PostPosted: 2007-11-20 13:02:00 Top

java-programmer >> Java text compression Chris wrote:
>>> [...]
..
>Of course. It just reminded of walking into a store and having the clerk
>ask "how much do you want to pay?" ...

That might be an appropriate comparison if this were
a help desk. It is not a help desk. (Though you seem
to be treating the responders as though they were your
own personal servants - so perhaps you are confused
about the differences between 'help-desk' and 'usenet').

--
Andrew Thompson
http://www.physci.org/

Message posted via JavaKB.com
http://www.javakb.com/Uwe/Forums.aspx/java-general/200711/1