this post was submitted on 27 Sep 2023
121 points (97.6% liked)

all 22 comments
[–] [email protected] 42 points 1 year ago* (last edited 1 year ago) (4 children)

Say you compress some data using one of these LLMs: how hard is it to decompress the data again without access to the LLM that performed the compression? Will the compression "algorithm" used by the LLM be the same for every run (which means you could probably reverse engineer it to create a decompressor program), or will it be different every time it compresses new data?

I mean, having to download a huge LLM to decompress some data, which probably also requires a GPU with lots of VRAM, seems a bit much.

[–] AbouBenAdhem 27 points 1 year ago (1 children)

Skimming through the linked paper, I noticed this:

Scaling beyond a certain point will deteriorate the compression performance since the model parameters need to be accounted for in the compressed output.

So it sounds like the model parameters needed to decompress the file are included in the file itself.
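One way to read the quoted sentence is as an "adjusted" compression rate in which the size of the model counts against the result. Here is a minimal sketch of that arithmetic. The function name `adjusted_rate` and every byte count below are made up for illustration and are not taken from the paper.

```python
def adjusted_rate(compressed_bytes: int, model_bytes: int, raw_bytes: int) -> float:
    """Compressed size plus model size, as a fraction of the raw size.

    Values below 1.0 mean the data got smaller overall.
    """
    return (compressed_bytes + model_bytes) / raw_bytes


# Hypothetical numbers, not taken from the paper.
raw = 1_000_000_000        # 1 GB of raw data
compressed = 100_000_000   # compressed to 100 MB by some model

small_model = adjusted_rate(compressed, 50_000_000, raw)     # 50 MB model
large_model = adjusted_rate(compressed, 2_000_000_000, raw)  # 2 GB model

print(f"small model: {small_model:.2f}")
print(f"large model: {large_model:.2f}")
```

Under these assumed numbers, the larger model ends up above 1.0 once its own size is counted, which is the "scaling beyond a certain point" effect the quote describes.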

[–] [email protected] 7 points 1 year ago (1 children)

So, you'll have to use the same LLM to decompress the data? For example, if your friend sends you an archive compressed with this LLM, you won't be able to decompress it without downloading the same LLM?

[–] [email protected] 6 points 1 year ago (1 children)

This is not dissimilar to regular compression algorithms. If I compress a folder using the 7z format (.7z), the end user needs software that supports that format, such as 7-Zip, to decompress it. (I know Windows 11 is getting 7z support.)

[–] [email protected] 6 points 1 year ago* (last edited 1 year ago) (2 children)

Except LLMs tend to be very big compared to standard decompression programs and often require a GPU with adequate VRAM in order to work reasonably fast. This is a very big usability issue IMO. If decompression could be done with a smaller and faster program (maybe also generated by the LLM?), it could be very useful and see pretty wide adoption (e.g. for future game devs who want to reduce their game size from 150GB to 130GB).

[–] [email protected] 3 points 1 year ago

I don't know how this would apply to decompression models in practice, but in general, deep learning is VRAM-intensive only during the training process. That's because training processes multiple batches of data at once for generalization, and all those batches need to be stored in memory.
But once the model is trained, the end user is only going to input data one item at a time, so VRAM usually is not an issue. There are also lightweight models that are designed to run on lower-end hardware.

[–] [email protected] 2 points 1 year ago

Training tends to be more compute-intensive, while inference is more likely to be runnable on a smaller hardware footprint.

The neater idea would be a standard model or set of models, so that a 30G program can be used for ~80% of target cases. Games and video seem like good candidates for this.

[–] YellowBendyBoy 22 points 1 year ago (1 children)

It probably is more like the LLM is able to „pack the truck much more efficiently“ and decompression should be the same.

But I agree that the likely use-case of uploading all your files to the cloud, having it compress your files, and downloading the result, which is a few kB smaller, isn’t really practical, time-efficient, or even needed at all.

[–] [email protected] 11 points 1 year ago (2 children)

Correct me if I'm wrong, but don't algorithms like Huffman or even Shannon-Fano code with blocks already pack the files as efficiently as possible? It's impossible to compress a file beyond its entropy, and those algorithms get pretty damn close to it.
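A quick way to probe the entropy claim is to measure it. The sketch below assumes "entropy" means the per-byte (order-0) entropy, where every byte is treated as independent, and compares that figure with what Python's standard `zlib` module produces. The helper name `order0_entropy_bits` and the sample data are illustrative only.

```python
import math
import zlib
from collections import Counter


def order0_entropy_bits(data: bytes) -> float:
    """Total bits needed if every byte is coded independently at its entropy."""
    counts = Counter(data)
    n = len(data)
    return -sum(c * math.log2(c / n) for c in counts.values())


data = b"abcd" * 1000  # 4000 bytes, four equally frequent symbols

# Four equally likely symbols cost 2 bits each under this simple model.
bound_bytes = order0_entropy_bits(data) / 8
zlib_bytes = len(zlib.compress(data, 9))

print(f"order-0 entropy estimate: {bound_bytes:.0f} bytes")
print(f"zlib output:              {zlib_bytes} bytes")
```

The zlib output comes in well under the order-0 figure because it models repetition rather than single bytes. The entropy limit is real, but it is always relative to the statistical model being used, so a better model of the data lowers the bar.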

[–] [email protected] 9 points 1 year ago* (last edited 1 year ago) (1 children)

We're likely talking about lossy compression here

[–] [email protected] 15 points 1 year ago (1 children)

That was my first thought as well, but it doesn't seem to be the case:

In their study, the Google DeepMind researchers repurposed open-source LLMs to perform arithmetic coding, a type of lossless compression algorithm.
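For anyone wondering how a predictive model turns into a lossless compressor: an arithmetic coder spends roughly -log2(p) bits on a symbol that the model predicted with probability p, so better predictions mean fewer bits. The sketch below only adds up those ideal costs and does not implement the coder itself. The tiny count-based model is a stand-in for an LLM, and `code_length_bits` is a hypothetical helper, not code from the paper.

```python
import math
from collections import defaultdict


def code_length_bits(data: bytes, context_len: int) -> float:
    """Ideal arithmetic-coding cost of `data` under a simple adaptive model.

    The model predicts each byte from the previous `context_len` bytes using
    counts with add-one smoothing. It only looks at bytes already seen, so a
    decoder running the same model can rebuild the same probabilities.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    for i, byte in enumerate(data):
        ctx = data[max(0, i - context_len):i]
        # Add-one smoothing over the 256 possible byte values.
        p = (counts[ctx][byte] + 1) / (totals[ctx] + 256)
        bits += -math.log2(p)
        counts[ctx][byte] += 1
        totals[ctx] += 1
    return bits


text = b"the quick brown fox jumps over the lazy dog. " * 200

weak = code_length_bits(text, context_len=0)    # ignores context
strong = code_length_bits(text, context_len=3)  # uses the 3 previous bytes

print(f"raw size:       {len(text)} bytes")
print(f"no context:     {weak / 8:.0f} bytes")
print(f"3-byte context: {strong / 8:.0f} bytes")
```

The model with more context predicts better and so needs fewer bits. This also bears on the question further up the thread: the decoder has to run exactly the same model, deterministically, to reproduce the probabilities the encoder used.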

[–] [email protected] 5 points 1 year ago

... and this is why I should actually read the articles before commenting lol

[–] [email protected] 5 points 1 year ago* (last edited 1 year ago) (1 children)

Correct me if I’m wrong

Well actually, yes, I'm sorry to have to tell you that you are wrong. Shannon-Fano coding is suboptimal for prefix codes, and Huffman coding, while optimal for prefix-based coding, is not necessarily the most efficient compression method for any given data (and often isn't).

Huffman can be optimal given certain strict constraints, but those constraints don't always occur in natural/real-world data.
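One such constraint is that the symbol probabilities are all powers of 1/2. The sketch below builds Huffman code lengths with the standard `heapq` module and compares the average code length with the entropy for two made-up distributions. The function names and the probabilities are illustrative assumptions.

```python
import heapq
import math


def huffman_code_lengths(probs: dict[str, float]) -> dict[str, int]:
    """Return the Huffman code length in bits for each symbol."""
    # Heap items: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    tie = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
        tie += 1
    return lengths


def report(probs: dict[str, float]) -> tuple[float, float]:
    """Return (average Huffman bits per symbol, entropy in bits per symbol)."""
    lengths = huffman_code_lengths(probs)
    avg = sum(probs[s] * lengths[s] for s in probs)
    entropy = -sum(p * math.log2(p) for p in probs.values())
    return avg, entropy


dyadic = {"a": 0.5, "b": 0.25, "c": 0.25}  # all powers of 1/2
skewed = {"a": 0.9, "b": 0.1}              # far from powers of 1/2

dyadic_avg, dyadic_entropy = report(dyadic)
skewed_avg, skewed_entropy = report(skewed)

print(f"dyadic: {dyadic_avg:.3f} bits/symbol vs entropy {dyadic_entropy:.3f}")
print(f"skewed: {skewed_avg:.3f} bits/symbol vs entropy {skewed_entropy:.3f}")
```

In the dyadic case Huffman matches the entropy exactly. In the skewed case it cannot spend less than one whole bit per symbol, even though the entropy is below half a bit. Arithmetic coding closes that gap because it is not limited to whole-bit codes per symbol.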

The best compression method (whether lossless or lossy) depends greatly on the nature of the data to be compressed. Patterns and biases can make certain methods much more efficient (or more practical) in some cases, when they might be useless elsewhere or in general. This is why data is often transformed before compression, using a reversible transformation that "encourages" certain desirable statistical characteristics in the data, so the compression method can better exploit them.

For example, compression software may transform the data before applying Huffman coding to get a better compression ratio: bzip2 performs a Burrows-Wheeler transform, while gzip first applies LZ77 matching. If Huffman coding were an optimal compression method for all possible data, this would be redundant! Often (e.g. in medical imaging or audio/video data), the data is best analysed in a different domain, such as the frequency domain instead of the time/spatial domain, to better reveal the underlying patterns and redundancies so they can be easily exploited by compression.
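To make the "reversible transform first" idea concrete, here is a small sketch. It uses delta coding rather than a Burrows-Wheeler transform, simply because it fits in a few lines, and it feeds the result to Python's standard `zlib` module. The synthetic signal and the helper names are assumptions for illustration.

```python
import random
import zlib


def delta_encode(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) % 256)
        prev = b
    return bytes(out)


def delta_decode(data: bytes) -> bytes:
    """Invert delta_encode by accumulating the differences (mod 256)."""
    prev = 0
    out = bytearray()
    for d in data:
        prev = (prev + d) % 256
        out.append(prev)
    return bytes(out)


# Synthetic "sensor" signal: a slow random walk, seeded for repeatability.
rng = random.Random(0)
value = 128
samples = bytearray()
for _ in range(20_000):
    value = (value + rng.choice((-1, 0, 1))) % 256
    samples.append(value)
signal = bytes(samples)

plain = len(zlib.compress(signal, 9))
transformed = len(zlib.compress(delta_encode(signal), 9))

print(f"zlib alone:       {plain} bytes")
print(f"delta, then zlib: {transformed} bytes")
```

The transform itself saves nothing, since the output is the same length as the input. It only reshapes the statistics so that the values cluster around a few symbols, which the entropy coding stage can then exploit. Because `delta_decode` undoes it exactly, the whole pipeline stays lossless.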

[–] [email protected] 3 points 1 year ago

No need to be sorry, I am well aware I can be wrong, and I'd rather learn something new than be bashed for being wrong.

Maybe I phrased it differently from how I thought about it. I didn't mean to claim that Shannon-Fano or Huffman are THE most efficient ways of doing it, but rather that, compared to the massive overhead of running an LLM to compress a file, the current methods are way more resource-efficient, even one as obsolete as Shannon-Fano codes.

I should probably have mentioned an algorithm like LZMA, or gzip, like you did.

[–] [email protected] 37 points 1 year ago

This is totally going to turn into another JBIG2 lossy compression clusterfuck isn't it...

For those who are unfamiliar, JBIG2 is a compression standard that has a dubious reputation for replacing characters incorrectly in scanned documents (so 6 could become an 8, for example) leading to potentially serious issues when scanning things like medical and legal documents, construction blueprints, etc.

[–] flames5123 32 points 1 year ago (1 children)

Nice, but what’s the Weissman score?

[–] [email protected] 26 points 1 year ago (1 children)

So the Pied Piper company is actually going to start

[–] [email protected] 4 points 1 year ago

It'll be interesting to see if this gets used in places where scarce bandwidth justifies the cost of dedicated hardware. Video calls to Antarctica, shipping vessels, airplanes, space, etc. At least that's what comes to mind. I could also see a next iteration of CDNs using it, if the numbers check out.