[personal profile] yesthattom
Here’s an idea for a product that Google could make. Since they seem to have a copy of every file in the world, they could make a WinZip clone that has a new compression scheme whereby if the file is already indexed by Google, they replace the entire file with the URL to the file. Huge ZIP files of porn would compress to nearly nothing. They just need a mechanism for the compressor to say, “Hey, don’t delete this file... EVER”. Sure, stuff you’ve written personally won’t compress as well, but right now I’m backing up my Mac and I see that most of the stuff I have is also available elsewhere.

They could also make a backup scheme based on this.

Date: 2005-02-22 07:13 am (UTC)
From: [identity profile] polydad.livejournal.com
Would *you* trust Google to *never* lose something important?

For backing up porn files, sure. Anything else, maybe not.

best,

Joel. Who also wonders what sorts of creative bookkeeping chicanery would result.

Date: 2005-02-22 07:56 am (UTC)
From: [identity profile] awfief.livejournal.com
Uh... pictures don't necessarily compress to nothing. Especially porn. In fact, from what I recall, the compression group at Brandeis uses a particularly tricky porn photo (a classic, basically: a nekkid woman with a wide-brimmed red hat, with enough texture and whatnot that it's hard to compress without losing data) to test out compression algorithms.

It's not a bad idea. Then again, first they should implement web hosting. They already practically do that, except for pictures. But if Google went in on the deal with one of the photo sites, they'd rule the world (hey, here's space for your website, and don't worry, we've already copied what we can see to it, so your setup is that much easier... and it's only $5 a month).

Meanwhile, Google is just a cache of what's out there. So if someone deletes something that's already out there, then Google will eventually cycle it out of the cache...

Date: 2005-02-22 08:05 am (UTC)
From: [personal profile] jss
So, what about Google + the Internet Archive's Wayback Machine?

Date: 2005-02-22 08:35 am (UTC)
From: [identity profile] yesthattom.livejournal.com
That picture would compress just fine. Google has a file with the exact same bits somewhere, I'm sure.

The issue is getting Google to promise not to delete a file from the cache. That's easy to solve if they're getting paid for it.

Date: 2005-02-22 08:42 am (UTC)
From: [identity profile] awfief.livejournal.com
Uh... are you talking about compression or caching/having a copy? I think I'm lost.

Date: 2005-02-22 09:05 am (UTC)
From: [personal profile] cos
The GoogleCompress he's suggesting is this: Program looks at your file on disk, searches for it in Google, finds a match. Aha! Since Google has the same file already, all we need to save is the URL, nothing more. Go on to next file, repeat. At the end, you get an archive that has, say, 100 files. 10 of them are compressed using normal ZIP compression. The other 90 are just URLs. When it comes time to uncompress, if you ask for one of those files, the program just fetches it from Google, using the stored URL in your GoogleZip archive file.
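
A minimal sketch of that scheme in Python, purely to make the shape concrete. The lookup_url helper is hypothetical (standing in for "ask the index whether it already holds a file with these exact bits"; no such public API exists), and the archive format is invented for illustration:

    import hashlib
    import urllib.request
    import zlib

    def lookup_url(digest):
        # Hypothetical: return the URL of an indexed file whose contents
        # hash to `digest`, or None if the index has never seen those bits.
        # Stubbed out to "not found" here.
        return None

    def compress_archive(paths):
        archive = {}
        for path in paths:
            with open(path, "rb") as f:
                data = f.read()
            url = lookup_url(hashlib.sha256(data).hexdigest())
            if url is not None:
                # The index already has these exact bits: store only the URL.
                archive[path] = {"url": url}
            else:
                # Fall back to ordinary compression for everything else.
                archive[path] = {"zlib": zlib.compress(data)}
        return archive

    def extract_file(archive, path):
        entry = archive[path]
        if "url" in entry:
            # Re-fetch the identical bits from wherever the URL points.
            return urllib.request.urlopen(entry["url"]).read()
        return zlib.decompress(entry["zlib"])

Everything hinges on that stored URL still serving the identical bits later, which is exactly the "don't delete this file... EVER" promise from the original post.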

Date: 2005-02-22 11:01 am (UTC)
From: [identity profile] awfief.livejournal.com
ahhh! enlightenment!

Date: 2005-02-22 12:17 pm (UTC)
From: [identity profile] yesthattom.livejournal.com
Yes, thank you.

Date: 2005-02-22 11:58 am (UTC)
From: [identity profile] yesthattom.livejournal.com
Both. Re-read the original post.

Date: 2005-02-22 08:07 am (UTC)
From: [identity profile] kingfox.livejournal.com
"Huge ZIP files of porn would compress to nearly nothing."
"...but right now I’m backing up my Mac and I see that most of the stuff I have is also available elsewhere."
Mmmmhmmmm.

Date: 2005-02-22 08:26 am (UTC)
From: [identity profile] yesthattom.livejournal.com
MS-Office is a zillion gigs of binaries and help files too :-)

Date: 2005-02-22 08:40 am (UTC)
From: [personal profile] lovingboth
I suggested this, as a university project, back in 1981 or 82!

It wasn't URLs then, but the idea was the same. I got it from the joke about the monks who know all the jokes so well that they can just refer to them by number.

Date: 2005-02-22 09:17 am (UTC)
From: [personal profile] cos
Similar to an example I often use when describing the basic tradeoff of compression to people. Information, quantitatively, is how much unexpectedness there is in the data. Compression tries to use the fewest bits possible to represent the information. The more the compression program knows about the information ahead of time, the better it can compress - as long as the decompressor knows the same things. It becomes a tradeoff between how much information you stuff into the compression/decompression program vs. how much information is in the data being compressed.

For example, a general purpose stream compression program for binary data, like ZIP, can compress any stream of bytes you feed it. If you feed it a random stream, it won't compress very much. If you feed it English text, it will, over the course of reading and compressing that data, adapt to compress English text well. But what if you wrote a compression/decompression program that already knew, ahead of time, that it was going to always be fed English text? It could use strategies that work really well for English - for example, it could look for words, rather than bits or bytes, as its basic unit, and already have predefined codes for the most common words and phrases. Such a compressor would do much better than ZIP on English text, but might be entirely unable to compress some other kinds of data.
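
A toy sketch of that word-level idea, assuming a tiny shared codebook of common English words (a real text compressor would be far more sophisticated, and this one throws away the original spacing and punctuation handling):

    # Both ends agree, ahead of time, on short codes for the most common words.
    CODEBOOK = ["the", "of", "and", "to", "in", "is", "that", "it", "for", "was"]
    CODES = {word: i for i, word in enumerate(CODEBOOK)}

    def encode(text):
        # Known words become small integers; anything else is passed through literally.
        return [CODES.get(word, word) for word in text.split()]

    def decode(tokens):
        return " ".join(CODEBOOK[t] if isinstance(t, int) else t for t in tokens)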

You can take this tradeoff to the extreme: A compression program optimized to compress exactly one message. The decompression program already knows the message and can spit it out. So it doesn't matter how large the message itself is, it can be compressed to nothing.

The example I tend to use is, how about a compression program for Shakespeare plays? It can only compress Shakespeare plays, nothing else. The decompressor has the text of all the plays. You feed the compressor a play, it compresses it to the integer that has been assigned to that one. Transmit the integer to the decompressor, and it emits the correct play.

That's still an extreme, but it illustrates the opposite side of the tradeoff, opposite from general purpose compression. Image-specific compression methods are another example of the same tradeoff.
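
The Shakespeare example above, reduced to code; the play titles and the text files bundled with the decompressor are assumed for illustration:

    # Both sides ship with the full corpus; the only thing transmitted is
    # which play you meant.
    PLAYS = ["Hamlet", "Macbeth", "King Lear", "Othello"]   # ...and the rest
    TEXTS = {title: open(title + ".txt").read() for title in PLAYS}  # assumed files

    def compress(play_text):
        # An entire play collapses to one small integer.
        return next(i for i, title in enumerate(PLAYS) if TEXTS[title] == play_text)

    def decompress(index):
        # The decompressor already knows the message; it just emits it.
        return TEXTS[PLAYS[index]]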

Date: 2005-02-23 04:54 am (UTC)
From: [personal profile] lovingboth
Yep, I did a (de)compression routine for someone's "restaurant guide on a palmtop" - simple LZ77 with a 1k window, but if you prefill the window with text you know is going to be there ('Opening times', 'Nearest station', etc.) then you get much better compression than any general-purpose compressor, plus lightning-fast decompression.
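
zlib exposes the same trick today as a "preset dictionary"; a rough Python sketch, with the boilerplate strings made up for illustration (both sides must agree on the exact preset bytes, just as the palmtop decompressor had to know what the window was prefilled with):

    import zlib

    # Text we expect to recur in every entry, used to prefill the window.
    PRESET = b"Opening times: Nearest station: Credit cards: Price range: "

    def compress_entry(text):
        c = zlib.compressobj(zdict=PRESET)
        return c.compress(text) + c.flush()

    def decompress_entry(blob):
        d = zlib.decompressobj(zdict=PRESET)
        return d.decompress(blob) + d.flush()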

Date: 2005-02-22 08:59 am (UTC)
From: [identity profile] dossy.livejournal.com
This sounds like the idea I used to joke about ... using P2P filesharing networks as a large distributed online backup.

Of course, it's only a joke ...

Date: 2005-02-28 08:16 am (UTC)
From: [identity profile] dglenn.livejournal.com
And a very long time ago (my God, has it really been two decades?!) we used to joke about Usenet backup: back when mail, as well as news, was mostly transmitted via UUCP in a store-and-forward fashion, and we used "bang path" addressing (that is, you needed to know what machines your mail had to go through, not just your friend's username and host-id), it was suggested that you could just mail all the important chunks of your filesystem to yourself via a fairly long path, and it'd take about a week to come back to you (using other folks' disk space along the way). If you lost a file, as long as you didn't need it back today, you'd rest easy knowing it was coming back around again in a few days.

I'm not sure whether anybody ever actually tried it. (If so, I'm sure their neighbours wouldn't have been pleased if they'd figured out what was going on.)

Ah, the olden days, when I was ..!seismo!dolqci!hqhomes!glenn

Date: 2005-02-22 12:56 pm (UTC)
From: [identity profile] mycroft.livejournal.com
This is fairly similar to the backup strategy Permabit was working on...

Date: 2005-02-22 02:21 pm (UTC)
From: [identity profile] tactisle.livejournal.com
And once GoogleCompress is in full swing, they could inaugurate GoogleErase(tm)!

After all, if you can make someone hang onto information for a price, why not pay the same price to have them remove it and never re-archive it? They get the same money, without all the storage costs. And hey, there'll be a huge market for it in a few short years, as yesterday's AOL-enhanced high schoolers become tomorrow's public-service exposé fodder.

Oooh, just imagine, an eBay-like auction front end, where people can place competitive bids based on how important it is to them that their information live or die! It'd be, like, a totally democratized Ministry of Truth!

Date: 2005-02-23 11:57 pm (UTC)
From: [identity profile] grendelgongon.livejournal.com
*guffaw* Love it!
