COMPUTER PROGRAM FOR COMPRESSING FILES OF DIFFERENT TYPES
0. Introductory remarks
I realized this idea about 20 years ago, when I still worked on a 286 computer, in DOS, using a Pascal translator from 1986, if I am not mistaken. And the program worked, and still works for me, but when everybody began to use Winzip, in the end I also moved to it. I had written it, for one thing, in order to make better use of the diskettes (my files could not fit on them and had to be split into parts), and, for another thing, in order to experiment with my hashing, because in this translator it was impossible to allocate more than 48 KB for all variables, and I needed one array of at least 64 KB, but better twice as big, and as much again for other arrays. That is, this was a challenge for me, but not only this. I decided to use one outright anti-scientific idea: to do the compressing without analyzing the type of the file (there are so many of them, and new ones emerge all the time; I did not intend to study them all, nor to compete with branded products), but to compress the file to the size which ... the very God (or the nature of the file) allows, by coding of the characters!
I will explain this further on, but at the moment I want to add that, in spite of all possible difficulties with the hashing (I think I defined this main array at the order of several KB, i.e. at least ten times smaller than really necessary), the program worked pretty well: for text files it gave compression of about 45% (where Winzip gave somewhere around 55%), and for some other types, where there are images, even down to 15% of the original file. What isn't good, however, is that it works terribly slowly, because it makes several iterations, reads everything byte after byte, compares them, forms some arrays, then processes these arrays with the information from the entire file, and if the file is of the order of 100 - 200 KB (as it was in the beginning, in DOS) then everything finishes in mere seconds, but if the file was, say, 2 MB, then sometimes a whole ... hour was necessary (the dependence on the length of the file is in no way linear; it felt closer to exponential). So, as far as I remember, I informed either an office of IBM or some other company, but decided that this was enough: the program is mine, it works, but I don't intend to spread it further, because nowadays files grow very fast, and if this is video information, then who knows how long the program will run.
All this is so, but ... Now see, with me there are often "buts", especially with nontraditional approaches. The point here is that if there were no hashing, if I could allocate a whole MB for variables, and more than that, make virtual arrays for the files (of, say, a pair of MBs) and perform block reading into them, then this could easily have reduced the time by a whole decimal order, if not more. But this is not all, because my output file looked like ... something out of a meat mincer: I changed the very bytes, formed my own "bytes" (or words, putting it more correctly), and this could have given very good ciphering (encryption)! For people still search for any possibility of better ciphering; it turns out this is needed for providing better security of bank operations, where over-large prime numbers are used (say, with several hundreds or even thousands of digits); finding the prime factors of their products is quite a time-consuming operation, but the check is nearly trivial (via multiplication, although not the usual floating-point kind, but integer multiplication with an unlimited number of digits).
So, you see, I think there is still some "bread" in my "crazy" idea (as the Bulgarians like to say), and it is better to explain it in general outline, to bring it to the knowledge of everybody who is interested, than to take it with me to the grave. Only I will purposely not look at the exact realization but narrate from memory. This will preserve some know-how for me, and in any case, if the program has to be moved to a new programming environment, it must be rewritten anyway (i.e. written anew). Besides, such a way of presenting dry professional matters will make them accessible to everybody, as it should be on a literary site. Well, this is enough as an introduction.
1. My idea about compressing the file to the limit allowed by its very nature
Everything began with my noticing that far from all possible symbols are used in a given file. For example, if it is a text file in Latin, then usually about 50 possible characters (and let me shorten this to chars) remain free, and even if it is in Cyrillic (in DOS, of course, where the char table has 256 chars, but in the first Windows, too) about 10-20 chars still remain; only .exe files use nearly 100 percent of all possible chars. These free chars can be used. How? Well, for coding of sequences of chars, and exactly the most often used ones! This is always done in my zero pass. But this, obviously, is a drop in the ocean; I do it only so as not to leave anything unused. Further, I simply extend the size of the bytes and make my own "bytes", of 10 bits, or of 12 (or more, yet the experiments have shown that 16-bit "bytes" have too big dimensions for anything to be saved by their usage). If my new bytes are of 10 bits, then when the first 2 bits are "00" these are the old chars, and there remain 3 more variants, i.e. 3 additional char tables for coding! This is something decent. Now it remains only, for each given file, to find what more has to be added, what new symbols must be formed, and to add one such table in every pass, analyzing sequences of old and new symbols. But you see already that the bytes are shifted and it is not known what every real byte means, so a repetition of some bytes does not mean a real repetition of symbols; here things get thoroughly mixed.
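For illustration, here is a minimal sketch in Python (not the original Pascal program; the function names are invented for this example) of how such 10-bit "bytes" could be packed into and read back from ordinary 8-bit bytes, with codes 0-255 standing for the old chars and 256-1023 left for the three additional tables:

```python
CODE_BITS = 10                          # assumed width of the new "bytes"

def pack_codes(codes):
    """Pack a list of 10-bit codes into ordinary bytes, padding the tail with zeros."""
    buf, nbits, out = 0, 0, bytearray()
    for c in codes:
        buf = (buf << CODE_BITS) | (c & ((1 << CODE_BITS) - 1))
        nbits += CODE_BITS
        while nbits >= 8:
            nbits -= 8
            out.append((buf >> nbits) & 0xFF)
    if nbits:                           # complement to a whole byte, as in the text
        out.append((buf << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack_codes(data, count):
    """Read back `count` 10-bit codes from the packed bytes."""
    buf, nbits, codes = 0, 0, []
    for b in data:
        buf = (buf << 8) | b
        nbits += 8
        while nbits >= CODE_BITS and len(codes) < count:
            nbits -= CODE_BITS
            codes.append((buf >> nbits) & ((1 << CODE_BITS) - 1))
    return codes

# Old bytes map to themselves (prefix "00"); codes 256+ would index the new tables.
plain = list(b"ABBA")
assert unpack_codes(pack_codes(plain), len(plain)) == plain
```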
So, what to add? Well, whatever is necessary for each individual file, whatever occurs in it most often, roughly speaking. In theory I wanted to do an analysis of all possible sequences of symbols, at least within reasonable limits, i.e. first of 2-byte sequences (which does not mean dividing the file into couples of bytes, but taking all possible such couples, i.e. n and n+1, n+1 and n+2, and so on), then of 3 bytes, then of 4, and maybe of five. After this, somehow weigh the number of occurrences according to their length and choose the most often met sequences, which are to be coded by adding new symbols for them, only that one such symbol will now mean two (or more) initial symbols. Do you see where the economy can hide, and why it depends on the very file, not only on its type but on the concrete file? And really, with the hashing I inserted intermediate printouts of the maximal repetitions of the pairs, and in the beginning they were, it seems to me, several thousand (if I am not mistaken; this could only be so if I used 2-byte numbers for the counters, but maybe I used 1-byte ones), and by the end of a new char table they were about 10 times fewer.
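As a small illustration of this pair counting (a Python sketch with invented names, not the real hashed implementation), the statistics over all overlapping couples n and n+1 can be gathered like this:

```python
from collections import Counter

def pair_frequencies(codes):
    """Count every adjacent couple (n, n+1), (n+1, n+2), ... of the code stream."""
    return Counter(zip(codes, codes[1:]))

def best_pairs(codes, how_many=256):
    """The `how_many` most frequent couples, in order of diminishing use."""
    return [pair for pair, _ in pair_frequencies(codes).most_common(how_many)]

text = list(b"this is is is a test")
print(best_pairs(text, 3))          # e.g. [(105, 115), (115, 32), (32, 105)]
```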
Soon, however, I decided that this was unnecessarily complicated and that in this way I would increase the time by a pair of decimal orders (!), so I kept only the 2-byte sequences, but in return the existence of several passes became obligatory, so that with 3 passes, if the file contains, say, series of spaces, then in the end I will substitute 2^3 = 8 such spaces with one new symbol. In addition, with some approximation, substituting two adjacent symbols several times over must be equivalent to substituting a group (4-5) of symbols. This is why the number of passes is necessary; it would not be good to add all possible new chars in one single pass.
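A toy demonstration of how the passes compound (again only an illustration, not the program itself): replacing one frequent couple per pass turns 2^3 = 8 spaces into a single symbol after 3 passes.

```python
def substitute_pair(codes, pair, new_symbol):
    """One pass: replace every non-overlapping occurrence of `pair`."""
    out, i = [], 0
    while i < len(codes):
        if i + 1 < len(codes) and (codes[i], codes[i + 1]) == pair:
            out.append(new_symbol); i += 2
        else:
            out.append(codes[i]); i += 1
    return out

codes = [32] * 8                                  # eight spaces
codes = substitute_pair(codes, (32, 32), 256)     # pass 1: 4 codes left
codes = substitute_pair(codes, (256, 256), 257)   # pass 2: 2 codes left
codes = substitute_pair(codes, (257, 257), 258)   # pass 3: a single code
print(codes)                                      # [258]
```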
So, and how, after all, must the sequences be analyzed? By ordinary counting of the number of their occurrences, of course, so that, in principle, if we want to drop the hashing altogether, we must maintain an array indexed by 2*newbytelength bits, which in the variant I used most (because the experiments showed it was better so), of 10 bits, would have given an index of 20 bits, i.e. 1M entries, and with 2 bytes for the integer counters this makes 2 MB. The dimension of the new "bytes" must be applied also to the old bytes, so that in the second pass it is possible to use, together with the old bytes, also the new, elongated "bytes" now representing 2-byte sequences. In this situation it turns out that my hashing really used (I said that I deliberately don't want to look at how it was in reality) less than one percent of the real arrays, and for this reason the time I consumed is so overwhelming, and the economy, too, is not so big (i.e. if the beginning of the file is not very representative of the entire file, then the hash table becomes full very soon, and further on, even if more frequently met combinations appear, they are simply skipped; there is no more space to place their counters).
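The sizing mentioned above can be checked on the back of an envelope (a Python sketch under the assumption of 10-bit codes and 16-bit counters): two 10-bit indices give 2^20 counters, i.e. 2 MB.

```python
import array

CODE_BITS = 10
# 2^(2*10) zeroed 16-bit counters -- one for every possible couple of codes.
counters = array.array('H', bytes(2 * (1 << (2 * CODE_BITS))))
print(len(counters), counters.itemsize * len(counters))   # 1048576 2097152 (2 MB)

def count_pair(first, second):
    counters[(first << CODE_BITS) | second] += 1          # direct indexing, no hashing
```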
Further, even if it is possible to allocate space for arrays with a dimension of a pair of MB, it is hardly reasonable to do so, because these arrays have to stay in the operating memory all the time, otherwise we will save nothing regarding the time, and virtual arrays are also needed for the files themselves (for portions of them, for some buffers), because they are thoroughly scanned, byte after byte, and this in the new limits and sizes of bytes. So some hashing will maybe always be necessary, but then for about one quarter of all possible combinations, not for ridiculously small portions.
2. Forming the new files
When sufficient statistics are accumulated about the number of occurrences of all sequences of 2 new bytes, this array is ordered and the first 256 pairs are separated from it, in order of diminishing use. If I am not mistaken, I don't order the entire array, but 256 times find the pair which occurs more often than the others (a search for the maximum value of the counter field) and write it into a temporary table of new symbols. When it is known what must be changed, I go through the entire file one more time and, reading all the "bytes", substitute them or not with new ones and rewrite them into the new variant of the file, only that before all this I put the very table with the new symbols, and a pair of other indicators, like the number of the cycle. With some trick I manage to use only two files, alternating them: I read from the one and write to the other, and then, in the next pass, vice versa. Well, and it is necessary to pay special attention to padding the new file to a whole number of bytes (with the usual zeros), so as to avoid a possible error when reading the new non-byte "bytes". Naturally, the tables of all i-1 previous iterations are kept too (either after the current one, or at the end of the file), but each iteration works with "bytes" of the new length. It makes no sense to provide for finishing an iteration or two earlier, because the place for the new table is established in advance. In this way, after executing the given iteration, the files change their places, the arrays are zeroed, the just-written file is read again, its symbol table at the beginning is detached and rewritten at the proper time into the new file, and the main part of the file without the tables is analyzed anew for the frequency of occurrence of two consecutive new "bytes", in the same way as in the previous iteration; then a new table is formed again, the file is coded again, and written.
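In the spirit of the above, one whole iteration might look roughly like this (a simplified Python sketch of my reading of the procedure; the bit packing, the two alternating files and the cycle indicators are left out):

```python
from collections import Counter

def one_iteration(codes, table_base):
    """Pick the 256 most frequent adjacent couples and return (table, recoded stream)."""
    pairs = [p for p, _ in Counter(zip(codes, codes[1:])).most_common(256)]
    new_code = {pair: table_base + i for i, pair in enumerate(pairs)}
    out, i = [], 0
    while i < len(codes):
        pair = tuple(codes[i:i + 2])
        if len(pair) == 2 and pair in new_code:
            out.append(new_code[pair]); i += 2
        else:
            out.append(codes[i]); i += 1
    return pairs, out          # the table goes in front of the recoded stream

data = list(b"ananas and bananas")
table1, codes1 = one_iteration(data, 256)     # first pass: new codes 256..511
table2, codes2 = one_iteration(codes1, 512)   # second pass: new codes 512..767
print(len(data), len(codes1), len(codes2))    # the code stream shrinks each pass
```

Of course, in reality the prepended tables take space of their own, so the gain shows up only when the file is long enough.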
Exactly this information at the beginning of each iteration of the file, containing the latest table with 256 new signs, is used during the deciphering or decoding of the file by the decoding program, which works significantly faster, because it only reads portions of bits and either substitutes them with the new sequence or leaves them as they are. So the working time is not such a significant factor, because the deciphering (or expanding) of the file goes pretty fast. And since exactly 256 new chars are always added here, and since the number of the iteration is established by the first bits, only the pair of old symbols (from the previous iteration) is written, and not the new symbol it is to be substituted with (that is unnecessary).
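The decoding direction can be sketched like this (an illustration with invented names): every table maps a new code back to the couple of codes of the previous iteration, so the expansion is just a repeated lookup until only original bytes remain.

```python
def decode(codes, tables):
    """`tables[k]` holds the 256 couples introduced on iteration k (codes 256*(k+1)+i)."""
    expansion = {}
    for k, table in enumerate(tables):
        for i, pair in enumerate(table):
            expansion[256 * (k + 1) + i] = pair
    out, stack = [], list(reversed(codes))
    while stack:
        c = stack.pop()
        if c < 256:
            out.append(c)            # an original byte
        else:
            a, b = expansion[c]
            stack.extend((b, a))     # expand, and keep expanding if needed
    return bytes(out)

tables = [[(97, 98)], []]            # toy tables: code 256 means the couple "ab"
print(decode([256, 99, 256], tables))    # b'abcab'
```

Fed with the tables and the stream from the previous sketch, the same function would restore the original bytes.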
If, though, these tables are written to a separate file, then they can also be used for encrypting the file, which we will consider in the next subsection.
3. Variant of encrypted compression
Because without the tables of new symbols even the devil, as is said, cannot understand anything in the file, it is clear that if they are sent separately the security of the compressed file can be preserved. Yet I have in mind not the kind of security where one program (the decoding one) can, after all, decode various files (having their new tables at its disposal), but new personal ciphers for every user! This can be done elementarily if, before writing the table, a simple swapping of several of its entries is performed. Since the table entries are 256, i.e. 2 hexadecimal digits each, a record of, say, 6FA7 will mean that the 111-th element of the table (a pair of symbols, with the necessary number of bits, for the given iteration) has to be exchanged with its 167-th element, and vice versa. This is really a cipher, so the decoding program can be made universal and there is no need to write the tables with symbols to separate files; instead these swappings, for each iteration, are to be sent secretly, not by email, but in a separate envelope, something of the kind. Then the decoding and decompressing program can simply ask for strings of hexadecimal numbers, say 4 quadruples per iteration, or 12 such quadruples of digits like the one cited above in all, and when one enters them (or copies them from some file on his computer), then everything will be OK, but if he does not enter them, then it will be assumed that there are no swappings, and in that case some bit-mince will be obtained, as I said in the beginning.
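A small sketch of such swapping (illustration only; the helper name is invented): each hexadecimal quadruple, like the 6FA7 above, exchanges entry 0x6F (111) of the 256-entry table with entry 0xA7 (167).

```python
def apply_swaps(table, key):
    """Swap table entries in place according to a string of hexadecimal quadruples."""
    for quad in key.split():
        i, j = int(quad[:2], 16), int(quad[2:], 16)
        table[i], table[j] = table[j], table[i]
    return table

table = list(range(256))                # stand-in for the 256 pairs of one iteration
apply_swaps(table, "6FA7 0010 3C3D")    # three swappings for this iteration
print(table[0x6F], table[0xA7])         # 167 111 -- the two entries are exchanged
```

To undo the swaps on the receiving side, the same quadruples are applied in reverse order (for disjoint swaps the order does not matter).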
The possibilities for changing, in theory, for 10-bit "bytes" and 3 iterations, are not merely 3*256 but many more, because several swappings involving the same byte can be done, which results in cyclical changes. Generating the ciphers in a random way (for example, making a program that starts and keeps generating random hexadecimal numbers until someone hits the necessary key) will allow absolute uniqueness of these ciphers. The only minus of this method is that remembering these sequences will also be difficult, and one will be forced to look them up somewhere, which means that the cipher can be found. But it is always so with ciphers, isn't it? I have not worked with such things but suppose that it is so. Another variant is to work with a running code from some book, if some method is established for converting letters (from any alphabet) into hexadecimal numbers, which isn't so difficult (say, the remainder modulo 16 is the simplest way; the fact that there will be some repetitions, as I said, does not matter). And if these are video files, then the compression will reach a decimal order, if not more, and the decoding time, on contemporary computers, will not be significant.
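One possible way to generate such a personal cipher (my assumption of the details, using a random generator instead of waiting for a keypress):

```python
import secrets

def random_key(quadruples=12):
    """Twelve random hexadecimal quadruples, e.g. four per iteration for 3 iterations."""
    return " ".join(secrets.token_hex(2).upper() for _ in range(quadruples))

print(random_key())     # e.g. "6FA7 0010 3C3D ..." -- to be sent in a separate envelope
```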
But -- and this is my favorite word, at least in this material -- if the char tables are already 2-byte (I am not 100 percent sure of this, but "feel in my guts" that it is so), and for video information, too, words longer than a byte have been used for a long time, then yet another alternative method of working is possible, which we will discuss in the next point.
4. Coding in new character tables
At the beginning here I want to remark that if the tables are already 2-byte, then even in the classical variant it is worth, already on the zero pass, swapping the bytes within every second 2-byte word, i.e. so that the bytes alternate: I II, II I, I II, II I, ... . In this way, as I suppose, a massive compression will be obtained among the first bytes, while the very tables will sit in the second bytes. This can lead to significantly higher compression of the files, and if they are not 2-byte (which can sometimes happen, I am speaking about files of any type), then this will not hinder the program; the comparison will simply be done byte by byte. I personally have been meaning for a long time to try this in my program, but can never find the necessary time: ever since I began to publish myself on the Internet, and also to translate myself, I have no time to spend on programmer's work (more so with translators from the year '86 of the last century).
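A tiny sketch of this zero-pass trick as I read it (illustration only): with 2-byte characters, swapping the bytes of every second word puts the equal high bytes next to each other, where the later passes can pick them up.

```python
def alternate_swap(data):
    """Swap the two bytes of every second 2-byte word: I II, II I, I II, II I, ..."""
    out = bytearray(data)
    for i in range(2, len(out) - 1, 4):
        out[i], out[i + 1] = out[i + 1], out[i]
    return bytes(out)

# In UTF-16 text the high (here zero) bytes of adjacent words become neighbours.
sample = "abcd".encode("utf-16-le")        # 61 00 62 00 63 00 64 00
print(alternate_swap(sample).hex(" "))     # 61 00 00 62 63 00 00 64
```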
But in this subsection I mean that, when the tables are already extended like this, there is no need to extend the words (at least for text files, but I suspect for video ones too, because with colour coded in 24 bits it turns out that the words are even 3-byte); in this case they are simply used very sparsely, and it can happen that room is left for thousands and more new symbols! How to find this out? Well, one simply needs to work with 2-byte words and do the whole analysis at the level of 2 bytes, which brings with it the necessity of compulsory hashing and of work with words twice as long as here. This at once increases the dimension of the task, but, as I said, it might not become necessary if the simple trick with swapping the bytes within a word is applied. This must be experimented with somehow.
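How sparsely a concrete file uses the 2-byte code space can be checked directly (a short illustration; the unused words are the room for the new symbols):

```python
def unused_words(data):
    """How many of the 65536 possible 2-byte words do not occur in the data."""
    used = {int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data) - 1, 2)}
    return 65536 - len(used)

text = ("Plain text, even widened to two-byte words, "
        "touches only a tiny corner of the 65536 codes.").encode("utf-16-le")
print(unused_words(text))      # 65536 minus the few distinct characters used
```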
And everything depends on the size of the file. If the files are of MBs and more, and are video files, then although the adjacent points will be very similar, still, in a big file all possible colours could be used. But maybe for audio files there will not be so many variants? The important thing, in my opinion, is to choose one universal method, so that it can be applied in all cases with quite good results, not to seek perfection; then the ciphering, too, will make sense. Ultimately some limit can be set (say, of 1, or 10 MB) and each file processed in such portions and coded separately, although saved in one file.
Well, there are variants, and the program can be modified, but the significant moment is that I propose a quite workable, and not bad, idea; a precedent exists. Only I personally don't intend to interrupt my literary and translating activity, so I don't have in mind to spend much time on programming experiments. If some company takes it into its head that they need me, then let them first send me 2,000 euros just for the realized idea, and we shall see further.
Dec 2014