Author: Мирский Христо

04. Computer Program For Splitting Of Words Of Different Languages

  • Abstract:
    This describes an old idea of mine, which I also implemented: splitting the words of different languages without any dictionaries. The program worked decently well, but it was written long ago for the DOS operating system. The essence was to recognize consonants and vowels and to apply some natural rules (formalized by me) that hold in many languages.
    Keywords: programming, splitting of words, various languages, heuristic rules, DOS, in English.

      




COMPUTER PROGRAM FOR SPLITTING OF WORDS OF DIFFERENT LANGUAGES


     This is not just an idea. I implemented it, used it for some time, and was quite satisfied with it, so a precedent exists. That matters, because it is then known that the thing is possible, rather than searching for something without knowing whether it can be found at all (as, for example, with the existence of God). The program worked, but that was about 20 years ago (well, at least 15), and under DOS; with the emergence of Windows everything became more complicated, and I stopped wasting time on it (and on trying to find the necessary software for free). On a new platform everything has to be written anew in any case; I also have the right to keep the details to myself; and I am publishing this on a popular site where it is not customary to give fragments of programs in algorithmic languages. For all these reasons I will not try to dig out the program and study it, but will describe everything from memory. Still, I am usually verbose, so a willing programmer will be able to concoct something similar. It will not be so easy, though, because under Windows one must work at the level of pixels, not characters, and must take all fonts into account, and so on. I will return to this point at the end, because such a capability is simply obligatory for every browser and for the screen of every mobile device (even a phone, if it has enough memory).
     What must be stressed is that the program has to work satisfactorily (and mine did) for a mixture of languages even within one file, or at least for the pair of languages one usually works with. It must be easy to adjust when the languages in question have contradictory conventions for splitting words, and also to reflect the preferences of the user; it must justify the text to the right margin of the page; and there are other points besides. My variant worked stand-alone, as a separate program: it was given the name of an input file (.txt) and the name of an output file (.prt), transformed the first into the second, and then it only remained to print the latter. I worked only with the Cyrillic and Latin alphabets. If one wants to split words in, say, Arabic, there may be other difficulties, but for the Greek alphabet in particular there should be no special problems (at first sight; I have not thought about it). However, because people have quite different views on word splitting in different languages, I must begin with a couple of paragraphs about that.

     1. Basic notions about syllables, types of letters, and universal rules for splitting

     Here I am not discovering America, yet there is some personal contribution of mine to the topic. I touch on these matters in my idea of a universal world alphabet, and they must be kept in mind because, strictly speaking, there are simply no universal rules for splitting that hold for all languages, not even for all European ones! In Russian or Bulgarian the rules are relatively simple, but in English this is not so at all: there are dictionaries that give the splitting for every word, based mainly on syllables, and even there opinions can differ. German has its own rules, usually by syllables, but some letter groups must never be split, since it normally uses 2-3 characters (from here on I will write simply chars) for one letter (a consonant), and in one case even 4. In English (and French), on the other hand, it is quite ugly to split the vowels, because several of them are read together. Because of this I first asked myself: what is a syllable, what characterizes it, and how can it be recognized?
     Well, it turned out that a syllable usually, not always but in most cases (say, 90% of them), begins with a consonant (which I will shorten to C., just as a vowel will be V.) and ends however it happens; blessed are the languages where the Vs are simple. This brings us to my views on what kinds of V. sounds exist at all (in the world), because we must somehow recognize, without dictionaries of course (we cannot use dictionaries for every language in the world), where the V. acts as one (although several letters may be written) and builds a single syllable, and where there are several separate Vs. For example, the word "piano" has 3 syllables for the Slavs, but in English "ia" must not be split; the same is true of "ready" or "beauty". Taking this into account, I came to the conclusion that there are three types of Vs (chiefly Vs, though this can be applied to the Cs too), namely: basic, modified, and diphthongs or combined Vs.
     The basic Vs are only six: "e", "i", "o", "u", "a", and the Bulgarian "ъ", which is the same sound as in the English "girl" (I cannot indulge in profound explanations here) and which I will mark as "ə". The modified Vs arise when one wants to say one V. but says another, as in your "but", which is pronounced 'bəat', or in the classical (i.e. from Roman times) case of "ae", which is read as 'ae' (as in your "hand"). I suppose you have guessed why I use a subscript, and also that single quotes show how words are pronounced, while double quotes show how they are written. Russian has its own modified Vs, for example the usual "e", which is pronounced 'ie' (so that their "devka", girl, sounds more like the Ukrainian pronunciation 'divka'); also their "donkey" sound 'əi' ('ы' in Cyrillic), as in 'məi', we; also their unstressed "o", which is pronounced nearer to "a", or exactly (in my opinion) like the English 'əa' (as in 'əakno', window, written "okno"; the alphabet is different, but I have to write it somehow in Latin letters only). There is no need to go into more detail, because in this program I do no such analysis, but it explains why I adopted my concept of splitting Vs, which I give below. Diphthongs, in my understanding, are combinations in which several Vs are actually sounded, as in your "pear", read 'peə' (I say "in my understanding" because some people, even linguists, also call the French "ai" a diphthong, although it is read as a pure 'e'). Russian and some other languages have no such Vs except the classical 'aj', 'ej', etc. (here "j" denotes the semi-C. "jot", as in the English "may", read 'maj').
     So now we come to the point that the letters must be distinguished: for every letter of both alphabets the program must store a characteristic of its type (what to do with the various "chicks", the marks placed mainly above letters, I will say towards the end of the paper), and these types must be three! That is: Vs, Cs, and undefined, which are still letters, not other chars. The undefined letters (if I have not forgotten something, but in any case this question has to be settled somehow), say the Cyrillic 'j' (or the German one, as in their "Johan"), or the Latin "h", and also "w" (especially in English), act as modifiers of the previous letter and must be joined to it, or take on the same characteristic as it has. (What is to be done when such a symbol stands at the beginning of a word I cannot recall exactly, but this is not important, because a single letter must not be left alone on a line anyway.)
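     The three letter types can be illustrated with a small table lookup. This is only a sketch in Python, not the original DOS program: the names `Kind`, `classify` and `pattern` are hypothetical, the letter tables are assumed and deliberately incomplete, "y" is counted as a vowel by assumption, and a letter missing from the tables is treated as undefined.

```python
from enum import Enum


class Kind(Enum):
    VOWEL = "V"
    CONSONANT = "C"
    UNDEFINED = "U"  # a letter that only modifies the previous letter
    OTHER = "-"      # not a letter at all (digit, punctuation, space)


# Assumed, deliberately incomplete tables for the Latin and Cyrillic alphabets.
VOWELS = set("aeiouy") | set("аеиоуъыэюя")
CONSONANTS = set("bcdfgklmnpqrstvxz") | set("бвгджзклмнпрстфхцчшщ")
UNDEFINED = set("hjw") | set("й")


def classify(ch: str) -> Kind:
    """Return the type of a single character."""
    low = ch.lower()
    if not low.isalpha():
        return Kind.OTHER
    if low in UNDEFINED:
        return Kind.UNDEFINED
    if low in VOWELS:
        return Kind.VOWEL
    if low in CONSONANTS:
        return Kind.CONSONANT
    # Assumption: a letter that is in none of the tables counts as undefined.
    return Kind.UNDEFINED


def pattern(word: str) -> str:
    """Spell a word as a string of type marks, one per character."""
    return "".join(classify(ch).value for ch in word)


print(pattern("Johan"))  # UVUVC
print(pattern("окно"))   # VCCV
```

     Keeping the tables as plain data means that a new alphabet only needs new entries, not new code.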
     Once the syllable is more or less correctly recognized, we can move on to the splitting itself, to separating a full-fledged syllable, which we will consider in the next section, about my implementation. Yet I want to stress that this program works, as the saying goes, heuristically, i.e. it gives a solution that is good enough but not necessarily ideal or grammatically correct. (For example, it turns out that the Latin "ck" should be read as "kk" and the splitting should take this into account, leaving one letter "k" on each line, but I think this is an unnecessary luxury.)

     2. How I did the actual splitting of the words

     Well, we start from the end of the word that does not fit into the given line length. Working only with chars, moving from the end of the word towards its beginning and keeping track of the number of chars in both parts, we detach a possible syllable (there may be several variants), add the dividing hyphen, and check whether the first part of the word can now be placed on the line. If not, we continue further, at first without starting a new syllable and then, if necessary, beginning to form a new one, after which we check again whether the split can be made. This goes on until the minimum allowed number of chars at the beginning of the word is reached (this is 2, but 3 can also be required, or that words of fewer than 5 chars are not split; I usually split 4-char words too), and then the word is left as it is and a new line begins with it. Once the word has been processed, whether or not a split was found, something has been taken away from the initial line, so it still remains to expand that line to the right margin by adding one space at a time between the words, again moving from the end of the line to its beginning.
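     The final step, expanding a line to the right margin, might look as follows. This is a sketch under assumptions: the function name `justify` is hypothetical, all chars are taken to be equally wide (as in DOS), and when one pass over the gaps is not enough the sketch simply starts again from the last gap.

```python
from typing import List


def justify(words: List[str], width: int) -> str:
    """Pad the gaps between words, last gap first, until the line is `width` chars."""
    if len(words) < 2:
        return " ".join(words)  # nothing to expand
    gaps = [1] * (len(words) - 1)  # one space between neighbours to begin with
    extra = width - (sum(len(w) for w in words) + len(gaps))
    i = len(gaps) - 1
    while extra > 0:
        gaps[i] += 1
        extra -= 1
        # Move towards the beginning of the line; wrap around if needed.
        i = i - 1 if i > 0 else len(gaps) - 1
    line = words[0]
    for gap, word in zip(gaps, words[1:]):
        line += " " * gap + word
    return line


print(repr(justify(["ab", "cd", "ef"], 9)))  # 'ab cd  ef'
```

     The last line of a paragraph would normally be left unexpanded; that decision belongs to the caller.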
     Naturally, this needs procedures for recognizing the end of a word (in the phonetic sense), such as a space, punctuation marks, or other symbols. Also, I think, words that are not pure words must be left as they are, i.e. those containing digits or special symbols, or mixing alphabets, for these can be passwords, new symbols, and so on. It is equally natural that the program must be given the line length we want to obtain (70 by default, I think), the number of chars that may be left at the end or at the beginning of a line (when a syllable is found there, because otherwise the search continues further), and some other parameters. But other important things must be specified too. So far we have only said that it is unclear whether two Vs should be split, or whether a sequence of Cs may be separated, without explaining exactly how the question is to be decided, and we cannot leave the program to stand and think, like Buridan's donkey, about which haystack to start eating from. So let us proceed to this.
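     The test for a "pure" word can be sketched with Python's standard `unicodedata` module, taking the first word of each character's Unicode name ("LATIN", "CYRILLIC", ...) as its alphabet. The function names `script_of` and `is_pure_word` are hypothetical, and using the character name as a stand-in for the alphabet is an assumption made for brevity.

```python
import unicodedata
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Return a rough alphabet label for a letter, or None for a non-letter."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def is_pure_word(token: str) -> bool:
    """True if the token consists only of letters from one single alphabet."""
    scripts = set()
    for ch in token:
        script = script_of(ch)
        if script is None:  # digit, punctuation, special symbol
            return False
        scripts.add(script)
    return len(scripts) == 1


for token in ["window", "окно", "pass123", "oкно"]:  # the last one mixes alphabets
    print(token, is_pure_word(token))
```

     A token that fails this test would simply be moved whole to the next line and never split.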
     Here I introduced the following heuristic rules. Vowels as a rule are not split, unless we allow some exceptions, and then a list of exceptions has to be entered (I think it was done this way, but I am not 100 percent sure; maybe I entered a list of prefixes ending in a V.). Consonants as a rule are split, but a list of combinations of Cs that must not be split has to be entered (here I am convinced that it was so). Undefined symbols (neither Vs nor Cs) must go with the previous char, i.e. a split may be made after them. Entering a list of such things into a DOS program is not so easy, but I managed it (at worst they have to be written as variants in the declaration of constants, followed by a new compilation). So the program must be given a list of non-splitting combinations, such as the German "sch", also "ch", "ck", "sh", "ph", etc. (10-15 combinations, not really many), and a list of splittable combinations, chiefly of Vs: say, in German it is permissible to split "ea" (but "ae" is still not allowed), or maybe we will want to split "eu" or "ia".
     The program then works in this order: if we cannot split at the current position, we skip it and go further (towards the beginning of the word); if the position is between Vs that may be split, we split them and finish; otherwise (which happens most often) we always split equal Cs (in some languages these occur frequently), always split V-C, sometimes split C-V (but if another C. stands before it, we split there instead), never split before an undefined char (as I have said), and that seems to be all.
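     The rules above can be put together roughly as follows. This is a reconstruction for illustration, not the original code: the names `break_points` and `split_word` are hypothetical, the letter tables are assumed and incomplete (any letter not listed counts as a consonant here, for brevity), and the "sometimes split C-V" case is simply never taken in this sketch.

```python
from typing import List, Optional, Tuple

VOWELS = set("aeiouy") | set("аеиоуъыэюя")        # assumed table
UNDEFINED = set("hjw") | set("й")                 # modifiers of the previous letter
NO_SPLIT = ("sch", "ch", "ck", "sh", "ph")        # combinations that stay together


def kind(ch: str) -> str:
    if ch in UNDEFINED:
        return "U"
    return "V" if ch in VOWELS else "C"


def break_points(word: str, splittable_vowels=frozenset(),
                 min_left: int = 2, min_right: int = 2) -> List[int]:
    """Indexes i such that word[:i] + '-' / word[i:] is an allowed split."""
    low = word.lower()
    points = []
    for i in range(min_left, len(low) - min_right + 1):
        if kind(low[i]) == "U":
            continue  # never split before an undefined letter
        if any(low.startswith(combo, start)
               for combo in NO_SPLIT
               for start in range(max(0, i - len(combo) + 1), i)):
            continue  # the break would fall inside a non-splitting combination
        prev, cur = kind(low[i - 1]), kind(low[i])
        if prev == "U":
            allowed = True                               # splitting after it is fine
        elif prev == "V" and cur == "V":
            allowed = low[i - 1:i + 1] in splittable_vowels
        else:
            allowed = (prev, cur) in (("V", "C"), ("C", "C"))  # C-V is skipped
        if allowed:
            points.append(i)
    return points


def split_word(word: str, room: int, **options) -> Optional[Tuple[str, str]]:
    """Try break points from the end of the word; `room` is the space left on the line."""
    for i in reversed(break_points(word, **options)):
        if i + 1 <= room:  # the left part plus the hyphen must fit
            return word[:i] + "-", word[i:]
    return None  # no split fits: the whole word goes to the next line


print(split_word("window", 5))                           # ('win-', 'dow')
print(break_points("piano"))                             # [3]
print(break_points("piano", splittable_vowels={"ia"}))   # [2, 3]
```

     Being heuristic, the sketch also permits breaks such as "wi-ndow" when the line has room for nothing longer; raising `min_left` to 3 is one way to suppress them.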
     It is good that in DOS there are no special chicks above the letters, and that all letters are equally wide. But on the screen, as well as when printing under Windows or another contemporary operating system, all these problems exist, so the next section has to be written.

     3. What has to be done when the letters are of various sizes, types, and fonts?

     Here I will mainly guess, because I have no experience of programming in another operating environment, but I think I will guess rightly: there simply must exist special procedures that will do the work for you. You all know well that when you type something into a file, you write on the given line for as long as this is possible, and when it is no longer possible you automatically find yourself on the next line. That's it. But what these procedures are, where they are and how they are written (most probably in Visual Basic), and how they are called (i.e. their names and parameters), is unknown to me, for the simple reason that nowadays one has to pay for everything, no matter whether one does something useful for others or wants to gain something only for oneself. I used to work for others, but when, with the coming of democracy, I was forced to work only for myself, I ceased to work for others and began to write various things per il mio diletto (for my personal pleasure, in Italian), right?
     But well, so be it, I will tell you my opinion on these problems too. The program has to be called by pressing some button on the panel of your editor or browser, and it must be possible to adjust it by entering new non-splitting sequences, new splittable Vs (or prefixes, or syllables), and also new Vs and Cs, because there can be various alphabets. This brings us to the question of what to do with letters that have something above or below them: the program must recognize these letters too, no matter how many there are (and in principle they were recognized long ago, because they are treated correctly when indexes are built in Word), and this for every alphabet. Maybe it will also be necessary to record for each letter which alphabet it belongs to, in order to recognize mixed words, but maybe this can be omitted, because there are a heap of languages that add a couple of symbols to the main ones (say, "i" in Ukrainian). Then, when an unknown letter (but not just any symbol) is met, the program has to treat it as an undefined letter and allow splitting only after it.
     As to the various fonts and the ways of adorning letters that change their size, the operating system itself must be responsible for this, through a call to some function (say, DoesFitInLine (string, linelength) : Boolean). But so that one can correct what the program has done, it must be possible to call it only for the marked area, if there is one, according to the situation on the screen (i.e. to continue the paragraph), and to obtain all parameters for it from the current paragraph (i.e. calls to other service functions and procedures of the operating system or the browser are needed here). Ah, texts in tables also have to be treated according to their own paragraph parameters.
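     To show how such a function would replace the simple counting of chars, here is a sketch with the width measurement passed in as a parameter. Everything in it is hypothetical: `does_fit_in_line` only mirrors the function name suggested above, and `toy_width` uses made-up numbers in place of the real font metrics that an operating system or browser would supply; kerning is ignored.

```python
from typing import Callable


def does_fit_in_line(text: str, line_width: float,
                     char_width: Callable[[str], float]) -> bool:
    """True if the summed widths of the chars do not exceed the line width."""
    return sum(char_width(ch) for ch in text) <= line_width


def toy_width(ch: str) -> float:
    """Made-up widths in pixels, standing in for real font metrics."""
    if ch in "ilj.,'|":
        return 3.0
    if ch in "mwMW":
        return 9.0
    return 6.0


print(does_fit_in_line("ill", 9, toy_width))   # True
print(does_fit_in_line("mm", 17, toy_width))   # False
```

     With a constant width of 1 for every char, the same function reduces to the char counting of the DOS version.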
     But mark this: however imperfect such a program is, it will in any case be much better than the present state of things, on which we will dwell in the next section. And if some of you want to object that this happens in real time, and that one can change the width of the window with a key press (or with the mouse), which will trigger reformatting of the whole document, then I will answer briefly: this is not at all significant for today's computers. I personally (until 2015) work on a computer with only 256 MB of RAM, on which I cannot watch any movies; the hard disk has only 10 GB, of which I keep about 1/3 free; the processor speed is about 600 MHz; and under these conditions, when I edit a whole book of roughly 200 pages, I can run a spell check in a couple of seconds, and that is certainly more difficult than my program for splitting words. But even if such processing takes time, for printing this does not matter, and for showing on the screen it suffices to process the first couple of thousand symbols until the screen fills, and to continue in the background; besides, one can calmly bet that contemporary computers are at least an order of magnitude faster than my own. So the point here is not the slowdown but the lack of desire, and the failure to pay the necessary attention to the problem.

     4. And is it worth bothering at all?

     Of course it is worth it, because for all the fantastic power of contemporary computers and software products, without such a program we live as if in the ... stone age, or at least a thousand years before Christ, when people chiseled letter after letter in stone, paying no attention to spaces, punctuation marks, paragraphs, or pages! It really is so. Even messages sent to mobile phones are laid out exactly like this, with the words broken wherever it happens and without any hyphen (say, I could easily have received the following message on my phone: "Hello, Mister M|yrski, you have i|nvoice for used se|rvices for the amo|unt of lv 12.3|4." ... and so on). And even without such a program, it would be enough to accept one symbol as an "optional hyphen" and to interpret it correctly on every terminal device, with all messages written with it inserted wherever it may be needed; but to this day, as far as I can see, nothing of the kind is in use there.
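     How a terminal device could interpret an optional hyphen is sketched below. Unicode does define a character with this meaning, the soft hyphen U+00AD; the wrapping function `wrap` itself is hypothetical, counts every char as equally wide, and emits a word overlong if it cannot be fitted at all.

```python
from typing import List

SHY = "\u00ad"  # Unicode SOFT HYPHEN: invisible unless a line breaks at it


def wrap(text: str, width: int) -> List[str]:
    """Wrap text to `width` chars, breaking inside words only at soft hyphens."""
    lines, line = [], ""
    for word in text.split():
        parts = word.split(SHY)
        while parts:
            sep = " " if line else ""
            whole = "".join(parts)
            if len(line) + len(sep) + len(whole) <= width:
                line += sep + whole  # the rest of the word fits as it is
                parts = []
                continue
            # Otherwise look for the longest beginning that fits with a visible hyphen.
            for n in range(len(parts) - 1, 0, -1):
                head = "".join(parts[:n]) + "-"
                if len(line) + len(sep) + len(head) <= width:
                    line += sep + head
                    parts = parts[n:]
                    break
            else:
                if not line:  # nothing fits even on an empty line
                    line, parts = whole, []
            lines.append(line)
            line = ""
    if line:
        lines.append(line)
    return lines


message = "Hello, Mis\u00adter Myr\u00adski"
print(wrap(message, 11))  # ['Hello, Mis-', 'ter Myrski']
print(wrap(message, 10))  # ['Hello,', 'Mister', 'Myrski']
```

     The optional hyphens that are not used never appear in the output, which is all that is asked of the device.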
     Well, all this started with English, where the conditions for splitting are complicated and not clear to the average person, and where the words are significantly shorter (probably by 3-4 symbols) than in Russian or German. But unsplit text is simply not nice, and it means about 10 percent of paper wasted needlessly if the texts are actually printed. I suppose there is another reason as well: people in commerce are used to perfection, and when it is difficult to achieve they say, let us do without the thing (which, in my opinion, is judgement on the level of kindergarten). Even the simplest splitting, after the last of a sequence of Vs (i.e. before a C.), together with splitting of some Cs, would have been quite acceptable.
     Yet an alternative approach is possible: to start a grammatical analysis, in which the full splitting of each word is stored. But this would not only require working on every language separately, which is not at all universal (and I like universality), but would hardly be faster; grammar is far more complicated, I think. Or else each file could be passed through a program (once, and not in real time) which, based on a complete dictionary of all possible words and their variants (because of cases, genders, plurals), adds all possible optional hyphens, and then the browsers are only required not to show those hyphens on which a line does not end. This is also a very good variant, at least for official documents such as laws and other regulations, but for some reasons of their own the internet people have decided to abolish this special sign, together with the end of line or section, et cetera. If the browsers allowed it, writing a dictionary with the splitting of words would be only a tedious task, not at all a difficult one. But they have done as they liked, only so as not to sit still, and are content, at least in some respects, with the abilities of the stone age mentioned above.

     Well, gentlemen, as you wish. I have given the example and have written a decent program, and if somebody wants to revive the idea for every editor or browser, let him (or her) try. I personally don't intend to occupy myself with programming, for the simple reason that I have no time. Briefly put: I am little read, and because of this I have begun to translate myself, first into Russian, then (at the moment still) into English; I am thinking of partially translating my things into German too; in addition I have work planned for at least the next 5 (but maybe a whole 10) years, while according to the statistics I have approximately 5 years of life at my disposal. That's the situation. But if some company thinks of taking this up, let it first send me 1,000 euros, to make the sum round, merely for the presentation of this quite realizable idea, and then we will see what more they may require, and whether I will be in a position to offer it to them, and on what conditions.

     Nov 2014






