"Three in a Boat, Poverty and Dogs", or how Antiplagiat detects paraphrase
A new academic year has begun. Students have received their timetables and have started to think about the coming exam session. Coursework, diploma theses, articles and dissertations are not far off, and with them come checks of texts for borrowings, verification reports, and other headaches for students and administrators. For hundreds of thousands of people (no joke, we counted!) the natural question has already arisen: how to fool Antiplagiat. In our case, almost all methods of deception involve distorting the text in some way. We have already taught Antiplagiat to find text "distorted" by translation from English into Russian (we wrote about this in the first article of our corporate blog). Today we will talk about how to detect the most effective, albeit laborious, way of distorting a text: paraphrase. (On paraphrasing techniques, see for example the work of Alberto Barrón-Cedeño.)
Using the well-known story "Mumu" as an example (like the title of this article, it also features a dog, people and a boat :-)), let's take a closer look at what can be done to a text so that its meaning is preserved while the sentences look different.
1. The first thing that comes to mind is to replace most of the words with synonyms. This is the simplest thing you can do with a text: the meaning does not change, while at first glance the text does. Synonymizer programs use exactly this trick. They replace words without regard to context, simply picking a word from a list of synonyms, so a sentence processed by such a program very often looks rather absurd. Periphrasis also belongs to this method of paraphrase. Periphrasis is a descriptive designation of an object that highlights one of its qualities or features, for example "blue planet" instead of "Earth", or "one-armed bandit" instead of "slot machine".
The lady began to call her over in a gentle voice.
The noblewoman began to summon her in a friendly voice.
2. Replacing some parts of speech with others also allows you to change the structure of the sentence. For example, very often a verb is replaced by a noun and vice versa.
One fine summer day the lady was pacing the drawing room with her hangers-on.
The lady's stroll with her hangers-on took place on a fine summer day.
3. Another simple way to change the structure of a text is to split sentences into simpler ones or, conversely, to combine them into longer ones.
Gerasim was a little surprised, but he called Mumu, picked her up from the ground and handed her to Stepan.
Gerasim was a little surprised, but he called Mumu. He picked her up from the ground and handed her to Stepan.
4. The passive voice changes a sentence substantially and in a rather original way.
The lady ordered her senior hanger-on to be called.
The senior hanger-on was summoned by the lady.
These are just the typical techniques. Obviously, a good paraphrase is very difficult to detect. Sometimes only specialists with deep knowledge of the subject area of the text can do it. But the task we are solving does not require this. After all, deep rephrasing takes considerable effort and therefore a lot of time. Most likely, it will be easier for students to write the work themselves than to waste time on seriously rephrasing someone else's text, which, despite the cost, may still be detected during the check.
Therefore, our target is relatively simple paraphrase, the kind that can be done "with the spinal cord", i.e. without much expenditure of thought or time.
In fact, rephrasing is a sister of translation into another language. The words change, but the meaning remains. We can say that paraphrasing a Russian-language text is actually translating it from Russian into Russian.
That is why the paraphrase detection algorithm turned out to be a "close relative" of the algorithm for detecting translated borrowings. So, here is how borrowings are detected in this case:
1. The Russian-language document to be checked arrives at the input.
2. The Russian text is machine-translated into English.
3. Candidate sources of borrowing are searched for in the indexed collection.
4. Each candidate found is compared with the English version of the document being checked, and the boundaries of the borrowed fragments are determined.
5. The boundaries of the fragments are transferred to the Russian version of the document being checked. When the process is complete, a verification report is generated.
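The steps above can be sketched roughly as follows. This is only an illustration under strong assumptions: `translate_to_english` is a hypothetical stub that returns the text unchanged (a real system would call a machine translation model), the index is a toy in-memory word index, sentences are compared by word overlap (Jaccard) rather than by whatever the production system uses, and all function names are invented for this example.

```python
import re
from collections import defaultdict


def translate_to_english(text):
    """Hypothetical stub for step 2; a real system would call an MT model."""
    return text


def split_sentences(text):
    """Split text into (start, end, sentence) triples with character offsets."""
    matches = re.finditer(r"[^.!?]+[.!?]?", text)
    return [(m.start(), m.end(), m.group()) for m in matches if m.group().strip()]


def words(sentence):
    """Set of lowercased word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def build_index(sources):
    """Toy inverted index: word -> ids of source documents containing it."""
    index = defaultdict(set)
    for doc_id, text in sources.items():
        for w in words(text):
            index[w].add(doc_id)
    return index


def find_candidates(text, index, min_shared=3):
    """Step 3: sources sharing at least `min_shared` distinct words with the text."""
    hits = defaultdict(int)
    for w in words(text):
        for doc_id in index.get(w, ()):
            hits[doc_id] += 1
    return [doc_id for doc_id, n in hits.items() if n >= min_shared]


def jaccard(a, b):
    """Word-set overlap, used here as a stand-in sentence similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0


def check_document(russian_text, sources, threshold=0.6):
    """Steps 1-5: return (start, end, source_id) spans in the checked document."""
    english = translate_to_english(russian_text)           # step 2 (stubbed)
    index = build_index(sources)
    report = []
    for doc_id in find_candidates(english, index):          # step 3
        source_sents = [words(s) for _, _, s in split_sentences(sources[doc_id])]
        for start, end, sent in split_sentences(english):   # step 4
            sent_words = words(sent)
            if any(jaccard(sent_words, s) >= threshold for s in source_sents):
                # Step 5: with the identity stub, offsets map back unchanged.
                report.append((start, end, doc_id))
    return report
```

With a real translation step, step 5 would additionally need an alignment between the translated and the original sentences; the stub sidesteps that entirely.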
An important difference is that the algorithm's parameters are tuned on different data and with the specifics of the Russian language in mind. In doing so, we keep the tuning strategy oriented towards precision at the expense of recall. Our task is to minimize the number of false positives, even if that means missing "some targets".
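A strategy that favors few false positives over finding everything can be read as choosing a decision threshold by precision first and recall second. The sketch below is an assumption about how such tuning could look, not a description of the production procedure; the function names and the `target_precision` parameter are invented for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when pairs with score >= threshold are flagged.

    `labels` holds 1 for a true borrowing and 0 otherwise.
    """
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_pos = sum(flagged)
    total_pos = sum(labels)
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall


def pick_threshold(scores, labels, target_precision):
    """Lowest threshold whose precision reaches the target.

    Recall can only fall as the threshold rises, so the lowest qualifying
    threshold keeps as much recall as the precision target allows.
    """
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall(scores, labels, threshold)
        if precision >= target_precision:
            return threshold, precision, recall
    return None
```

Raising `target_precision` pushes the threshold up: fewer sentence pairs are flagged, fewer of them wrongly, and some genuine paraphrases are missed.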
"Haute couture" tuning
Paraphrase is, of course, a laborious way of distorting a text. At the same time, not all rewriting methods are equally good at making the text unrecognizable. Trying to save time, the author uses the simplest methods of text modification, which the system's algorithms detect and which therefore bring no result. So, after the first unsuccessful attempt to inflate the originality score, the author begins to "tune" the text. It works like this: different combinations of methods are applied, and after each combination the modified text is uploaded to the system to check how successful the paraphrase was and whether the user has reached the coveted percentage of originality. The result is a chain of texts, each paraphrased to a different degree. Extracting such a chain is a fairly simple engineering problem. Our study of these "chains" revealed the most frequent modification methods (at the same time confirming the results of the same Alberto Barrón-Cedeño) and provided rich material for training new algorithms.
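Extracting such chains could look roughly like this. The sketch assumes a hypothetical upload log of `(user_id, timestamp, text)` records and uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the criteria the real system uses are not described here, and `min_similarity` is an invented parameter.

```python
from difflib import SequenceMatcher
from itertools import groupby


def extract_chains(uploads, min_similarity=0.5):
    """Group each user's uploads into chains of successive rewrites.

    `uploads` is an iterable of (user_id, timestamp, text) tuples.
    Consecutive uploads by the same user stay in one chain while they
    remain similar enough; otherwise a new chain starts.
    """
    chains = []
    ordered = sorted(uploads, key=lambda u: (u[0], u[1]))
    for _, group in groupby(ordered, key=lambda u: u[0]):
        current = []
        for _, _, text in group:
            if current:
                ratio = SequenceMatcher(None, current[-1], text).ratio()
                if ratio < min_similarity:
                    chains.append(current)
                    current = []
            current.append(text)
        chains.append(current)
    # Only chains with at least two versions show any "tuning".
    return [chain for chain in chains if len(chain) > 1]
```

Each resulting chain is a list of versions of one text in upload order, which makes it possible to compare neighboring versions and see which modifications were applied.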
Let's do a little experiment. Take a short extract from Turgenev's already mentioned story:
An hour after all this commotion the door of the little room opened and Gerasim appeared. He was wearing his festive caftan; he led Mumu on a string. Eroshka stepped aside and let him pass. Gerasim went to the gate. The boys and everyone else in the courtyard followed him with their eyes in silence. He did not even turn around; he put on his cap only in the street. Gavrila sent the same Eroshka after him as an observer. Eroshka saw from afar that he had entered the tavern with the dog, and began to wait for him to come out.
Let's try to fool Antiplagiat. First, let's try an automatic text synonymizer. Such programs are not known for their quality: they just take words and replace them with synonyms from a dictionary, without taking the context into account. As a result, texts processed by such a program often look rather clumsy. Here is what came out after processing with one of these programs:
After a while after this anxiety the doors of the kennel resolved, and Gerasim introduced himself. He was wearing a solemn caftan; someone led Mumu in a string. Eroshka stepped aside and let him make his way. Gerasim rushed to the gate. Boys and all without exception the former in the courtyard accompanied him with their eyes, without saying a word. He even did not turn around in any way: he put on his headdress only in the street. Gavrila sent behind him the same Eroshka in the property of an observer. Eroshka saw from afar, the fact that someone joined the tavern together with the dog, and began to wait for his release.
Note that at least one word has been replaced in each sentence. Such a seemingly small change is enough for the "ordinary" Antiplagiat to stop matching the rewritten sentences to the original.
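A context-free synonymizer of this kind, and its effect on exact matching, can be illustrated with a toy sketch. The synonym dictionary below is invented for the example, and "verbatim comparison" is modeled here simply as exact sentence equality, which is an assumption and not how Antiplagiat actually implements it.

```python
import re

# Hypothetical toy dictionary; real synonymizers use large word lists.
SYNONYMS = {
    "appeared": "introduced himself",
    "festive": "solemn",
    "went": "rushed",
    "cap": "headdress",
}


def synonymize(text, synonyms=SYNONYMS):
    """Replace each known word with its synonym, ignoring context entirely."""
    def swap(match):
        word = match.group()
        return synonyms.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)


def verbatim_matches(original_sentences, rewritten_sentences):
    """Count rewritten sentences that still match an original exactly."""
    originals = set(original_sentences)
    return sum(1 for s in rewritten_sentences if s in originals)
```

A single swapped word per sentence is enough to drive the number of exact matches to zero, which is why a comparison based on meaning rather than on surface form is needed.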
Now let's compare pairs of sentences from the source text and the rewritten text using our algorithm. For this we use the cosine similarity measure. As in the algorithm for detecting translated borrowings, each sentence is represented as a high-dimensional vector. By measuring the cosine of the angle between a pair of such vectors, we can tell how "similar" the vectors are to each other and, accordingly, how similar the sentences they represent are.
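For reference, the cosine measure is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for plain word-count vectors. This is a simplifying assumption: how the high-dimensional sentence vectors are actually built is not described here, and with simple word counts a synonym swap lowers the similarity more than it would with vectors that capture meaning.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words count vector (a stand-in for the real sentence vectors)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For count vectors the value lies between 0 (no words in common) and 1 (identical word counts).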
Here's what happened after comparing the sentences with our algorithm:
For clarity, we show the value of the cosine as a heat scale: the "hotter" the color between a pair of sentences, the greater the cosine and the more similar the sentences in that pair. Note that the lowest cosine values belong to the sentences in which the substituted synonyms fit the context very poorly. For example, "so" and "thus and" really are very often synonymous, but in this context such a replacement is completely out of place.
Now let's play the role of the synonymizer ourselves and rewrite the text while preserving its meaning. Unlike the program, though, all our changes are grammatically consistent and fit the context well. Here is what we got:
In this case too, the algorithm gives a fairly high similarity score for most of the sentences. The sentences that received a low score had undergone a rather deep transformation: their grammatical structure was changed substantially. Even a person glancing over them quickly would not immediately say whether these sentences are similar.
So what do we do with all this?
Naturally, the best way to understand whether a new algorithm works is to study its quality on real data. So we put the new paraphrase detection module into production and ran real requests through it (without yet showing the results to users). The works were checked both by the current borrowing search algorithm, "verbatim comparison", and by the new algorithm, "paraphrase detection". Then we compared about 10 thousand reports on checks of uploaded works produced by both algorithms. The results turned out to be interesting.
This graph shows the distribution of the percentage of borrowing for both algorithms. It can be seen that "paraphrase detection" finds on average 10 percent more borrowing than "verbatim comparison".
On the second chart, the horizontal axis shows the absolute difference between the percentage of borrowings found by the proposed algorithm and by the current one. A difference greater than 0 means that "paraphrase detection" found more than "verbatim comparison".
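The quantity on the second chart is straightforward to compute from paired reports. The sketch below only illustrates the arithmetic; the function names are invented, and any percentages passed to it are the caller's own data, not figures from the experiment described here.

```python
from statistics import mean


def borrowing_differences(verbatim_pct, paraphrase_pct):
    """Per-document difference: new algorithm minus current one, in points."""
    return [new - old for old, new in zip(verbatim_pct, paraphrase_pct)]


def summarize(differences):
    """Mean difference and share of documents where the new algorithm found more."""
    found_more = sum(1 for d in differences if d > 0)
    return mean(differences), found_more / len(differences)
```

A histogram of the values returned by `borrowing_differences` corresponds to the second chart, and the first value returned by `summarize` to the average gap between the two algorithms.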
Paraphrase as a way of distorting text really is used when writing works;
The number of detections did not grow radically: the algorithm finds text that really is paraphrased;
As with translated borrowings, the Antiplagiat system has received a new module, a paraphrase detection system;
And of course, our classic: it is better to create with your own mind!
The architecture of the paraphrase detection algorithm and the first results were presented at the Big Scholar workshop, devoted to the analysis of scientific data, which this year was held as part of one of the main machine learning conferences, KDD 2018.
The paraphrase detection module is deployed in production and is already used by teachers and students when checking texts for borrowings.
The article was prepared in co-authorship with Rita_Kuznetsova, Oleg_Bakhteev, Kamil Safin and chernasty. The original image used for the lead illustration was taken from demotivators.cc.