Identifying content-rich profiles on VK

Telling bots from people really is hard. I can't reliably do it myself. But I did come up with a decent homemade
method for telling "interesting people" from "not very interesting" ones on VK. In terms of online communication, of course, not in life.
 

 
VK has restricted the ability to download the contents of users' walls, and that is starting to hurt. That is, it is still possible, but you have to refine, optimize, and dodge a great deal to get around the restrictions.
 
 

The basic idea


 
The main idea is that bots, dull (in the online sense) personalities, and all sorts of mass collectors of friends and subscribers do not care much about who is in their friend list, even though they may write plenty of content-rich posts on their own walls. These sad people don't really read their feed, and they have no need to. Mass collectors of subscribers and celebrities need it even less.
 
 
But for people who have at least some communicative interest in VK, it matters a great deal who is in their friend list. And, of course, they will not collect 6000 friends who are capable only of reposts, pictures of naked women, and ads for discount barrels from a warehouse in Novy Urengoy.
 
 
On this basis, you can try to build a criterion for selecting people who care about the content of their feed. Such people show the traits of a real person: someone who, at a minimum, performs a meaningful one-way communicative act. These days that is no small thing.
 
 
Two criteria came to mind right away:
 
 
The average vocabulary size of the person's friends over their last N posts.
 
The percentage of posts without text among the friends of the person being tested.
 
With something like this in hand, you can already try to build a model that distinguishes interesting people from not very interesting ones.
 
 

And how did I test this?


 
I picked 50 random friends and 50 random subscribers who met certain criteria meant to cut off obvious fakes, children, and people who do not really use any of this. For example, the user must not be deactivated and must have more than 50 existing friends.
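
The selection filter described above could be sketched roughly as follows. This is a minimal illustration, not the code used for the experiment: the dict keys `deactivated` and `friends_count` and both function names are assumptions made for the example, and fetching the users from VK is left out entirely.

```python
import random


def eligible(user, min_friends=50):
    """Return True for users that are not deactivated and have enough friends.

    Assumption: `user` is a plain dict with an optional "deactivated" key
    and an integer "friends_count" key (illustrative field names).
    """
    not_deactivated = not user.get("deactivated")
    return not_deactivated and user.get("friends_count", 0) > min_friends


def pick_sample(users, k=50, seed=0):
    """Pick up to k random users among those that pass the filter."""
    pool = [u for u in users if eligible(u)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(pool, min(k, len(pool)))
```

The same function can be applied separately to the friend list and to the subscriber list to get the two groups of 50.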
 
 
I looked through all of these people and marked who was a "bot" and who was not. Naturally, most of the friends were real, and most of the subscribers were offering to sell something (though there were several real people among them).
 
 
Next, for each friend of a person being checked, I took the first 100 posts, if the wall had that many. For each person I computed two factors:
 
 
 
The average vocabulary size of the person's friends over their first 100 posts. That is, 50 friends with roughly 100 posts each. For each friend, all the words from the 100 posts are dumped into one pile and the number of unique words is counted. Then the average over all 50 friends is taken. The square root of this value is used: SQRT (Dic).
 
If more than 60 of a friend's 100 posts contain no words, the friend is marked as "lost". The percentage of "lost" friends is the second factor: Percent.
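
The two factors above can be computed along these lines. This is only a sketch under stated assumptions: the input is assumed to be already downloaded (one list of post texts per friend), words are split with a simple `\w+` regular expression, and the "lost" rule uses the absolute count of more than 60 wordless posts even when a friend has fewer than 100 posts. The names `friend_factors` and `words` are made up for the example.

```python
import math
import re

WORD_RE = re.compile(r"\w+")  # Unicode-aware, so Cyrillic words match too


def words(post):
    """Split one post into lowercase words."""
    return WORD_RE.findall(post.lower())


def friend_factors(friends_posts, max_posts=100, lost_threshold=60):
    """Return (SQRT(Dic), Percent) for one person.

    friends_posts: one entry per friend, each a list of post texts
    (an empty string stands for a post without text).
    """
    if not friends_posts:
        raise ValueError("need at least one friend")
    vocab_sizes = []
    lost = 0
    for posts in friends_posts:
        vocab = set()
        wordless = 0
        for post in posts[:max_posts]:
            post_words = words(post)
            if not post_words:
                wordless += 1
            vocab.update(post_words)
        vocab_sizes.append(len(vocab))  # unique words of this friend
        if wordless > lost_threshold:   # assumption: absolute count
            lost += 1
    n = len(friends_posts)
    sqrt_dic = math.sqrt(sum(vocab_sizes) / n)
    percent = 100.0 * lost / n
    return sqrt_dic, percent
```

Calling `friend_factors` once per person being checked gives the two numeric inputs for the regression.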
 
 
One more factor turned up by chance: the base-10 logarithm of the VK user ID, log10 (ID).
 
 
I trained a logistic regression on all of this and got the following:
 
 
log (OR) = ???-??? * log10 (ID) + ??? * SQRT (Dic) -??? * Percent
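
To make the shape of the formula above concrete, here is a small sketch that turns the three factors into a probability. The fitted coefficients are not given in the text, so they are left as parameters (`b0`, `b_id`, `b_dic`, `b_pct` are names invented for this example); only the signs follow the formula.

```python
import math


def log_odds(user_id, sqrt_dic, percent, b0, b_id, b_dic, b_pct):
    """log(OR) = b0 - b_id * log10(ID) + b_dic * SQRT(Dic) - b_pct * Percent.

    All four coefficients must be supplied by the caller; the article
    does not state their values.
    """
    return (b0
            - b_id * math.log10(user_id)
            + b_dic * sqrt_dic
            - b_pct * percent)


def probability(z):
    """Logistic function: convert log-odds into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary placeholder coefficients, for illustration only.
z = log_odds(user_id=1000, sqrt_dic=2.0, percent=10.0,
             b0=1.0, b_id=0.5, b_dic=2.0, b_pct=0.1)
p = probability(z)
```

With positive coefficients, a larger ID and a larger share of "lost" friends lower the score, while a richer vocabulary among friends raises it.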
 
 
On the test part of the sample this turned out to be a very good classifier, with AUC = ???. Here is its ROC curve:
 

 
ROC curve of the classifier that determines how content-rich a person's page is
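
For readers who want to reproduce a figure like the AUC above on their own labeled sample, one straightforward way is the pairwise definition: AUC is the share of (real person, bot) pairs in which the real person gets the higher score, with ties counted as half. The sketch below assumes labels are 1 for real people and 0 for bots; the function name `auc` is just for this example.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs, higher means "more likely a real person".
    labels: 1 for a real person, 0 for a bot (assumed encoding).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic, which is fine for a sample of a hundred people; a rank-based version would be preferable for much larger samples.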
 

 
The significance of the VK ID for classifying how content-rich a person is raises some questions, but alas, it does seem to work. The further the ID is from ?, the more likely it is that this is just a bot made to advertise microloans. Without the ID the classifier still works, but worse: AUC = ???. That is not exactly good, but not exactly useless either.
 
 
In any case, the final decision on the usefulness of a given character rests with the decision maker.
 
 

Additional check


 
I took all 5000 subscribers of one of my friends, where, of course, 95% is advertising junk, and ran the regression on them without additional training. With a cutoff of 20%, the results came out as TP = 78%, FP = 11%. That is, on an arbitrary person this also works more or less.
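
A check of this kind can be sketched as below. The reading of the numbers is an assumption: the cutoff is taken to be a probability threshold of 0.2, TP is the share of real people scored at or above it, and FP is the share of bots scored at or above it. The function name and the label encoding (1 for real, 0 for bot) are likewise made up for the example.

```python
def rates_at_cutoff(probs, labels, cutoff=0.2):
    """Return (true positive rate, false positive rate) at a cutoff.

    probs:  predicted probabilities of being a real person.
    labels: 1 for a real person, 0 for a bot (assumed encoding).
    """
    tp = fp = pos = neg = 0
    for p, y in zip(probs, labels):
        flagged = p >= cutoff  # classifier says "real person"
        if y == 1:
            pos += 1
            tp += flagged
        else:
            neg += 1
            fp += flagged
    if pos == 0 or neg == 0:
        raise ValueError("need at least one example of each class")
    return tp / pos, fp / neg
```

Lowering the cutoff catches more real people at the price of letting more bots through, so the threshold is a matter of taste for whoever uses the result.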
 
 

Can bots be made that pass this test?


 
Yes, it is easy enough to generate a bot that has some pseudo-meaningful posts and is surrounded by friends, but so far nobody needs that. Besides, you would have to bother with making the content varied, because if all the bots generate the same thing, that is also easy to recognize.
 
 

Is it possible to make an application that would check people by ID?


 
Probably it is possible, but I do not want to do it. If anyone wants to, let them do it themselves. The method is described, and the idea behind it is simple.
 
 

Isn't this too simplistic?


 
Quite. But it may come in handy for someone as a base for their own work. The method can easily be made more complex, for example by looking at the content itself and not just the vocabulary size. Here you can bring the full power of NLP to bear and train on content. You can also take more complex classifiers: trees, neural networks, etc. All of that can be bolted on, but the important thing is that even something simple gives an interesting result.