Theory and practice of using HBase

Good afternoon! My name is Danil Lipova. Our team at Sbertech has started using HBase as a data warehouse. While studying it, we accumulated experience that we wanted to systematize and describe (we hope many will find it useful). All the experiments below were carried out with HBase versions ???-cdh??? and ???-cdh???-beta1.
 
 
 
General architecture
 
Writing data to HBase
 
Reading data from HBase
 
Data caching
 
Batch data processing with MultiGet/MultiPut
 
Strategy for splitting tables into regions (splitting)
 
Fault tolerance, compaction, and data locality
 
Settings ...

Business asks for the right to personal data of users

 
Representatives of business, IT companies, banks and telecom operators have proposed amendments to the law "On Personal Data". If adopted, the amendments will give companies more control over user data. This was reported by Vedomosti, which has reviewed the text of the amendments.
 
discussed the future of data processing. As Leonid Tkachenko of MTS said:
 
 
We have three strategies:
 
 
Accumulate all customer data without exception, even if we do not yet understand how to use it. Storage technology is cheap enough to store everything.
 
Open up access to the data for researchers and ...

From a heavily loaded MPP DBMS to a vigorous Data Lake with analytical tools: we share the details of building it

All organizations that deal with data sooner or later face the question of how to store relational and unstructured data. It is not easy to find an approach that is at once convenient, efficient, and inexpensive, and that also lets data scientists work with the data successfully using machine learning models. We managed it, and although we had to tinker, the final payoff was even greater than expected. We describe all the details below.
 
 
Parquet. For analytical tasks, so-called wide tables with many columns ...

Why do you need Splunk? Internet of things and industrial data

 
Today we want to talk about the Internet of Things (IoT) and the Industrial Internet of Things (IIoT), and about how Splunk relates to them.
 
SplunkBase in the Internet of Things section.
 
 

Splunk Industrial Asset Intelligence


 
 
 
In the spring of 201? Splunk announced the launch of Splunk Industrial Asset Intelligence (IAI), its first platform directly related to IIoT, intended for process automation engineers in industrial companies.
 
 
This solution is intended for companies in manufacturing, energy, transportation, oil and gas, and other industrial sectors.
 
 
Splunk IAI correlates data from automated control systems (ACS), sensors, SCADA systems and applications, which makes it easy to monitor equipment operation and diagnose problems in real time.
 
 
Since June 2? Splunk IAI has been in limited availability, and a full release is scheduled for autumn 2018.
 
 
The official ...

Neither GA nor YM. Avito's own clickstream

We collect more than two billion analytics events per day. Thanks to this we can learn many useful things: whether hearts are tapped more often than stars, at what hours people write more detailed descriptions, and in which regions users most often miss the green buttons.
 
The system for collecting and analyzing events can be called a clickstream. I'll tell you about the technical side of Avito's clickstream: how events are structured, how they are sent and delivered, analytics, and reports. Why want your own when there are Google Analytics and Yandex.Metrica, whose lives developers spoil ...

Deep Learning: Recognition of scenes and landmarks in images

Time to add to the collection of good Russian-language talks on machine learning! The piggy bank won't fill itself!
 
 
This time we get acquainted with a fascinating talk by Andrey Boyarov on scene recognition. Andrey is a research programmer working on computer vision at Mail.Ru Group.
 
 
Scene recognition is one of the most actively used areas of computer vision. This task is harder than the well-studied problem of object recognition: a scene is a more complex and less formalized concept, and its features are harder to identify. From the recognition of scenes ...

We are looking for speakers at Moscow Data Science Major

 
 
On September 1, Mail.Ru Group and the Open Data Science community will hold the largest Moscow Data Science meetup.
 
 
We will open the new academic and working year with a full day of sessions and networking!
 
Fill in the form to speak at the meetup. Please note that completing the form does not guarantee your participation. We will review all applications and send our decision to the email address you provide.
 
 
The meetup is aimed at Data Science professionals and enthusiasts. We do not accept marketing or advertising talks. It is important to us that the talks are relevant and useful to the ...

RabbitMQ - SQL Server

A week or two ago I saw a message on the RabbitMQ Users forum asking how to send messages from SQL Server to RabbitMQ. Since we work closely with this at Derivco, I left some suggestions there and also said that I was writing a blog post about how it can be done. That part of my message was not quite true, at least until now (sorry, Bro, I was very busy).
 
 
SQL Server is an awesome thing. It makes it very easy to put information into a database. Getting data out of the database with a query is just as easy. But getting newly updated or inserted data is ...

Why you should improve the training data, and how to do it

Hello!
 
 
We now have what you could call an almost new course: Data Scientist. Why almost? It grew out of the BigData course, but now has a much stronger focus on working with data, training, networking and so on. There are new teachers, a curriculum that is about twenty per cent new with the rest revised, and, as always, articles we found interesting in the context of the course, plus open lessons on the same topics.
 
 
Go!
 
 
 
 
...

RabbitMQ versus Kafka: application of Kafka in event-oriented applications

In the previous article we looked at the patterns and topologies used in RabbitMQ. In this part, we turn to Kafka and compare it with RabbitMQ to get an idea of their differences. Keep in mind that we will be comparing architectures of event-driven applications rather than data processing pipelines, although the line between these two concepts is rather blurred here. In general, this is more of a spectrum than a clear division. Our comparison simply focuses on the part of this spectrum associated with event-driven applications.
 