A review of interesting Big Data implementations at financial sector companies



 
Why this article?
 
 
This review examines cases of implementing and applying Big Data in real life, using "live" projects as examples. For some cases, the especially interesting ones in every sense, I venture to add my own comments.
 
 
The range of cases examined is limited to the examples published publicly on Cloudera's website.
 
 

What is "Big Data"


 
There is a running joke in technical circles that "Big Data" is data for which Excel 2010 on a powerful laptop is no longer enough. That is, if solving your problem requires more than 1 million rows per sheet or more than 1,000 columns, then congratulations: your data qualifies as "Big".
 
 
Among the more rigorous definitions, consider this one: "Big Data" is data sets so large and complex that traditional processing tools cannot handle them. The term usually describes data to which predictive analytics or other value-extraction methods are applied, and it rarely refers to data volume alone.
 
1. ICE (Intercontinental Exchange). Operator of the largest exchange trading platform


Reference to case
 
Link to the company
 
 
Company description:
 
 
ICE provides the largest trading platform for futures, stocks and options, along with clearing and data services. The group also owns the NYSE.
 
 
Theme of the project: # Infrastructure, # Compliance, # Data management.
 
 
The purpose of the project:
 
 
The exchange generates huge data sets in the course of its work, and using this data is critical for optimizing operations and meeting ever-growing market and customer demands. At some point the existing data lake infrastructure no longer allowed timely processing of new and existing data, and the problem of data fragmentation (data silos) emerged.
 
 
Project results:
 
 
The updated technology platform (Cloudera Enterprise Data Hub) gives internal and external users access to over 20 petabytes of real-time data (30 terabytes are added daily), improving both market-situation monitoring and Compliance monitoring. The legacy database was replaced with Apache Impala (part of CDH), which lets data consumers analyze the contents of the data lake.
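
Purely as an illustration (not from the case), here is a minimal sketch of how a data consumer might query such a data lake through Impala from Python using the impyla client; the host and table names are made up:

    # Minimal Impala query sketch; impyla client, hypothetical host/table names.
    from impala.dbapi import connect

    conn = connect(host='impala-gateway.example.com', port=21050)  # hypothetical host
    cur = conn.cursor()
    # Hypothetical table: intraday trades partitioned by date.
    cur.execute("""
        SELECT symbol, COUNT(*) AS trades, SUM(volume) AS total_volume
        FROM market.trades
        WHERE trade_date = '2017-06-01'
        GROUP BY symbol
        ORDER BY total_volume DESC
        LIMIT 10
    """)
    for symbol, trades, total_volume in cur.fetchall():
        print(symbol, trades, total_volume)

The point of Impala here is that such SQL runs directly against files in the lake, without first moving the data into a separate relational store.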
 
 
Comment:
 
 
The theme of the case is interesting and undoubtedly in demand: many trading venues generate real-time data streams that require operational analytics.

The case is purely infrastructural; there is not a word about analytics. Nothing is said about the project timeline or the difficulties of switching to the new technologies. It would be interesting to study the details and meet the participants.
 
 

2. Cartao Elo - producer and servicer of payment cards in Brazil


 
Reference to case
 
Link to the company
 
 
Company description:
 
 
Cartao Elo is a company that accounts for 11% of all payment cards issued in Brazil, handling more than one million transactions a day.
 
 
Theme of the project: # Infrastructure, # Marketing, # Business development.
 
 
The purpose of the project:
 
 
The company set out to bring customer relationships to the level of personalized offers, and even to anticipate a customer's wishes far enough in advance to offer an additional product or service in time. This requires an analytical platform capable of real-time processing of data from sources such as customers' mobile geolocation, weather, traffic, social networks, payment card transaction history, and the marketing campaigns of shops and restaurants.
 
 
Project results:
 
 
The company implemented a data lake on the Cloudera platform which, in addition to transaction data, stores other "unstructured" information: social network data, customers' mobile geolocation, weather and traffic. The data lake holds 7 TB of information, with up to 10 GB added daily, and drives personalized product offers to customers.
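
To make the idea concrete, here is a toy PySpark sketch (my own, not from the case; paths and column names are invented) of the kind of join such a data lake enables, combining card transactions with device geolocation to select candidates for a nearby partner offer:

    # Toy sketch: join card transactions with geolocation to pick offer candidates.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("offers-sketch").getOrCreate()

    tx = spark.read.parquet("/datalake/transactions")    # card transactions (hypothetical path)
    geo = spark.read.parquet("/datalake/geolocation")    # latest device locations (hypothetical path)

    # Customers who spent on restaurants recently and are near a partner venue now.
    candidates = (
        tx.where(F.col("mcc_category") == "restaurants")
          .groupBy("customer_id").agg(F.sum("amount").alias("recent_spend"))
          .join(geo, "customer_id")
          .where(F.col("distance_to_partner_km") < 1.0)
    )
    candidates.select("customer_id", "recent_spend").show()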
 
 
Comment:
 
 
Personalized product offers are a very popular topic, especially in the Russian financial services market; many are working on it. It is unclear how the project managed to process transaction data (plus social networks and geolocation) in real time. Data from the core transaction processing system would have to land in the data lake instantly; the case is modestly silent on this, although it is very difficult given the volumes involved and the card data protection requirements. The topic of "Big Data ethics" is not addressed either: when a person receives an offer and realizes from it that they are being "watched", they may intuitively refuse the product out of sheer irritation, and perhaps even switch payment cards. My conclusion: most likely the data lake holds roughly 95% transaction data and 5% social network and other data, and the models are built on that. Personally, I do not believe in real-time product offers.
 
 

3. Bank Mandiri. The largest bank in Indonesia


 
Reference to case
 
Link to the company
 
Company description:
 
 
Bank Mandiri is the largest bank in Indonesia.
 
 
Theme of the project: # Infrastructure, # Marketing, # Business development.
 
 
The purpose of the project:
 
 
Gain a competitive advantage by implementing a technology solution that builds personalized product offers based on data, and reduce the total cost of IT infrastructure along the way.
 
 
Project results:
 
 
As the case puts it, after implementing the data-driven analytical solution on Cloudera, IT infrastructure costs were reduced by 99%! Customers receive targeted product offers, which improves the results of cross-sell and upsell campaigns. Campaign costs dropped significantly thanks to more targeted modeling. The data volume of the solution is 13 TB.
 
 
Comment:
 
 
The case seems to hint that, following the implementation, the company completely abandoned relational database infrastructure for modeling product offers, and thereby cut IT costs by as much as 99%.
 
 
Yet the data sources for the technology solution remain 27 relational databases, customer profiles, payment card transaction data and (predictably) data from social networks.
 
 

4. MasterCard. International payment system


 
Reference to case
 
Link to the company
 
 
Company description:
 
 
MasterCard earns money not only as a payment system uniting 2,000 financial institutions in 210 countries, but also as a data provider: it helps financial institutions (payment system participants) assess the credit risk of counterparties (merchants) when reviewing their applications for acquiring services.
 
 
Theme of the project: # Fraud, # Data management.
 
 
The purpose of the project:
 
 
Help its clients, financial institutions, identify counterparties that previously became insolvent and are trying to return to the payment system under a changed identity (name, address or other attributes). For this purpose MasterCard created the MATCH database (Member Alert to Control High-risk Merchants), which stores the history of "hundreds of millions" of fraudulent businesses. Acquirers in the MasterCard payment system make up to a million queries a month against MATCH.
 
 
The competitive advantage of this product comes from short query response times and the quality of match detection. As the volume and complexity of the historical data grew, the existing relational DBMS ceased to meet these requirements against the backdrop of the growing number and sophistication of client queries.
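
To illustrate the matching problem itself (not MasterCard's actual algorithms, which the case does not disclose), here is a deliberately naive Python sketch that screens a new merchant application against a high-risk list using fuzzy string similarity; the records, weights and threshold are made up:

    # Naive merchant screening sketch: fuzzy-match an applicant against a blacklist.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    high_risk = [
        {"name": "Acme Trading LLC", "address": "12 Main St, Springfield"},
        {"name": "Global Goods Inc", "address": "99 Ocean Ave, Miami"},
    ]
    applicant = {"name": "ACME Traiding L.L.C.", "address": "12 Main Street, Springfield"}

    for record in high_risk:
        score = (0.6 * similarity(applicant["name"], record["name"])
                 + 0.4 * similarity(applicant["address"], record["address"]))
        if score > 0.8:
            print("possible match:", record["name"], round(score, 2))

A real system must do this at low latency over hundreds of millions of records, which is exactly why a single relational DBMS eventually stops being enough.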
 
 
Project results:
 
 
A distributed storage and data processing platform (CDH) was implemented, providing dynamic scaling and management of both the load and the complexity of the matching algorithms.
 
 
Comment:
 
 
The case is interesting and clearly in practical demand. The important and labor-intensive work of building an infrastructure for access control and security is mentioned thoughtfully. Nothing is said about the timeline of the migration to the new platform. Overall, a very practical case.
 
 

5. Experian. One of the three largest credit bureaus in the world


 
Reference to case
 
Link to the company
 
 
Company description:
 
 
Experian is one of the world's three largest credit bureaus (the so-called "Big Three"). It stores and processes credit histories for roughly 1 billion borrowers (individuals and legal entities); in the US alone it holds credit files on 235 million individuals and 25 million businesses.
 
 
In addition to credit scoring services proper, the company sells marketing support services, online access to borrowers' credit histories, and protection against fraud and identity theft.
 
 
The competitive advantage of its marketing support services (Experian Marketing Services, EMS) is based on the company's main asset: accumulated data on individual and corporate borrowers.
 
 
Theme of the project: # Infrastructure, # Marketing, # Business development.
 
 
The purpose of the project:
 
 
EMS helps marketers gain unique access to their target audience by modeling it with accumulated geographic, demographic and social data. The goal is to apply the accumulated (big) data on borrowers correctly to marketing campaign modeling, including near-real-time data such as "recent purchases" and "social network activity".
 
 
Accumulating and using such data requires a technology platform that can quickly process, store and analyze these diverse data.
 
 
Project results:
 
 
After several months of research, the choice fell on in-house development: the Cross Channel Identity Resolution (CCIR) engine, built on HBase, a non-relational distributed database. Experian loads data into the CCIR engine via ETL scripts from multiple in-house mainframes and relational databases such as IBM DB2, Oracle, SQL Server and Sybase IQ.
 
 
At the time the case was written, Hive stored more than 5 billion rows of data, with 10-fold growth expected in the near future.
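
As an illustration of the data model (table and column names are mine, not Experian's), here is a minimal sketch of the kind of key-value identity lookups an HBase-backed engine like CCIR performs, using the happybase Python client (which talks to HBase via its Thrift server):

    # Minimal HBase sketch: store and resolve cross-channel identifiers by person key.
    import happybase

    connection = happybase.Connection('hbase-thrift.example.com')  # hypothetical host
    table = connection.table('identity_links')                     # hypothetical table

    # Store several channel identifiers under one person key.
    table.put(b'person:12345', {
        b'ids:email_hash': b'3f2c9a...',
        b'ids:cookie_id':  b'c-98765',
        b'ids:postal_key': b'94105-221B',
    })

    # Resolving a person is a single row read, not a multi-table relational join.
    row = table.row(b'person:12345')
    print(row[b'ids:cookie_id'])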
 
 
Comment:
 
 
What wins you over is the specificity of the case, which is very rare:
 
 
- the number of cluster nodes (35),
 
- project timeline (<6 months),
 
- the architecture of the solution is presented concisely and competently:
 
 
Hadoop components: HBase, Hive, Hue, MapReduce, Pig
 
Cluster servers: HP DL380 (a notch above commodity hardware)
 
Data Warehouse: IBM DB2
 
Data Marts: Oracle, SQL Server, Sybase IQ
 
- load characteristics (100 million records processed per hour; performance increased by 500%).
 
p.s.
 
 
The advice to "be patient and practice a lot" before embarking on industrial-grade development on Hadoop/HBase made me smile!
 
 
A very well presented case! I recommend reading it in the original. A lot is hidden between the lines for people in the know!
 
 

 

6. Western Union. Leader of the international money transfer market


 
Reference to case
 
Link to the company
 
 
Company description:
 
 
Western Union is the largest operator in the international money transfer market. The company began by providing telegraph services (Samuel Morse, the inventor of Morse code, stood at its origins).
 
 
Through its remittance transactions, the company receives data on both senders and recipients. On average it processes 29 transfers per second, with a total volume of 82 billion dollars (2013 data).
 
 
Theme of the project: # Infrastructure, # Marketing, # Business development
 
 
The purpose of the project:
 
 
Over the years the company has accumulated large volumes of transaction data, which it plans to use to improve product quality and strengthen its competitive position in the market.
 
 
Implement a platform for consolidating and processing structured and unstructured data from multiple sources (Cloudera Enterprise Data Hub). Unstructured data here includes, in the author's view, sources such as clickstream data (how customers navigate the company's website), sentiment data (natural language processing of chat bots and customer surveys about product and service quality), social network data, etc.
 
 
Project results:
 
 
The data hub was created and is populated with structured and unstructured data via streaming (Apache Flume), batch loading (Apache Sqoop) and good old ETL (Informatica Big Data Edition).
 
 
The data hub is a single repository of customer data that makes it possible to create verified and precise product offers. For example, in San Francisco WU forms separate targeted offers for:
 
 
- the Chinese community, for clients of the local Chinatown branches,

- natives of the Philippines living in the Daly City area,

- Latin Americans and Mexicans from the Mission District.
 
 
For example, the timing of an offer is tied to a favorable exchange rate against the US dollar in the countries of origin of these national groups.
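
Purely to illustrate the described trigger logic (the corridors, rates and thresholds below are entirely made up), a toy Python sketch:

    # Toy trigger: flag corridors where the dollar currently buys more than a threshold.
    RATE_THRESHOLDS = {"USD/PHP": 56.0, "USD/MXN": 18.5, "USD/CNY": 7.1}

    def favorable_corridors(current_rates: dict) -> list:
        return [pair for pair, rate in current_rates.items()
                if rate >= RATE_THRESHOLDS.get(pair, float("inf"))]

    today = {"USD/PHP": 56.4, "USD/MXN": 18.2, "USD/CNY": 7.2}
    for corridor in favorable_corridors(today):
        print("trigger targeted offer for corridor:", corridor)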
 
 
Comment:
 
 
Cluster characteristics: 64 nodes (Cisco Unified Computing System servers) with plans to grow to 100 nodes; data volume is 100 TB.
 
 
Special emphasis is placed on security and user access control (Apache Sentry and Kerberos), which speaks to a thoughtful implementation and real practical use of the results.
 
 
Overall, I suspect the project is not yet running at full capacity: the data accumulation phase is still in progress, and there are even some attempts to develop and apply analytical models, but the claims about intelligent and systematic use of the data in model development look greatly exaggerated.
 
 

7. Transamerica. A group of life insurance and asset management companies


 
Reference to case
 
Link to the company
 
 
Company description:
 
 
Transamerica is a group of insurance and investment companies headquartered in San Francisco.
 
 
Theme of the project: # Infrastructure, # Marketing, # Business development
 
 
The purpose of the project:
 
 
Because of the variety of businesses, a customer's data may be present in the accounting systems of several group companies, which complicates analytical processing.

Implement an analytical marketing platform (Enterprise Marketing & Analytics Platform, EMAP) that can store both the group companies' own customer data and customer data from third-party suppliers (for example, the inevitable social networks), and use this platform to form verified product offers.
 
 
Project results:
 
 
As stated in the case, EMAP loads and analyzes the following data (a toy consolidation sketch follows the list):
 
 
- own client data
 
- CRM customer data
 
- Data on past insurance solicitations (solicitation data)
 
- Data on clients from third-party partners (commercial data).
 
- Logs from the Internet portal of the company
 
- Social media data.
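
A toy pandas sketch of the consolidation problem itself (keys and columns invented; the case does not describe the actual matching logic): the same customer appears in several group companies' systems and must be merged into a single profile before modeling:

    # Toy consolidation: outer-join two companies' customer files on a shared key.
    import pandas as pd

    life = pd.DataFrame({"cust_key": ["a1", "b2"], "life_policy": [True, True]})
    invest = pd.DataFrame({"cust_key": ["b2", "c3"], "aum_usd": [50_000, 120_000]})

    # One row per customer, whichever company (or both) knows them.
    profile = life.merge(invest, on="cust_key", how="outer")
    print(profile)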
 
 
Comment:
 
 
- Data is loaded using Informatica BDM alone, which is surprising given the variability of the sources and the diversity of the architecture.
 
- The "Big Data" scale is 30 TB, which is very modest (especially given the mention of data on 210 million customers at their disposal).
 
- Nothing is said about the cluster characteristics, the project timeline, or the difficulties encountered during implementation.
 
 

8. mBank. The 4th largest bank in Poland


 
Link to the company
 
Reference to case
 
 
Company description:
 
 
mBank was founded in 1986; today it has 5 million retail and 20 thousand corporate clients across Europe.
 
 
Theme of the project: # Infrastructure.
 
 
The purpose of the project:
 
 
The existing IT infrastructure could not cope with ever-increasing data volumes. Because of delays in data integration and systematization, data scientists were forced to work with data in T-1 mode (i.e., yesterday's data).
 
 
Project results:
 
 
A data warehouse was built on the Cloudera platform; 300 GB of information from various sources is loaded into it daily.
 
 
Data sources:
 
 
• Flat files of source systems (mainly OLTP systems)
 
• Oracle DB
 
• IBM MQ
 
 
The time needed to integrate data into the warehouse decreased by 67%.
 
ETL Tool - Informatica
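
The real ETL here is Informatica, but purely as a sketch of the daily batch load described (flat files from OLTP systems into a Hadoop warehouse), something like this PySpark job would do the equivalent; paths and table names are made up:

    # Sketch of a daily batch load of OLTP flat files into a Hive-backed warehouse.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder.appName("daily-load")
             .enableHiveSupport().getOrCreate())

    raw = spark.read.csv("/landing/oltp/2017-06-01/*.csv",
                         header=True, inferSchema=True)
    raw = raw.withColumn("load_date", F.lit("2017-06-01"))
    raw.write.mode("append").partitionBy("load_date").saveAsTable("dwh.transactions")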
 
 
Comment:
 
 
One of the rare cases of building a bank data warehouse on Hadoop technology. There are no pompous phrases "for shareholders" about large-scale reduction of infrastructure TCO or capturing new markets through pinpoint analysis of customer data. A pragmatic and believable case from real life.