The main project I’m working on at the moment is the design and development of an advanced analytics framework for customer analytics (I will keep the concrete application generic and describe its general purpose and features instead).
The idea is to have a single place to store data about user behavior, coming from different data sources that each tell something about the user: e-commerce transactions on the website/app, clickstreams, social media information, and so on. Data will be collected both in real time and through periodic batch loads. I’ll refer to this store as the “data lake”. On top of the data lake, data applications will be developed to provide added-value services to both users and the platform owner.
The final product will:
- enable exploratory analysis and monitoring
- extract insights by analyzing multiple data sources at once
- provide a unified environment for model training, testing and deployment
- support the development of data applications based on predictive analytics and machine learning techniques. These applications will enhance the current capabilities of the platform (e.g. user churn prediction, user clustering and segmentation, uplift models)
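As one concrete illustration of the kind of data application listed above, a basic churn label can be derived directly from stored user activity. The function below is a minimal sketch, not the project's actual model; the 30-day inactivity window is an arbitrary assumption chosen for the example.

```python
from datetime import datetime, timedelta

def is_churned(last_action_at, as_of, inactivity_window=timedelta(days=30)):
    """Label a user as churned if no action fell within the inactivity window.

    A label like this, computed per user, is the typical training target
    for a churn prediction model.
    """
    return (as_of - last_action_at) > inactivity_window
```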
Data Structure: from atoms to aggregations
Incoming data will be converted into a unified data structure based on the idea of actions, in order to promote consistency across data sources. Actions will represent the most granular data stored (at least at the time of writing). Aggregated data structures will then be built from actions to support data exploration, data visualization and data applications.
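To make the atoms-to-aggregations idea concrete, here is a minimal sketch of what a unified action record and a roll-up over it could look like. The field names and the (user, day, action type) aggregation key are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from collections import defaultdict

@dataclass(frozen=True)
class Action:
    """The most granular unit stored in the data lake: one user action."""
    user_id: str
    source: str        # e.g. "ecommerce", "clickstream", "social"
    action_type: str   # e.g. "purchase", "page_view", "like"
    timestamp: datetime
    properties: dict   # source-specific payload, kept as-is

def daily_counts(actions):
    """Roll actions up into (user, day, action_type) -> count aggregates."""
    counts = defaultdict(int)
    for a in actions:
        counts[(a.user_id, a.timestamp.date(), a.action_type)] += 1
    return dict(counts)
```

Because every source is normalized into the same shape, the same aggregation code works regardless of whether an action came from the e-commerce feed or the clickstream.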
Architecture: horizontally scalable, multi-tenant platform
The architecture will support a multi-tenant platform, meaning that multiple operators/business owners can be integrated in a SaaS fashion. The platform will give each business owner access only to its own data, whereas we will retain the ability to perform analysis on the entire data set.
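A minimal sketch of how that per-tenant isolation could be enforced at the query layer (the helper and field names are assumptions for illustration, not the platform's actual API):

```python
def scoped_query(tenant_id, base_filter=None):
    """Pin a MongoDB-style query filter to a single tenant.

    Every query issued on behalf of a business owner is AND-ed with its
    tenant_id, so one operator can never read another operator's data;
    internal analysts would bypass this wrapper to query across tenants.
    """
    query = dict(base_filter or {})
    query["tenant_id"] = tenant_id  # set server-side, never client-supplied
    return query
```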
Actions will be stored in Hadoop; real-time aggregations will be performed on them and the results stored into Parquet/Impala tables for BI and data exploration, as well as into an operational database (MongoDB has been chosen for now) that will support real-time data applications.
Spark is the data processing engine that will take care of the data-crunching process, both for aggregating data into MongoDB and Parquet and for data exploration. Hadoop/MapReduce will be used for batch ETL jobs and for data historization.
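To show the shape of those batch aggregation jobs, here is a toy map/reduce pass in plain Python. It is a pedagogical stand-in for the actual Spark/MapReduce jobs, with a hypothetical record layout: the map phase emits a (key, 1) pair per action and the reduce phase sums them, which is exactly the form of the count aggregates that would land in Parquet or MongoDB.

```python
from collections import defaultdict

def map_phase(action):
    """Map step: emit one (key, 1) pair per action record."""
    yield ((action["user_id"], action["action_type"]), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, n in pairs:
        totals[key] += n
    return dict(totals)

def run_job(actions):
    """Run the toy job end to end: map every action, then reduce."""
    pairs = (pair for a in actions for pair in map_phase(a))
    return reduce_phase(pairs)
```

In the real platform the same map and reduce logic would be distributed across the cluster by Spark or MapReduce rather than run in a single process.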