PM1: Architecting A Big Data Platform

What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop, Spark, and big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

By tracing the flow of data from source to output, we'll explore the options and considerations for components, including:

Acquisition: from internal and external data sources
Ingestion: offline and real-time processing
Storage
Analytics: batch and interactive
Providing data services: exposing data to applications
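
As a rough illustration of the flow above, the stages can be sketched as a chain of small functions, each handing its output to the next. This is only a toy, in-memory sketch: every function name, record field, and value is hypothetical, and in production each stage would be backed by a real component (for example a message queue for ingestion or a distributed file system for storage).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def acquire() -> Iterator[dict]:
    """Acquisition: yield raw records from hypothetical internal and external sources."""
    yield {"source": "internal", "user": "a", "amount": "10"}
    yield {"source": "external", "user": "b", "amount": "5"}
    yield {"source": "internal", "user": "a", "amount": "7"}


def ingest(records: Iterable[dict]) -> Iterator[dict]:
    """Ingestion: parse and normalize each record as it arrives."""
    for record in records:
        yield {**record, "amount": int(record["amount"])}


def store(records: Iterable[dict]) -> List[dict]:
    """Storage: persist records (here, just an in-memory list)."""
    return list(records)


def analyze(table: List[dict]) -> Dict[str, int]:
    """Analytics: a batch aggregation, total amount per user."""
    totals: Dict[str, int] = defaultdict(int)
    for row in table:
        totals[row["user"]] += row["amount"]
    return dict(totals)


def serve(totals: Dict[str, int], user: str) -> int:
    """Data services: expose a result to an application by key."""
    return totals.get(user, 0)


# Trace the data from source to output.
totals = analyze(store(ingest(acquire())))
print(serve(totals, "a"))
```

The point of the sketch is the shape, not the contents: because each stage only depends on the output of the one before it, a component can be swapped out without touching the others.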

We'll also give advice on:

Tool selection
The function of the major Hadoop components and other big data technologies such as Spark and Kafka
Integration with legacy systems

John Akred likes to help organizations become more data driven. Mr. Akred has over 15 years of experience in advanced analytical applications and analytical system architecture. He is a recognized expert in applied business analytics, machine learning, predictive analytics, and operational data mining. He has deep expertise in applying architectural approaches such as distributed non-relational data stores (NoSQL), stream processing, in-database analytics, event-driven architectures, and specialized appliances to real-time scoring, real-time optimization, and similar applications of analytics at scale. John received a BA in Economics from the University of New Hampshire and an MS in Computer Science, focused on distributed systems, from DePaul University.

A leading expert on big data architecture and Hadoop, Stephen O'Sullivan brings 20 years of experience to creating scalable, high-availability data and application solutions. A veteran of WalmartLabs, Sun, and Yahoo!, Stephen leads data architecture and infrastructure.

An Apache Cassandra committer and PMC member, Gary specializes in building distributed systems. Recent experience includes creating an open source high-volume metrics processing pipeline and building out several geographically distributed API services in the cloud.