Yellowbrick shows that its data warehouse is not smoke and mirrors


Founded in 2014, with its first product released three years later, Yellowbrick has specialized in the high-performance end of the data warehousing spectrum, which includes analytics on real-time data. It cites benchmarks showing the ability to ingest up to 10 terabytes of data per hour. Until now, it has done so through a proprietary hardware architecture that includes specialized FPGA chips, which made it sound to us like the second coming of Netezza appliances. While impressed with its performance stats, we were initially leery about how Yellowbrick’s market success would scale when defined by custom hardware.

But over the past year, the company has started ramping up its Yellowbrick Cloud Data Warehouse service, which instead leverages the specialized hardware offered by providers like AWS, which have much more buying power than a niche provider like Yellowbrick alone. The early implementation of the cloud service is still on a traditional data warehouse architecture where storage is directly attached to compute. But this week, at its first virtual customer summit, Yellowbrick is taking the wraps off a new cloud-native architecture, branded as “Andromeda optimized” instances, that is now in private preview. It first announced the instances earlier this month, and is diving into the dirty details this week.

Andromeda will run on a modern cloud-native architecture, with compute and storage separated, deployed as microservices on Kubernetes clusters. Yellowbrick positions Andromeda as a distributed cloud, meaning that it can deploy on K8s clusters running on premises, in the public cloud, and/or at the edge. Let’s bring out the inner geek and take a stroll down the yellow brick road.

YELLOWBRICK’S ANDROMEDA STRAIN

Andromeda builds on several pillars of Yellowbrick’s original architecture, which essentially turns everything from the OS kernel to file system, networking, and storage layer inside out.

Processing and storage are tiered. Yellowbrick makes use of NVMe flash as primary storage and Level 3 cache in the processor, bypassing disk storage for obvious reasons, but also DRAM for several surprising reasons. Admittedly, using DRAM as a buffer for typical operations such as joins, sorts, and aggregates is fast. However, when analyzing large volumes of data, loading and unloading data to and from memory adds latency. Furthermore, the performance of modern NVMe flash and the underlying network backplane more than compensates for giving up DRAM’s speed, while also avoiding the overhead of loading and unloading data from an in-memory buffer.

Yellowbrick has replaced several pillars of the stack wholesale. It utilizes its own clustered file system to protect against data loss, and it bypasses some of the Linux kernel because of the limitations inherent in managing multiple threads and potential glitches in memory management (e.g., data not being buffered in the most optimal place). Yellowbrick instead substitutes its own kernel that eliminates a number of high-overhead operations such as context switching, where the CPU switches between different processes or threads. While the Linux stack is still responsible for collecting logs and statistics, core processing of threads, memory management, device drivers, and networking go over to the Yellowbrick kernel.

Yellowbrick is hardly alone in viewing the Linux kernel as a bottleneck. RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE) is a standard protocol that enables direct access to memory while bypassing the OS. Oracle has tapped RoCE as a pillar of its latest X8M generation of Exadata systems to turbocharge performance, along with related measures such as implementing faster 100 GbE networking. Significantly, Yellowbrick will also be supporting RDMA as one of the protocols for Andromeda.

Query execution relies on a hybrid architecture of column and row store that should sound familiar to IBM, Oracle, and MariaDB customers. But while most implementations of row and column were designed for databases handling transaction processing along with analytics, Yellowbrick utilizes the row store for a strictly analytic purpose: because row stores handle inserts much faster than column stores, rows are used for ingesting real-time streams from Kafka or other change-data-capture (CDC) feeds. The row store is log-oriented, enabling new rows to be appended in real time on highly mirrored volumes, such as EBS in AWS or replicated SSD instances on other clouds. Periodically, row updates are fed to the column store where the bulk of the data resides. As rows are written to the column store, they are partitioned into small blocks stored across multiple shards, with related data optionally clustered by keys and with highly detailed indexes. As with cloud services such as Oracle MySQL Data Warehouse or MariaDB SkySQL, the query optimizer.
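
For readers who want to picture that ingest path, here is a minimal Python sketch of the general pattern: an append-only row log that is periodically flushed into small column blocks spread across shards. It is only a toy illustration under our own assumptions, not Yellowbrick's implementation. The class name HybridStore, its methods, the integer clustering key, and the modulo-based shard assignment are all hypothetical.

```python
from collections import defaultdict


class HybridStore:
    """Toy model (hypothetical): a row log flushed into sharded column blocks."""

    def __init__(self, columns, num_shards=4, block_rows=2):
        self.columns = list(columns)
        self.num_shards = num_shards      # assumed shard count
        self.block_rows = block_rows      # assumed maximum rows per column block
        self.row_log = []                 # append-only row store used for ingestion
        self.shards = defaultdict(list)   # shard id -> list of column blocks

    def ingest(self, row):
        # Appending a whole row is cheap: one write to the end of the log.
        self.row_log.append(tuple(row[c] for c in self.columns))

    def flush(self, key):
        # Move logged rows into the column store, sharded by an integer key.
        key_idx = self.columns.index(key)
        by_shard = defaultdict(list)
        for row in self.row_log:
            by_shard[row[key_idx] % self.num_shards].append(row)
        for shard_id, rows in by_shard.items():
            rows.sort(key=lambda r: r[key_idx])  # cluster related rows by key
            for start in range(0, len(rows), self.block_rows):
                chunk = rows[start:start + self.block_rows]
                # Pivot the chunk: one list of values per column.
                block = {c: [r[i] for r in chunk]
                         for i, c in enumerate(self.columns)}
                self.shards[shard_id].append(block)
        self.row_log.clear()

    def scan(self, column):
        # Read one column from the flushed blocks plus any unflushed rows.
        idx = self.columns.index(column)
        values = [v for blocks in self.shards.values()
                  for block in blocks for v in block[column]]
        values.extend(r[idx] for r in self.row_log)
        return values
```

In this sketch, a query on a single column touches only that column's lists in each block, while freshly ingested rows remain visible by also reading the unflushed log. That is the basic trade-off the hybrid design exploits: fast appends on the row side, fast scans on the column side.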

WHY IS YELLOWBRICK SPILLING THE BEANS NOW?

There’s a method to the madness here. Before this, Yellowbrick was only available on custom hardware that seemed a throwback to the days of expensive, proprietary data warehouse appliances. Yes, we’re thinking of the good old days of Netezza and Teradata. As Oracle showed with its implementation of RoCE in the X8M generation of Exadata, in many parts of the stack, off-the-shelf technology has more than caught up. For instance, 100 GbE is now much faster than common strains of InfiniBand, which used to be considered the premium networking tier. And more than five years ago, Oracle converted its Exadata line to an all-Intel, standard 2-socket CPU architecture. And initially, the cloud was all about throwing large masses of commodity hardware at a problem.

But in recent years, each of the major cloud providers has taken an active role in developing its own custom, optimized instances, and this is the wave that Yellowbrick is poised to leverage as it rolls out its next-generation distributed cloud. Yellowbrick may not have as much control over hardware specs as it defers to cloud providers, but it will be publishing recipes with recommended instance types to guide its customers, starting with the instance types on which it will be hosting its managed cloud service.


