Spark + Cassandra, All You Need to Know: Tips and Optimizations
Introduction
In this article, I will discuss the implications of running Spark with Cassandra compared to the most common use case: using a deep storage system such as S3 or HDFS.
The goal is to understand the internals of Spark and Cassandra so you can write your code as efficiently as possible and really take advantage of the power of these two great tools.
I will give you some tips regarding Spark tuning and Cassandra optimizations so you can maximize performance and minimize costs. I assume you already have basic knowledge of Spark and Cassandra.
First, let’s review the different deployment options you have when running Spark and Cassandra together.
Spark Cassandra Cluster
In general, we can think of two broad types of Spark clusters:
- Commodity Spark Clusters: This type of cluster is a cost-effective way to process large amounts of data. It uses cheap but slow storage systems running on low-cost hardware, and the idea is to take advantage of Spark's parallelism to process big data efficiently. This is by far the most common setup, both on premises using HDFS and in the cloud using S3 or another deep storage system.
- High Performance…