Spark + Cassandra, All You Need to Know: Tips and Optimizations
Introduction
In this article, I will discuss the implications of running Spark with Cassandra compared to the most common use case: using a deep storage system such as S3 or HDFS.
The goal is to understand the internals of Spark and Cassandra so you can write your code as efficiently as possible and really take advantage of the power of these two great tools.
I will give you some tips regarding Spark tuning and Cassandra optimizations so you can maximize performance and minimize costs. I assume you already have basic knowledge of Spark and Cassandra.
First, let’s review the different deployment options you have when running Spark and Cassandra together.
Spark Cassandra Cluster
In general, we can think of two broad types of Spark clusters:
- Commodity Spark Clusters: This type of cluster is a cost-effective way to process large amounts of data. It uses cheap but slow storage systems running on low-cost hardware, and the idea is to take advantage of Spark's parallelism to process big data efficiently. This is by far the most common setup, both on premises using HDFS and in the cloud using S3 or another deep storage system.
- High Performance…