Spark + Cassandra, All You Need to Know: Tips and Optimizations


Introduction

In this article, I will discuss the implications of running Spark with Cassandra compared to the most common use case, which is using a deep storage system such as S3 or HDFS.

The goal is to understand the internals of Spark and Cassandra so you can write your code as efficiently as possible and really take advantage of the power of these two great tools.

I will give you some tips regarding Spark tuning and Cassandra optimizations so you can maximize performance and minimize costs. I assume you already have basic knowledge of Spark and Cassandra.

First, let’s review the different deployment options you have when running Spark and Cassandra together.

Spark Cassandra Cluster

In general, we can think of two broad types of Spark clusters:

  • Commodity Spark Clusters: This type of cluster is a cost-effective way to process large amounts of data. It uses cheap but slow storage running on low-cost hardware. The idea is to take advantage of Spark's parallelism to process big data efficiently. This is by far the most common setup, both on premises using HDFS and in the cloud using S3 or other deep storage systems (see the sketch after this list).
  • High Performance
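
To make the contrast concrete, here is a minimal sketch in Scala of the two read paths: Parquet from deep storage (S3) versus a table read through the DataStax spark-cassandra-connector. The bucket, keyspace, and table names are placeholders for illustration, not from the original article.

```scala
import org.apache.spark.sql.SparkSession

object DeepStorageVsCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deep-storage-vs-cassandra")
      // Cassandra contact point; only needed for the connector read below.
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Commodity setup: read Parquet files from cheap deep storage (S3/HDFS).
    // "my-bucket" and the path are hypothetical.
    val fromS3 = spark.read.parquet("s3a://my-bucket/events/")

    // Cassandra setup: read a table through the spark-cassandra-connector,
    // which maps Cassandra token ranges to Spark partitions.
    // Keyspace and table names are hypothetical.
    val fromCassandra = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "analytics")
      .option("table", "events")
      .load()

    fromS3.printSchema()
    fromCassandra.printSchema()
    spark.stop()
  }
}
```

Both reads return a DataFrame, so downstream transformations look identical; what differs is where the partitions come from and how much of the work (for example, filter pushdown) the storage layer can do for you.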
