Again…Spark will hammer a relational DB you connect to if you let it.
Recently I wrote about solving a problem with Spark hammering a database.
That initial problem occurred when trying to download too much data too quickly from a relational DB. Spark always wants to do things fast and can overwhelm a DB with too many connections.
But what about a similar problem while writing to the database? There are three main places you can alter configuration to solve this issue (or at least mitigate and alleviate).
Again…spark is designed to do a lot of operations very fast, so it will hit the DB as hard as it can without thinking twice and doesn’t offer any direct settings for throttling JDBC connections.
However, there are a couple of settings you can pass that will artificially slow things.
batchsize will force Spark to spend more time sending data over the network as this limits how many rows Spark will send to the DB with each call. Usually you will see recommendations to increase it above the default of 1000 in order to increase the speed at which Spark writes to the DB. But here we want it lowered of course.
numPartitions will limit how Spark chops up the work between all the workers/CPUs it has in the cluster. If you set this to a higher number you can end up with 100+ simultaneous connections to the DB. By setting it to 1, we can keep that from happening.
rewriteBatchedInserts is just a general postgres performance optimization flag. I couldn’t find out much about it except that it “provides 2-3x performance improvement”.
In my case I was running on AWS EMR. There are a couple of things you can optionally do here to help your situation.
- Shrink the number of workers in your cluster (Not an option for me)
- Shrink the size of the machines you are using (Not an option for me)
- Switch to a machine type that is…say…optimized for memory but weaker on networking
This is something in my case I didn’t get time to experiment with, but will be testing in the coming weeks. Just like changing some of the configs above, this is kind of a hack in that we aren’t able to directly throttle what Spark is doing, but instead are merely constraining it’s resources to force it to slow down.
There are a number of settings on the database side that you can do to limit the number of connections and how much data those connections can send.
In my case I didn’t have the option of tweaking the DB. I was also just more intrigued about what I could do from the Spark side to force it to slow down given that there are many other situations (APIs as one example) where you might want to throttle Spark. Exploring that was interesting to me.
Also, when doing an UPDATE or DELETE from Spark, you should NEVER update the table directly, but instead INSERT to a temporary table where you have disabled indexes, replication, etc first. Then have Spark Update/Delete the main table using the rows from the temp table.
Throttling Spark is like trying to rein in a horse.
By default Spark really doesn’t want you to do it. That makes sense philosophically when you think about the job Spark is meant to do vs what we typically ask other Python application to do.
I’ve only had a couple of brief opportunities recently to dive into this topic and I don’t think I have done it justice yet.
I’m confident there is a good best practice or design pattern for this.
Perhaps stage the data as parquet/avro files in an S3 bucket and have a separate non-spark process consume that data for API/DB operations?
Maybe write out the dataframe in chunks to HDFS and have a separate psycopg2/psql script running to insert into the database while Spark continues to work on other jobs?
I’m sure I’ll have more concrete things to say about this topic in the future.
Links I found useful