I was in a situation a while back where I needed to make API calls to clean up some data in a dataframe. Here is the approach I took:
- Get all the distinct values for the particular column out of the df
- Make multithreaded and batched calls to the API (important for dealing with a LOT of data)
- Add a new column to the df with the results
Here is what a sample of that looks like in code.
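A minimal sketch of those three steps (the `clean_batch` API client and the `city` column are stand-ins, not the real names; the Spark glue is in comments so the batching logic stands alone):

```python
from concurrent.futures import ThreadPoolExecutor

def clean_batch(values):
    # Hypothetical stand-in for the real bulk cleanup API:
    # one network request per batch, returning {raw: cleaned}.
    return {v: v.strip().lower() for v in values}

def clean_all(values, batch_size=100, max_workers=8):
    """Step 2: batched, multithreaded calls over the distinct values."""
    batches = [values[i:i + batch_size]
               for i in range(0, len(values), batch_size)]
    lookup = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(clean_batch, batches):
            lookup.update(result)
    return lookup

# Step 1: pull the distinct values for the column out of the df.
# distinct_vals = [r["city"] for r in df.select("city").distinct().collect()]
#
# Step 3: map the results back onto the df as a new column via a UDF.
# from pyspark.sql import functions as F
# lookup = clean_all(distinct_vals)
# clean_udf = F.udf(lookup.get)
# df = df.withColumn("city_clean", clean_udf("city"))
```

The key move is that the API only ever sees the distinct values, batched, instead of one call per row.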
Absolutely. Any time you are pulling data out of HDFS, out of the JVM, and into Python, you incur a cost. Here we are using a UDF (User Defined Function) to look up the dictionary results for every row in the dataframe.
When at all possible, you should avoid using UDFs and try to keep as much logic as you can in Spark.
You could use a UDF to have Spark directly call the API for you. I originally had the code doing this, but it was calling the API for one row at a time, which was taking FOREVER.
So I went with a solution where I could make bulk, multithreaded calls to speed things up. There might be a way to do that directly from Spark without a UDF, but I never got around to investigating that.
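For what it's worth, one candidate for batching directly inside Spark is `mapPartitions`: each task gets an iterator over its partition, so you can buffer rows and make one API call per batch instead of per row. A sketch, with a stand-in for the bulk API:

```python
def bulk_clean_api(values):
    # Hypothetical stand-in for a bulk cleanup endpoint:
    # one round trip per *batch*, not per row.
    return [v.strip().lower() for v in values]

def clean_partition(rows, batch_size=100):
    """Iterator-in, iterator-out, as mapPartitions expects.
    Buffers rows and flushes one API call per full batch."""
    batch = []
    for value in rows:
        batch.append(value)
        if len(batch) == batch_size:
            yield from bulk_clean_api(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield from bulk_clean_api(batch)

# Spark glue (sketch, assuming a "city" column):
# cleaned = df.rdd.map(lambda r: r["city"]).mapPartitions(clean_partition)
```

Because the function is just iterator-to-iterator, the batching logic is testable without a cluster.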
I often find that making things go fast is just the FIRST challenge you face when working with Spark.
Once you’ve mastered solutions to that problem, the next level of trouble comes from external dependencies. Calls to APIs, databases, and other networked resources are going to give you heartburn. They often aren’t designed to handle 150+ open connections from a Spark cluster and will start to overload.
So in those situations you have to first get Spark to go fast…and then figure out how to intelligently rein it in and slow it down. That isn’t something Spark wants to do.
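One blunt but effective way to rein it in, as a sketch: gate the API calls behind a semaphore, so no matter how many threads you spin up, only a fixed number of connections are ever open at once (`call_api` is a stand-in for the real, rate-limited endpoint):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_OPEN_CONNECTIONS = 4  # whatever the API can actually tolerate
gate = threading.BoundedSemaphore(MAX_OPEN_CONNECTIONS)

def call_api(batch):
    # Hypothetical stand-in for the real cleanup API.
    return [v.lower() for v in batch]

def throttled_call(batch):
    # Blocks until one of the MAX_OPEN_CONNECTIONS slots frees up.
    with gate:
        return call_api(batch)

# Even with a large pool, at most 4 calls are in flight at any moment.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(throttled_call, [["A"], ["B"], ["C"]]))
```

On the Spark side, `coalesce`-ing to fewer partitions before the calls run has a similar effect: fewer simultaneous tasks, fewer open connections.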
I have more to share on this problem, but that’ll wait for another post.