I was in a situation a while back where I needed to make an API call to clean up some data in a dataframe. Here is the approach I took.

1. Get all the distinct values for the particular column out of the df
2. Make multithreaded and batched calls to the API (important for dealing with a LOT of data)
3. Add a new column to the df containing the results

Here is what a sample of that looks like in code:

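from pyspark.sql.functions import col, udf

def update_from_dict(my_dict):
    # Build a UDF that maps each value to its entry in my_dict (0 if missing)
    def _lookup(value):
        return my_dict.get(value, 0)
    return udf(_lookup)

# df, col_name and call_api are assumed to be defined elsewhere
# this is slow: collect() pulls every distinct value back to the driver
elements = [row[col_name] for row in df.select(col_name).distinct().collect()]
results = call_api(elements)  # expected to return a dict of {value: cleaned value}
df = df.withColumn(col_name, update_from_dict(results)(col(col_name)))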
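
The call_api helper is not shown above. Here is a minimal sketch of how the batched, multithreaded step might look. The clean_batch function, the batch_size and max_workers defaults, and the chunked helper are all assumptions for illustration; clean_batch in particular is a stand-in for whatever request your API actually needs.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def clean_batch(batch):
    # Hypothetical stand-in for the real API request: replace with a call
    # that accepts a list of values and returns {value: cleaned_value}.
    return {value: str(value).strip().lower() for value in batch}


def call_api(elements, batch_size=100, max_workers=8):
    """Send elements to the API in batches, several batches at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps batch order and re-raises any worker exception
        for partial in pool.map(clean_batch, chunked(elements, batch_size)):
            results.update(partial)
    return results
```

Keeping max_workers small is the easy knob for not overwhelming the API, since it caps how many batches are in flight at once.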

Performance issues?

Absolutely. Any time you pull data out of HDFS, out of the JVM, and into Python, you incur a cost. Here we are using a UDF (User Defined Function) to look up the dictionary result for every row in the dataframe, so every row gets shipped from the JVM to a Python worker and back.

When at all possible, you should avoid using UDFs and try to keep as much logic as you can in Spark.

Final thought/wisdom

I often find that making things go fast is just the FIRST challenge you face when working with Spark.

Once you’ve mastered solutions to that problem, the next level of trouble comes from external dependencies. Calls to APIs, Databases, and other networked resources are going to give you heartburn. They often aren’t designed to handle 150+ open connections from a Spark cluster and will start to overload.

So in those situations you first have to get Spark to go fast... and then figure out how to intelligently rein it in and slow it down. That isn’t something Spark wants to do.

I have more to share on this problem, but that’ll wait for another post.
