How to access your ClickHouse database with Spark in Python

Assumption: Spark and ClickHouse are up and running.

According to the official ClickHouse documentation, we can use the ClickHouse-Native-JDBC driver. To use it with Python, we simply download the shaded jar from the official Maven repository. For simplicity, we place it in the directory from which we either launch pyspark or run our script.
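Fetching the jar can be sketched as below. The Maven Central path is an assumption based on the usual repository layout for the `com.github.housepower` group; verify the coordinates and version before downloading, and uncomment the curl line to actually fetch the file.

```shell
# Sketch: build the Maven Central URL for the shaded jar (path is an assumption).
VERSION=2.5.4
JAR="clickhouse-native-jdbc-shaded-${VERSION}.jar"
URL="https://repo1.maven.org/maven2/com/github/housepower/clickhouse-native-jdbc-shaded/${VERSION}/${JAR}"
echo "Would download: ${URL}"
# curl -fLO "$URL"   # uncomment to download into the current directory
```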

If you use the pyspark shell, you must tell it to use this jar:

pyspark --driver-class-path ./clickhouse-native-jdbc-shaded-2.5.4.jar --jars ./clickhouse-native-jdbc-shaded-2.5.4.jar 

If you use your own Python script, the following might serve as a reference:

import findspark
findspark.init()  # make the local Spark installation importable

from pyspark.sql import SparkSession

if __name__ == '__main__':
    appName = "Connect To ClickHouse - via JDBC"
    spark = (SparkSession.builder
             .master('local')
             .appName(appName)
             .config("spark.driver.extraClassPath", "./clickhouse-native-jdbc-shaded-2.5.4.jar")
             .getOrCreate())

    url = "jdbc:clickhouse://127.0.0.1:9000"
    user = "default"    # replace with whatever you use
    password = ""       # same here
    dbtable = 'nameOfDatabase.nameOfTable'
    driver = "com.github.housepower.jdbc.ClickHouseDriver"

    df = (spark.read.format('jdbc')
          .option('driver', driver)
          .option('url', url)
          .option('user', user)
          .option('password', password)
          .option('dbtable', dbtable)
          .load())
    df.show()  # show() prints the rows itself and returns None
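Instead of loading the whole table, Spark's JDBC source also accepts a parenthesized subquery in the dbtable option, which the database executes server-side. A small sketch (the helper name as_dbtable is my own, not part of any API):

```python
def as_dbtable(query: str, alias: str = "t") -> str:
    """Wrap a SQL query so Spark's JDBC 'dbtable' option treats it as a subquery."""
    return f"({query}) AS {alias}"

# Usage with the reader from the script above (spark, driver, url, etc. assumed):
# df = (spark.read.format('jdbc')
#       .option('driver', driver).option('url', url)
#       .option('user', user).option('password', password)
#       .option('dbtable', as_dbtable("SELECT * FROM nameOfDatabase.nameOfTable LIMIT 100"))
#       .load())

print(as_dbtable("SELECT 1"))
```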
