Assumption: Spark and ClickHouse are up and running.
According to the official ClickHouse documentation, we can use the ClickHouse-Native-JDBC driver. To use it with Python, simply download the shaded jar from Maven Central. For simplicity, place it in the directory from which you run pyspark or your script.
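As a sketch of the download step, something like the following should work (version 2.5.4 matches the examples below, and the `repo1.maven.org` path assumes the standard Maven Central layout for the `com.github.housepower` group; check Maven Central for newer releases):

```shell
# Fetch the shaded ClickHouse-Native-JDBC jar from Maven Central.
# Version 2.5.4 is the one used in this walkthrough.
JAR=clickhouse-native-jdbc-shaded-2.5.4.jar
URL="https://repo1.maven.org/maven2/com/github/housepower/clickhouse-native-jdbc-shaded/2.5.4/$JAR"
curl -fLO "$URL" || echo "download failed; fetch $JAR manually"
```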
If you use pyspark directly, you must tell it to use this jar:
pyspark --driver-class-path ./clickhouse-native-jdbc-shaded-2.5.4.jar --jars ./clickhouse-native-jdbc-shaded-2.5.4.jar
If you use your own Python script, the following might serve as a reference:
import findspark
findspark.init()  # locate the local Spark installation

from pyspark.sql import SparkSession

if __name__ == '__main__':
    app_name = "Connect To ClickHouse - via JDBC"
    spark = (SparkSession.builder
             .master('local')
             .appName(app_name)
             .config("spark.driver.extraClassPath",
                     "./clickhouse-native-jdbc-shaded-2.5.4.jar")
             .getOrCreate())

    url = "jdbc:clickhouse://127.0.0.1:9000"
    user = "default"   # replace with your ClickHouse user
    password = ""      # and password
    dbtable = "nameOfDatabase.nameOfTable"
    driver = "com.github.housepower.jdbc.ClickHouseDriver"

    df = (spark.read.format('jdbc')
          .option('driver', driver)
          .option('url', url)
          .option('user', user)
          .option('password', password)
          .option('dbtable', dbtable)
          .load())

    df.show()  # show() prints the rows itself; no need to wrap it in print()
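If you only need an aggregate or a filtered subset, Spark's JDBC source (Spark 2.4 and later) also accepts a query option in place of dbtable, letting ClickHouse evaluate the SELECT itself. A minimal sketch reusing the connection details above (the SELECT and table name are placeholders, and the commented lines require a running ClickHouse server and SparkSession):

```python
# JDBC options for a query pushdown; url/user/password mirror the script above.
options = {
    "driver": "com.github.housepower.jdbc.ClickHouseDriver",
    "url": "jdbc:clickhouse://127.0.0.1:9000",
    "user": "default",
    "password": "",
    # "query" replaces "dbtable": ClickHouse runs the SELECT, Spark gets the result
    "query": "SELECT count() AS n FROM nameOfDatabase.nameOfTable",
}

# With a live server and a SparkSession this becomes:
# df = spark.read.format("jdbc").options(**options).load()
# df.show()
```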