Building a Delta Lake Big Data Platform from Scratch

BI, AI, Big Data Learning / 2024-10-21

1. Download VMware and install a CentOS 9 virtual machine

2. Configure users and create directories

2.1. Log in as an administrator and create a dedicated user for running Spark

sudo adduser sparkuser

2.2. Set the new user's password (123456)

sudo passwd sparkuser

2.3. Grant the new user sparkuser sudo privileges

  Switch to root: su -

  Open the sudoers file: visudo

  Grant sparkuser privileges by adding the line: sparkuser ALL=(ALL) NOPASSWD:ALL

  Save and exit: :wq

2.4. Log in as the newly created sparkuser and create the Spark directory

sudo mkdir /opt/spark

2.5. Change the owner of the Spark directory to sparkuser

sudo chown -R sparkuser:sparkuser /opt/spark

3. Download the Spark package, upload it to the VM, and extract it into the Spark directory

sudo tar -xvzf spark-3.5.3-bin-hadoop3.tgz -C /opt/spark --strip-components=1
sudo chown -R sparkuser:sparkuser /opt/spark

(The --strip-components=1 option removes the top-level directory from the extracted files, so they go directly into /opt/spark.)

4. Set environment variables

Add Spark to your PATH by editing the .bashrc or .bash_profile of the Spark user.

echo "export SPARK_HOME=/opt/spark" >> /home/sparkuser/.bashrc

echo "export PATH=\$PATH:\$SPARK_HOME/bin" >> /home/sparkuser/.bashrc

source /home/sparkuser/.bashrc

5. Java Setup

  Install Java

sudo yum install java-11-openjdk-devel  

  Check the installed version

java -version  

  Find the installation path

readlink -f $(which java)  

  Set environment variables (use the JDK path reported by readlink above, without the trailing /bin/java; the version suffix may differ on your system)

echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.20.1.1-2.el9.x86_64" >> /home/sparkuser/.bashrc

echo "export PATH=$JAVA_HOME/bin:$PATH" >> /home/sparkuser/.bashrc

source /home/sparkuser/.bashrc  

6. Start Spark

spark-shell
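
Once the shell is up, a quick sanity check helps confirm the installation works. This is a minimal sketch, assuming only the default spark session object that spark-shell creates for you:

// Print the Spark version the shell is running
spark.version

// Run a trivial job to verify that Spark can schedule work
spark.range(0, 100).count()   // should return 100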

7. Start spark-shell with Delta Lake (--packages pulls in the Delta Lake jars, and the two --conf options enable Delta's SQL extension and catalog)

bin/spark-shell --packages io.delta:delta-spark_2.12:3.2.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"  

8. Test Delta Lake

val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
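
To confirm the write succeeded, read the table back in the same Delta-enabled spark-shell. This is a minimal sketch that reuses the /tmp/delta-table path from above; the overwrite and time-travel read are optional extras that show Delta's versioning:

// Read the Delta table written above and display its rows
val df = spark.read.format("delta").load("/tmp/delta-table")
df.show()

// Overwrite the table with new values, then read the original version back via time travel
spark.range(5, 10).write.format("delta").mode("overwrite").save("/tmp/delta-table")
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table").show()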