Hue安装以及集成组件

nwnusun / 2024-01-16 / 原文

image-20231130154444492

介绍

Hue支持以下服务

HDFS、HBase、S3

Hive、Impala、Presto、Trino、Flink、Phoenix、Druid

Solr、ES、ClickHouse

Spark、Yarn、Ooize、Knox

Oracle、PG、DB2、Vertica、MYSQL

最新5.x版本支持Ozone

用于数据库、数据仓库甚至数据湖的 Hadoop 用户体验 (HUE)

Hue 是一个基于 Web 的开源图形用户界面,它将多个不同的 Hadoop 生态系统项目组合到一个可配置的界面中。

img

编译Hue

注意:Hue 4.11 是支持 Python 2.7 的最后一个版本。新版本将基于 Python 3.8 及更高版本

Python版本

  • Python3.8
  • Python3.9

官方

  • Hue Custom Databases | 6.3.x | Cloudera Documentation

设置镜像

  • pip install pip -U
    pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
    

注意

采用Python2.7编译比较顺利,Python3.8会出现各种模块缺失错误,暂时没有解决,怀疑Python安装的有问题,毕竟官方是支持Python3.8的,待定

安装Python3.8

安装Anacoda的方式,这样方便切换Python版本

export export PYTHON_VER=python3.8
ln -s xxxx/pip3.8 /usr/bin
ln -s xxxx/python3.8
PREFIX=/srv/dstore/1.0.0.0/  make install  # 注意,编译成功后,虚拟环境下的python文件头带这个路径,如果移到别的目录,则需要在/hue/build/env/bin下修改为上述的目录
pip install virtualenv
yum -y install rsync
yum -y install ant asciidoc cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain gcc gcc-c++ krb5-devel libffi-devel libxml2-devel libxslt-devel make mysql mysql-devel openldap-devel python-devel sqlite-devel gmp-devel
安装node.js
wget https://nodejs.org/dist/v16.13.0/node-v16.13.0-linux-x64.tar.xz
tar -xvf node-v16.13.0-linux-x64.tar.xz
export PATH=$PATH:/opt/software/node-v16.13.0/bin
npm config set registry=http://registry.npm.taobao.org 
安装maven
wget https://dlcdn.apache.org/maven/maven-3/3.8.3/binaries/apache-maven-3.8.3-bin.tar.gz
tar  -zxvf apache-maven-3.8.3-bin.tar.gz

编辑settings.xml

<mirror>
  <id>alimaven</id>
  <name>aliyun maven</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  <mirrorOf>central</mirrorOf>
</mirror>

开始编译

export PYTHON_VER=/xxx/xxx/python.xx  # 如果不设置Python环境变量,则取默认python2.7
PREFIX=/srv/dstore/1.0.0.0 make install 

在安装ES、Clickhouse的过程中遇到一些库问题,且接入Ozone需要Hue 5.x版本,所以尝试重新拉包编译

Python 3.8

错误

问题一

ImportError: libxslt.so.1: cannot open shared object file: No such file or directory

解决方式

yum -y install  asciidoc

集成Hdfs

HDFS区分HA

修改Hue配置 pseudo-distributed.ini

# Configuration for HDFS NameNode
  # ------------------------------------------------------------------------
  [[hdfs_clusters]]
    # HA support by using HttpFs

    [[[default]]]
      # Enter the filesystem uri
      fs_defaultfs=hdfs://cluster
      # NameNode logical name.
      logical_name=cluster  

      # Use WebHdfs/HttpFs as the communication mechanism.
      # Domain should be the NameNode or HttpFs host.
      # Default port is 14000 for HttpFs.
      webhdfs_url=http://sun75:14000/webhdfs/v1 

      # Change this if your HDFS cluster is Kerberos-secured
      ## security_enabled=false

      # In secure mode (HTTPS), if SSL certificates from YARN Rest APIs
      # have to be verified against certificate authority
      ## ssl_cert_ca_verify=True

      # Directory of the Hadoop configuration
      ## hadoop_conf_dir=$HADOOP_CONF_DIR when set or '/etc/hadoop/conf'
      hadoop_conf_dir=/srv/dstore/1.0.0.0/hdfs/etc/hadoop
      # Whether Hue should list this HDFS cluster. For historical reason there is no way to disable HDFS.
      ## is_enabled=true

可以看到其中有Httpfs的配置信息,那么为啥需要这个东西?ChatGPT给出结论如下

  1. 跨平台和跨语言支持: HTTPFS提供了一种标准的HTTP访问方式,因此可以被各种编程语言和平台所支持,使得与HDFS的交互更加灵活和便利。
  2. 安全性: 通过HTTPFS,可以更容易地实现对HDFS的访问控制和安全认证,例如使用基于HTTP的身份验证和授权机制来保护数据。
  3. 防火墙穿透: 在某些网络环境下,由于安全策略的限制,直接访问Hadoop集群的API接口可能会受到限制,而HTTP协议通常更容易穿透防火墙。
  4. 简化配置: 使用HTTPFS可以简化客户端配置,因为它使用标准的HTTP协议,不需要像直接连接namenode的API接口那样进行复杂的配置。

总结:

尽管可以直接连接namenode的API接口进行文件系统操作,但使用HTTPFS仍然具有一定的优势,特别是在跨平台、安全性和网络环境受限的情况下。

从官方注释可知,可以采用2种不同的方式与HDFS通信

  1. WebHdfs:WebHdfs是Hadoop中的一种RESTful风格的HDFS访问协议。它通过HTTP协议提供对HDFS的访问,允许用户通过REST API执行文件系统操作。在Hue中,通过指定WebHdfs的URL,Hue可以使用HTTP来与HDFS进行通信。
  2. HttpFs:HttpFs是Hadoop的一个HTTP代理服务,允许通过HTTP方式访问HDFS。它提供了一个RESTful API,允许用户通过HTTP请求对HDFS进行文件系统操作。在这里,如果WebHdfs不可用,HttpFs可以作为备选的通信机制。
WebHdfs方式

HA 方式只能填处于active的NameNode节点

webhdfs_url=http://sun75:50070/webhdfs/v1 

修改core-site.xml

配置 Hue 为其他用户和组的代理用

<!-- HUE -->
<property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>  # 这玩意不配,打死都起不来
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop.groups</name> # 这玩意不配,打死都起不来
    <value>*</value>
</property>

修改hdfs-site.xml

<!-- HUE -->
<property>
    <!-- HDFS Web 服务 -->
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>
HttpFs方式

必须在Hadoop集群启动HttpFs服务

webhdfs_url=http://sun75:14000/webhdfs/v1 

在WebHdfs的基础上修改 httpfs-site.xml 文件

<property>
  <name>httpfs.proxyuser.hue.hosts</name>
  <value>*</value> 
</property>
<property> 
	<name>httpfs.proxyuser.hue.groups</name> 
	<value>*</value> 
</property>

集成Yarn

本次配置为ResourceManager HA环境

 # Configuration for YARN (MR2)
  # ------------------------------------------------------------------------
  [[yarn_clusters]]

    [[[default]]]
      # Enter the host on which you are running the ResourceManager
      #resourcemanager_host=sun46 # ha环境无需

      # The port where the ResourceManager IPC listens on
      #resourcemanager_port=8032  # ha环境无需

      # Whether to submit jobs to this cluster
      submit_to=True

      # Resource Manager logical name (required for HA)
      logical_name=rm1

      # Change this if your YARN cluster is Kerberos-secured
      ## security_enabled=false

      # URL of the ResourceManager API
      resourcemanager_api_url=http://sun46:8088

      # URL of the ProxyServer API
      proxy_api_url=http://sun46:8088

      # URL of the HistoryServer API
      history_server_api_url=http://sun54:18888

      # URL of the Spark History Server
      ## spark_history_server_url=http://localhost:18088

      # Change this if your Spark History Server is Kerberos-secured
      ## spark_history_server_security_enabled=false

      # In secure mode (HTTPS), if SSL certificates from YARN Rest APIs
      # have to be verified against certificate authority
      ## ssl_cert_ca_verify=True

    # HA support by specifying multiple clusters.
    # Redefine different properties there.
    # e.g.

     [[[ha]]]
      # Resource Manager logical name (required for HA)
      logical_name=rm2

      # Un-comment to enable
      submit_to=True

      # URL of the ResourceManager API
      resourcemanager_api_url=http://sun54:8088
      history_server_api_url=http://sun54:19888

非HA环境

 [[yarn_clusters]]

    [[[default]]]
      # Enter the host on which you are running the ResourceManager
      resourcemanager_host=sun46

      # The port where the ResourceManager IPC listens on
      resourcemanager_port=8032

      # Whether to submit jobs to this cluster
      submit_to=True

      # Resource Manager logical name (required for HA)
      logical_name=yarn-cluster

      # URL of the ResourceManager API
      resourcemanager_api_url=http://sun46:8088

      # URL of the ProxyServer API
      proxy_api_url=http://sun46:8088

      # URL of the HistoryServer API
      history_server_api_url=http://sun46:18888

测试命令

hadoop  jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount 10 100

等待yarn任务结束后,登陆Hue平台查看

非hue用户没有权限搜索

image-20231124115912501

集成Hbase

注意:Hue只支持HBase thrift1。如果使用的是thrift2会出现问题

默认HBase thrift server端口号为9090

官方配置:https://docs.gethue.com/administrator/configuration/connectors/#hbase

修改hbase-site.xml

<property>
  <name>hbase.regionserver.thrift.http</name>
  <value>true</value>
</property>
<property>
  <name>hbase.thrift.support.proxyuser</name>
  <value>true</value>
</property>

修改core-site.xml

<property>
  <name>hadoop.proxyuser.hbase.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hbase.groups</name>
  <value>*</value>
</property>

修改Hue配置

[hbase]
  # Comma-separated list of HBase Thrift servers for clusters in the format of '(name|host:port)'.
  # Use full hostname. If hbase.thrift.ssl.enabled in hbase-site is set to true, https will be used otherwise it will use http
  # If using Kerberos we assume GSSAPI SASL, not PLAIN.
  hbase_clusters=(Cluster|sun54:9090)

  # HBase configuration directory, where hbase-site.xml is located.
  hbase_conf_dir=/srv/dstore/1.0.0.0/hbase/conf

  # Hard limit of rows or columns per row fetched before truncating.
  ## truncate_limit = 500

  # Should come from hbase-site.xml, do not set. 'framed' is used to chunk up responses, used with the nonblocking server in Thrift but is not supported in Hue.
  # 'buffered' used to be the default of the HBase Thrift Server. Default is buffered when not set in hbase-site.xml.
  thrift_transport=buffered

  # Choose whether Hue should validate certificates received from the server.
  ## ssl_cert_ca_verify=true

测试

image-20231124152521868

集成Hive

[beeswax]
  hive_server_host=sun46
  hive_server_port=10000
  hive_conf_dir=/srv/dstore/1.0.0.0/hive/conf
  thrift_version=11

集成S3

[aws]
  # Enable the detection of an IAM role providing the credentials automatically. It can take a few seconds.
  ## has_iam_detection=false

  [[aws_accounts]]
    # Default AWS account
    [[[default]]]
      # AWS credentials
      	 access_key_id=AKIAIOSFODNN7EXAMPLE
       	 secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      ## security_token=

      # Execute this script to produce the AWS access key ID.
      ## access_key_id_script=/path/access_key_id.sh

      # Execute this script to produce the AWS secret access key.
      ## secret_access_key_script=/path/secret_access_key.sh

      # Allow to use either environment variables or
      # EC2 InstanceProfile to retrieve AWS credentials.
       allow_environment_credentials=false

      # AWS region to use, if no region is specified, will attempt to connect to standard s3.amazonaws.com endpoint
        region=us-east-1

      # Endpoint overrides
        # host=

      # Endpoint overrides
       proxy_address=192.168.2.177
       proxy_port=9878
      ## proxy_user=
      ## proxy_pass=

      # Secure connections are the default, but this can be explicitly overridden:
       is_secure=false

      # The default calling format uses https://<bucket-name>.s3.amazonaws.com but
      # this may not make sense if DNS is not configured in this way for custom endpoints.
      # e.g. Use boto.s3.connection.OrdinaryCallingFormat for https://s3.amazonaws.com/<bucket-name>
      calling_format=boto.s3.connection.OrdinaryCallingFormat

      # The time in seconds before a delegate key is expired. Used when filebrowser/redirect_download is used. Default to 4 Hours.
      ## key_expiry=14400

集成Spark

img

两个连接器可用于与 SparkSql Thrift Server 通信

Apache Livy

Apache Livy 是一项服务,可通过 REST 接口轻松与 Spark 群集进行交互。它使 提交 Spark 作业或 Spark 代码片段、同步或异步结果检索以及 Spark 上下文管理,全部通过简单的 REST 接口或 RPC 客户端库进行。

image.png

安装Livy

livy安装与部署_livy部署-CSDN博客

wget https://dlcdn.apache.org/incubator/livy/0.8.0-incubating/apache-livy-0.8.0-incubating_2.11-bin.zip --no-check-certificate

错误

{"msg":"Rejected, Reason: requirement failed: Cannot find Livy REPL jars."}

需要重新编译Livy以支持spark3.1

GitHub - apache/incubator-livy: Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

编译所需环境

  • mvn (from package or maven3 tarball)maven
  • java-1.8.0-openjdk (or Oracle JDK 8)
  • Python 2.7+
  • R 3.x

注意点

python3.8

ln -s /usr/bin/python3.8 /usr/bin/python  # livy会默认找/usr/bin/python
pip uninstall setuptools
pip install setuptools==39.1.0

R环境

yum -y install R

编译

mvn clean package -Pspark3	  -Pscala-2.12  -DskipTests 

image-20231208160527187

安装包路径

/usr/local/src/incubator-livy-master/assembly/target
apache-livy-0.9.0-incubating-SNAPSHOT_2.12-bin.zip

安装成功后,浏览器访问http:localhost:8998,如图

image-20231211114347813

上面有一行是applicationid,是因为后续的测试提交的任务,点击后会进入到spark的webui

image-20231211114549661

livy支持两种两种任务提交方案:
1、交互式:说白了,就是把原本在spark-shell里面执行的语句,通过http请求发送到livy服务端,然后livy在服务端开启spark-shell执行你传过来的语句;
2、批处理式:说白了,帮你做spark-submit的工作,同样通过http请求吧参数发到livy服务端。

交互式使用方法

工具采用postman,首先新建一个session(其实就是开启一个spark application):请求方式post,请求URL为 livy-ip:8998/sessions
请求体类似如下格式:

image-20231211114007897

就是利用postman发送一个post请求,请求体是json格式,json中可以指定appName,申请多少内存,申请多少cpu等信息
然后会受到响应。

可以看到,开启了一个session,他的id=0
接下来,我们可以向该session发送scala语句进行执行,请求方式post,请求URL为 livy-ip:8998/sessions/{id}/statements

image-20231211114126886

可以看到,这里的id=0,是指该statments的id
然后在发送另一个请求来查看输出,请求方式get,请求url为livy-ip:8998/sessions/1/statements/0

image-20231211114243993

其中,output中的data就是输出的具体数据,progress是指进度,有时候查出来progress小于1,说明任务还没执行完。

前提是Apache livy启动

Connectors :: Hue SQL Assistant Documentation (gethue.com)

修改配置

[[[pyspark]]]
name=PySpark
interface=livy

[[[sparksql]]]
name=SparkSql
interface=livy

[[[spark]]]
name=Scala
interface=livy

[[[r]]]
name=R
interface=livy
[spark]
# The Livy Server URL.
livy_server_url=http://localhost:8998
[desktop]
app_blacklist=

[notebook]
show_notebooks=true

重启Hue,测试

curl -X POST --data '{"kind": "pyspark", "proxyUser": "bob"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","proxyUser":"bob","log":[]}

在hue webui查看任务执行情况

image-20231211154240236

集成Spark SQL

方式一 、Thrift Server

Apache Livy 通过 Hue 提交一些 Spark SQL。Livy 最初是在 Hue 项目中创建的,并提供交互式或批处理 PySpark / Scala Spark /SparkSql 语句的轻量级提交。

然而,一个主要缺点是,它可能看起来不如 Spark 中附带的分布式 SQL 引擎(也称为“Thrift Server”)正式,Hue 可以通过两个接口连接到 Spark SQL

Thrift Server:

  • SqlAlchemy:基于通用 Python 库的连接器
  • HiveServer2:Hue 的 Hive 原生连接器

貌似需要重新编译Spark,暂时不做

Error: Failed to load class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
Failed to load main class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
You need to build Spark with -Phive and -Phive-thriftserver.

方式二、Apach livy

已经实现

连接器 :: Hue SQL Assistant Documentation (gethue.com)

[spark]
# The Livy Server URL.
livy_server_url=http://localhost:8998
[[[sparksql]]]
name=Spark SQL
interface=livy

集成Ozone

https://gethue.com/blog/discover-the-power-of-apache-ozone-using-the-hue-file-browser/

Hue 5.0版本支持Ozone

Ozone1.4版本支持HttpFs

由Hue官网可知,需要使用HttpFs来操作Ozone,而Ozone在1.4版本才支持HttpFs,所以如果想要测试,需要编译Hue以及Ozone

查看Ozone最新提交记录

image-20231128163454225

需要重新编译Ozone

最好再Linux服务上编译,如果失败可以重试几次

git clone https://github.com/apache/ozone.git
mvn clean package -DskipTests 

image-20231129170756602

安装源码包

hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT.tar.gz

需要额外启动httpfs

Hue配置

[[ozone]]
    [[[default]]]
        fs_defaultfs=ofs://[***SERVICE-ID***]
        webhdfs_url=http://[***OZONE-HTTPFS-HOST***]:[***OZONE-HTTPFS-PORT***]/webhdfs/v1
        ssl_cert_ca_verify=true
        security_enabled=true

Flink SQL 网关,以便能够通过 Hue 编辑器提交 SQL 查询

Flink SQL 网关目前仅支持 Apache Flink 1.10

https://docs.gethue.com/administrator/configuration/connectors/#apache-flink

集成ES

pip install elasticsearch-dbapi==0.2.4 # Python2选择较老版本

   [[[es]]]
     name = Elastic Search
     interface=sqlalchemy
     options='{"url": "elasticsearch+http://192.168.2.75:9200"}'

集成Mysql

./build/env/bin/pip install mysqlclient
[[[mysql]]]
name=MySQL
interface=sqlalchemy
options='{"url": "mysql://root:root@localhost:3306/hue"}'

集成Oracle

./build/env/bin/pip install cx_Oracle
[[[oracle]]]
    name = Oracle
    interface=sqlalchemy
    options='{"url": "oracle://system:123456@192.168.2.239:1521/ORCL"}'

问题1

DatabaseError: DPI-1047: Cannot locate a 64-bit Oracle Client library: “libclntsh.so: cannot open shared object file: No such file or directory”. See https://oracle.github.io/odpi/doc/installation.html#linux for help
mkdir -p /opt/oracle
cd /opt/oracle
wget https://download.oracle.com/otn_software/linux/instantclient/193000/instantclient-basic-linux.x64-19.3.0.0.0dbru.zip
unzip instantclient-basic-linux.x64–19.3.0.0.0dbru.zip
sudo sh -c “echo /opt/oracle/instantclient_19_3 > /etc/ld.so.conf.d/oracle-instantclient.conf”
sudo ldconfig
export LD_LIBRARY_PATH=/opt/oracle/instantclient_19_3:$LD_LIBRARY_PATH

问题2

(cx_Oracle.DatabaseError) ORA-12504: TNS:listener was not given the SERVICE_NAME in CONNECT_DATA (Background on this error at: http://sqlalche.me/e/4xp6)

没有配置SERVICE_NAME

集成PG

./build/env/bin/pip install psycopg2
  or
./build/env/bin/pip install psycopg2-binary
 [[[postgresql]]]
       name = postgresql
       interface=sqlalchemy
       options='{"url": "postgresql+psycopg2://postgres:Etl123456@192.168.2.223:5432/cvs"}' # csv为数据库

参考

https://docs.gethue.com/releases/release-notes-4.11.0/

https://blog.csdn.net/coderookie_run/article/details/108373170

https://blog.csdn.net/qq_25275355/article/details/124860691

https://blog.csdn.net/Meepoljd/article/details/122554037

https://dandelioncloud.cn/article/details/1419660854806777858

https://blog.csdn.net/weixin_34023982/article/details/86123778

https://blog.csdn.net/liweihope/article/details/122869416 #集成

https://demo.dandelioncloud.cn/article/details/1584360560164851713 # 集成

Yarn

  • https://gethue.com/hadoop-tutorial-yarn-resource-manager-high-availability-ha-in-mr2/

Oracle

  • 使用 Python 从 Azure 服务连接 SQL Server、Oracle、MySQL 和 PostgreSQL |由 Srijan Sahay |中等 (medium.com)