Notes on Setting Up a Hadoop Pseudo-Distributed Cluster

Environment checklist

Preface: the whole pseudo-distributed cluster runs in virtual machines; the host machine runs macOS Big Sur.

  • Host OS: macOS Big Sur
  • Virtualization software: VMware Fusion
  • Guest OS: CentOS 8 Minimal

Software checklist

  • hadoop 3.3.0
  • zookeeper 3.7.0
  • hbase 2.4.2
  • spark 3.1.1
  • kafka 2.12-2.7.0
  • jdk 1.8.0_281
  • scala 2.12.8

System packages

yum install -y wget net-tools \
                    openssh openssh-server \
                    openssh-clients passwd telnet

Node information

192.168.0.114    master
192.168.0.169    slave1
192.168.0.156    slave2
192.168.0.155    slave3

Add these entries to /etc/hosts on every node so that the machines can later be reached by hostname. Every node needs this configuration; one way to append the entries is shown below.
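
For example, as root on every node, the entries can be appended like this:

cat >> /etc/hosts <<'EOF'
192.168.0.114    master
192.168.0.169    slave1
192.168.0.156    slave2
192.168.0.155    slave3
EOF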

Configure passwordless SSH login

When installing the virtual machines I only set up the root account; for the Hadoop cluster, create a dedicated bigdata account on every node.

1. Create the account (do this on every node).

useradd bigdata # create the account

passwd bigdata # set the password

2. Generate the SSH key pair. For the Hadoop ecosystem components to log in without a password, it is enough to set up passwordless access for the bigdata account.

[root@master]$ su bigdata

[bigdata@master]$ ssh-keygen -t rsa # this generates the key pair under ~/.ssh/

[bigdata@master]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys # if authorized_keys does not exist, create it manually first

[bigdata@master]$ ssh bigdata@master # test the setup; on success no password is required

[bigdata@master]$ chmod -R 700 ~/.ssh/ # if you get a publickey permission error, fix the permissions

# copy the public key to each slave node in turn
[bigdata@master]$ ssh-copy-id -f -i ~/.ssh/id_rsa.pub bigdata@slave1
[bigdata@master]$ ssh-copy-id -f -i ~/.ssh/id_rsa.pub bigdata@slave2
[bigdata@master]$ ssh-copy-id -f -i ~/.ssh/id_rsa.pub bigdata@slave3

These steps give us:

passwordless access from master to slave1
passwordless access from master to slave2
passwordless access from master to slave3
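
A quick way to confirm all three hops work without a password prompt (a small sketch, run as bigdata on master):

for h in slave1 slave2 slave3; do
    ssh -o BatchMode=yes bigdata@$h hostname   # BatchMode makes ssh fail instead of prompting for a password
done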

Configure the JDK

Preface: keep the directory layout identical on every node. Create a /bigdata directory at the filesystem root and give the bigdata account ownership of it (run chown -R bigdata:bigdata /bigdata), for example as shown below.
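
As root, on every node, something along these lines:

mkdir -p /bigdata                      # shared installation directory
chown -R bigdata:bigdata /bigdata      # hand it over to the bigdata account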

# After everything is set up, the layout looks roughly like this:
[bigdata@master bigdata]$ tree -L 1 /bigdata
/bigdata
├── hadoop-3.3.0
├── hadoop-3.3.0.tar.gz
├── hbase-2.4.2
├── hbase-2.4.2-bin.tar.gz
├── jdk-8u281-linux-x64.tar.gz
├── kafka_2.12-2.7.0
├── kafka_2.12-2.7.0.tgz
├── scala-2.12.8.rpm
├── spark-3.1.1-bin-hadoop3.2
├── spark-3.1.1-bin-hadoop3.2.tgz
├── zookeeper-3.7.0-bin
└── zookeeper-3.7.0-bin.tar.gz

1. Extract the JDK archive to the target directory.

[bigdata@master bigdata]$ tar -zxvf jdk-8u281-linux-x64.tar.gz -C /usr/local

2. Configure the environment variables.
Because there are quite a few settings, create a separate configuration file, /etc/hadoop, to keep them manageable.

[bigdata@master]$ touch /etc/hadoop

3. Put the following into the /etc/hadoop file:

# /etc/hadoop

export JAVA_HOME=/usr/local/jdk1.8.0_281

export PATH=$PATH:$JAVA_HOME/bin

4. Wire /etc/hadoop into the system profile so it takes effect. Append the following to the end of /etc/profile:

# hadoop ecosystem config
. /etc/hadoop

5. Reload the environment variables.

[bigdata@master]$ source /etc/profile

6. Verify that the configuration works:

[bigdata@master]$ java -version
java version "1.8.0_281"
Java(TM) SE Runtime Environment (build 1.8.0_281-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.281-b09, mixed mode)

7. Copy jdk-8u281-linux-x64.tar.gz to slave1, slave2, and slave3, then repeat the steps above on each node to finish the JDK setup. A scripted variant is sketched after the commands below.

[bigdata@master]$ scp jdk-8u281-linux-x64.tar.gz bigdata@slave1:/bigdata/

[bigdata@master]$ scp jdk-8u281-linux-x64.tar.gz bigdata@slave2:/bigdata/

[bigdata@master]$ scp jdk-8u281-linux-x64.tar.gz bigdata@slave3:/bigdata/
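
If you prefer to script the distribution, a rough sketch run from master (it assumes the bigdata account can use sudo on the slaves; otherwise run the extraction step as root on each node):

for h in slave1 slave2 slave3; do
    scp jdk-8u281-linux-x64.tar.gz bigdata@$h:/bigdata/                                       # ship the archive
    ssh -t bigdata@$h "sudo tar -zxf /bigdata/jdk-8u281-linux-x64.tar.gz -C /usr/local"       # extract remotely
done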

Configure Hadoop

1. Extract Hadoop.

[bigdata@master bigdata]$ tar -zxvf hadoop-3.3.0.tar.gz

2. Add Hadoop to the environment variables in /etc/hadoop.

# /etc/hadoop

export JAVA_HOME=/usr/local/jdk1.8.0_281
export HADOOP_HOME=/bigdata/hadoop-3.3.0

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

3. Reload the environment variables.

[bigdata@master bigdata]$ source /etc/profile

4. Edit the Hadoop configuration files.
4.1. hadoop-3.3.0/etc/hadoop/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
      <name>dfs.namenode.http-address</name>
      <value>master:9870</value>
    </property>
    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>slave1:50090</value>
    </property>
</configuration>

4.2. hadoop-3.3.0/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/bigdata/hadoop-3.3.0/data</value>
    </property>
</configuration>

Note: if the data and logs directories do not exist under /bigdata/hadoop-3.3.0, create them manually, for example as shown below.
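
For example, on each node:

mkdir -p /bigdata/hadoop-3.3.0/data /bigdata/hadoop-3.3.0/logs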

4.3. /bigdata/hadoop-3.3.0/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/local/jdk1.8.0_281

4.4. /bigdata/hadoop-3.3.0/etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
</configuration>

4.5. /bigdata/hadoop-3.3.0/etc/hadoop/workers

slave1
slave2
slave3

5. Copy the whole hadoop-3.3.0 directory to slave1, slave2, and slave3.

[bigdata@master]$ scp -r hadoop-3.3.0 bigdata@slave1:/bigdata/

[bigdata@master]$ scp -r hadoop-3.3.0 bigdata@slave2:/bigdata/

[bigdata@master]$ scp -r hadoop-3.3.0 bigdata@slave3:/bigdata/

6. Run the HDFS format command on each node.

[bigdata@master]$ hdfs namenode -format

[bigdata@slave1]$ hdfs namenode -format

[bigdata@slave2]$ hdfs namenode -format

[bigdata@slave3]$ hdfs namenode -format

7. Start Hadoop on the master node.

[bigdata@master]$ ./hadoop-3.3.0/sbin/start-all.sh # this starts the services on the master and then on the slave nodes

8. Check the service status.

[bigdata@master bigdata]$ jps
30033 Jps
6908 NameNode
7292 ResourceManager

If you see output like the above, the services started successfully. Running jps on a slave node shows:

2069 DataNode
5498 Jps
2190 NodeManager
xxx SecondaryNameNode <- because of the configuration above, slave1 additionally runs the SecondaryNameNode.

9. Visit the Hadoop web UIs.
9.1. NameNode UI, at http://192.168.0.114:9870
(Screenshot: Namenode information)
9.2. Cluster nodes UI, at http://192.168.0.114:8088/cluster/nodes
(Screenshot: Nodes of the cluster)
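
If you prefer to verify from the command line, a quick sketch (run as bigdata on master):

[bigdata@master]$ hdfs dfsadmin -report | head -n 20   # should report 3 live DataNodes
[bigdata@master]$ yarn node -list                      # should list the 3 NodeManagers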

That completes the Hadoop pseudo-distributed cluster setup.

Next, we move on to HBase.

=======================================

This is a long write-up, so now is a good time to grab a coffee. :)

=======================================

Configure HBase

1. Extract the HBase archive.

[bigdata@master]$ tar -zxvf hbase-2.4.2-bin.tar.gz

2. Add HBase to the environment variables in /etc/hadoop.

# /etc/hadoop

export JAVA_HOME=/usr/local/jdk1.8.0_281
export HADOOP_HOME=/bigdata/hadoop-3.3.0
export HBASE_HOME=/bigdata/hbase-2.4.2
export CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin

3. Reload the environment variables.

[bigdata@master]$ source /etc/profile

4. Edit the HBase configuration file /bigdata/hbase-2.4.2/conf/hbase-env.sh.

export JAVA_HOME=/usr/local/jdk1.8.0_281
export HBASE_MANAGES_ZK=true # <- use the ZooKeeper bundled with HBase

5. Edit the HBase configuration file /bigdata/hbase-2.4.2/conf/hbase-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
  <!--
    The following properties are set for running HBase as a single process on a
    developer workstation. With this configuration, HBase is running in
    "stand-alone" mode and without a distributed file system. In this mode, and
    without further configuration, HBase and ZooKeeper data are stored on the
    local filesystem, in a path under the value configured for `hbase.tmp.dir`.
    This value is overridden from its default value of `/tmp` because many
    systems clean `/tmp` on a regular basis. Instead, it points to a path within
    this HBase installation directory.

    Running against the `LocalFileSystem`, as opposed to a distributed
    filesystem, runs the risk of data integrity issues and data loss. Normally
    HBase will refuse to run in such an environment. Setting
    `hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
    permitting operation. This configuration is for the developer workstation
    only and __should not be used in production!__

    See also https://hbase.apache.org/book.html#standalone_dist
  -->
  <property>
      <name>hbase.rootdir</name>
      <value>hdfs://master:9000/hbase</value>
      <description> hbase.rootdir是RegionServer的共享目录,用于持久化存储HBase数据,默认写入/tmp中。如果不修改此配置,在HBase重启时,数据会丢失。此处一般设置的是hdfs的文件目录,如NameNode运行在namenode.Example.org主机的9090端口,则需要设置为hdfs://namenode.example.org:9000/hbase
      </description>
  </property>
  <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
      <description>The HBase deployment mode: false means standalone or pseudo-distributed mode, true means fully-distributed mode.
      </description>
  </property>
  <property>
      <name>hbase.tmp.dir</name>
      <value>/bigdata/hbase-2.4.2/tmp</value>
  </property>
  <property>
      <name>hbase.zookeeper.quorum</name>
      <value>master,slave1,slave2,slave3</value>
      <description>The hosts that make up the ZooKeeper quorum.
      </description>
  </property>
  <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
  </property>
  <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/bigdata/hbase-2.4.2/data</value>
      <description>Where ZooKeeper stores its metadata; if unset it defaults to /tmp, which is lost on restart.
      </description>
  </property>
  <property>
    <name>hbase.wal.provider</name>
    <value>filesystem</value>
  </property>
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>
</configuration>

6. Edit the HBase configuration file /bigdata/hbase-2.4.2/conf/regionservers.

slave1
slave2
slave3

7. Create the /bigdata/hbase-2.4.2/data and /bigdata/hbase-2.4.2/logs directories (if they do not exist), then create the file /bigdata/hbase-2.4.2/data/myid containing the value 0, for example as shown below.
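
For example:

[bigdata@master bigdata]$ mkdir -p /bigdata/hbase-2.4.2/data /bigdata/hbase-2.4.2/logs
[bigdata@master bigdata]$ echo 0 > /bigdata/hbase-2.4.2/data/myid
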
8. Copy the whole hbase-2.4.2 directory to every node.

[bigdata@master]$ scp -r hbase-2.4.2 bigdata@slave1:/bigdata/

[bigdata@master]$ scp -r hbase-2.4.2 bigdata@slave2:/bigdata/

[bigdata@master]$ scp -r hbase-2.4.2 bigdata@slave3:/bigdata/

9. On slave1, slave2, and slave3, change the content of hbase-2.4.2/data/myid to 1, 2, and 3 respectively, for example as sketched below.
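
This can be scripted from master, for example:

[bigdata@master bigdata]$ for i in 1 2 3; do ssh bigdata@slave$i "echo $i > /bigdata/hbase-2.4.2/data/myid"; done
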
10. Start the HBase services.

[bigdata@master bigdata]$ ./hbase-2.4.2/bin/start-hbase.sh # this starts the HBase services on all nodes

11. Check the status with jps: the slave nodes show an HRegionServer process and the master shows HMaster.

Have HBase use a standalone ZooKeeper

With the configuration above, HBase is still using its bundled ZooKeeper. Now we set up a standalone ZooKeeper ensemble for HBase to use.
1. First, stop the HBase services.

[bigdata@master bigdata]$ ./hbase-2.4.2/bin/stop-hbase.sh # this stops the HBase services on all nodes

2. Edit the HBase configuration file /bigdata/hbase-2.4.2/conf/hbase-env.sh.

export HBASE_MANAGES_ZK=false

3. Configure ZooKeeper.
3.1. Extract the ZooKeeper archive.

[bigdata@master bigdata]$ tar -zxvf zookeeper-3.7.0-bin.tar.gz

3.2. Configure /bigdata/zookeeper-3.7.0-bin/conf/zoo.cfg (the file does not exist by default; copy it from /bigdata/zookeeper-3.7.0-bin/conf/zoo_sample.cfg).

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.
# dataDir=/tmp/zookeeper
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the 
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1

## Metrics Providers
#
# https://prometheus.io Metrics Exporter
#metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
#metricsProvider.httpPort=7000
#metricsProvider.exportJvmInfo=true
dataDir=/bigdata/zookeeper-3.7.0-bin/data
dataLogDir=/bigdata/zookeeper-3.7.0-bin/logs
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888

3.3. Add ZooKeeper to the environment variables in /etc/hadoop.

# /etc/hadoop

export JAVA_HOME=/usr/local/jdk1.8.0_281
export HADOOP_HOME=/bigdata/hadoop-3.3.0
export ZK_HOME=/bigdata/zookeeper-3.7.0-bin
export HBASE_HOME=/bigdata/hbase-2.4.2
export CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$ZK_HOME/bin

3.4. Reload the environment variables.

[bigdata@master bigdata]$ source /etc/profile

3.5. Create the /bigdata/zookeeper-3.7.0-bin/data and /bigdata/zookeeper-3.7.0-bin/logs directories (if they do not exist), then create the file /bigdata/zookeeper-3.7.0-bin/data/myid containing the value 1.

3.6. Copy the whole zookeeper-3.7.0-bin directory to every node.

[bigdata@master]$ scp -r zookeeper-3.7.0-bin bigdata@slave1:/bigdata/

[bigdata@master]$ scp -r zookeeper-3.7.0-bin bigdata@slave2:/bigdata/

[bigdata@master]$ scp -r zookeeper-3.7.0-bin bigdata@slave3:/bigdata/

3.7. On slave1, slave2, and slave3, change the content of zookeeper-3.7.0-bin/data/myid to 2, 3, and 4 respectively, for example as sketched below.
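
Again, this can be scripted from master, for example:

[bigdata@master bigdata]$ for i in 1 2 3; do ssh bigdata@slave$i "echo $((i+1)) > /bigdata/zookeeper-3.7.0-bin/data/myid"; done
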
3.8. Start the ZooKeeper service on each node.

[bigdata@master bigdata]$ ./zookeeper-3.7.0-bin/bin/zkServer.sh start

[bigdata@slave1 bigdata]$ ./zookeeper-3.7.0-bin/bin/zkServer.sh start

[bigdata@slave2 bigdata]$ ./zookeeper-3.7.0-bin/bin/zkServer.sh start

[bigdata@slave3 bigdata]$ ./zookeeper-3.7.0-bin/bin/zkServer.sh start

3.9. Check the service status on each node.

[bigdata@master bigdata]$ ./zookeeper-3.7.0-bin/bin/zkServer.sh status

[bigdata@slave1 bigdata]$ ./zookeeper-3.7.0-bin/bin/zkServer.sh status

[bigdata@slave2 bigdata]$ ./zookeeper-3.7.0-bin/bin/zkServer.sh status

[bigdata@slave3 bigdata]$ ./zookeeper-3.7.0-bin/bin/zkServer.sh status

4. Start the HBase services.

[bigdata@master bigdata]$ ./hbase-2.4.2/bin/start-hbase.sh # this starts the HBase services on all nodes

5. Check with jps: seeing a QuorumPeerMain process means HBase is now running against the standalone ZooKeeper.
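
Optionally, you can also check that HBase has registered its znodes in the standalone ensemble using the ZooKeeper CLI (a sketch):

[bigdata@master bigdata]$ zkCli.sh -server master:2181 ls /hbase
# should list HBase's child znodes, e.g. master, rs, meta-region-server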

6. Test HBase with the hbase shell.

[bigdata@master bigdata]$ hbase shell
hbase:002:0> status
1 active master, 0 backup masters, 3 servers, 0 dead, 0.3333 average load
Took 0.1664 seconds

7. Visit the HBase web UI at http://192.168.0.114:16010
(Screenshot: HBase Master page)

Install Scala

The steps follow the Zhihu article "centos 8 安装scala" (installing Scala on CentOS 8) and are not repeated here.

Install Spark

1. Extract the Spark archive.

[bigdata@master bigdata]$ tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz

2. Configure /bigdata/spark-3.1.1-bin-hadoop3.2/conf/log4j.properties (the file does not exist by default; copy it from /bigdata/spark-3.1.1-bin-hadoop3.2/conf/log4j.properties.template).

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

3. Configure /bigdata/spark-3.1.1-bin-hadoop3.2/conf/spark-env.sh (the file does not exist by default; copy it from /bigdata/spark-3.1.1-bin-hadoop3.2/conf/spark-env.sh.template).

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Options for launcher
# - SPARK_LAUNCHER_OPTS, to set config properties and Java options for the launcher (e.g. "-Dx=y")

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_LOG_MAX_FILES Max log files of Spark daemons can rotate to. Default is 5.
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS
export JAVA_HOME=/usr/local/jdk1.8.0_281
export SCALA_HOME=/usr/share/scala
export HADOOP_HOME=/bigdata/hadoop-3.3.0
export SPARK_HOME=/bigdata/spark-3.1.1-bin-hadoop3.2
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=512M
export SPARK_MASTER_HOST=master

4. Configure /bigdata/spark-3.1.1-bin-hadoop3.2/conf/workers (the file does not exist by default; copy it from /bigdata/spark-3.1.1-bin-hadoop3.2/conf/workers.template).

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
slave1
slave2
slave3

5. Copy the whole /bigdata/spark-3.1.1-bin-hadoop3.2 directory to slave1, slave2, and slave3 (see below).
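
Following the same pattern as the other components, for example:

[bigdata@master bigdata]$ scp -r spark-3.1.1-bin-hadoop3.2 bigdata@slave1:/bigdata/

[bigdata@master bigdata]$ scp -r spark-3.1.1-bin-hadoop3.2 bigdata@slave2:/bigdata/

[bigdata@master bigdata]$ scp -r spark-3.1.1-bin-hadoop3.2 bigdata@slave3:/bigdata/
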
6. Start the Spark services.

[bigdata@master bigdata]$ ./spark-3.1.1-bin-hadoop3.2/sbin/start-all.sh # this starts the services on all nodes

7. Check with jps: the master shows a Master process and the slaves show Worker processes, which means the services are running normally.
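
As an extra smoke test, you can submit the bundled SparkPi example to the standalone master (a sketch; the examples jar ships with this Spark distribution):

[bigdata@master bigdata]$ ./spark-3.1.1-bin-hadoop3.2/bin/spark-submit \
    --master spark://master:7077 \
    --class org.apache.spark.examples.SparkPi \
    ./spark-3.1.1-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.1.1.jar 100
# the job should finish with a line like "Pi is roughly 3.14..."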

Configure Kafka

1. Extract the Kafka archive.

[bigdata@master bigdata]$ tar -zxvf kafka_2.12-2.7.0.tgz

2. Edit the /bigdata/kafka_2.12-2.7.0/config/server.properties configuration file.

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from 
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set, 
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma separated list of directories under which to store log files
#log.dirs=/tmp/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
# zookeeper.connect=localhost:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=18000


############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0


log.dirs=/bigdata/kafka_2.12-2.7.0/logs
zookeeper.connect=master:2181,slave1:2181,slave2:2181,slave3:2181
# bind on all interfaces; clients connect via advertised.listeners below
listeners=PLAINTEXT://0.0.0.0:9092
# master ip
advertised.listeners=PLAINTEXT://192.168.0.114:9092

3. Copy the whole /bigdata/kafka_2.12-2.7.0 directory to every node.

[bigdata@master bigdata]$ scp -r kafka_2.12-2.7.0 bigdata@slave1:/bigdata/

[bigdata@master bigdata]$ scp -r kafka_2.12-2.7.0 bigdata@slave2:/bigdata/

[bigdata@master bigdata]$ scp -r kafka_2.12-2.7.0 bigdata@slave3:/bigdata/

Note: on each node, set advertised.listeners in server.properties to that machine's IP address, and set broker.id in server.properties to 1, 2, and 3 on the slaves respectively; a scripted sketch follows.
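
A sketch of scripting this from master, using the IPs from the node list above (the sed patterns assume the active broker.id and advertised.listeners lines shown earlier; verify each node's server.properties afterwards):

ips=(192.168.0.169 192.168.0.156 192.168.0.155)
for i in 1 2 3; do
    ip=${ips[$((i-1))]}
    ssh bigdata@slave$i "sed -i 's/^broker.id=.*/broker.id=$i/' /bigdata/kafka_2.12-2.7.0/config/server.properties"
    ssh bigdata@slave$i "sed -i 's#^advertised.listeners=.*#advertised.listeners=PLAINTEXT://$ip:9092#' /bigdata/kafka_2.12-2.7.0/config/server.properties"
done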

4. Start the Kafka service on each node in turn.

[bigdata@master bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-server-start.sh ./kafka_2.12-2.7.0/config/server.properties

[bigdata@slave1 bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-server-start.sh ./kafka_2.12-2.7.0/config/server.properties

[bigdata@slave2 bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-server-start.sh ./kafka_2.12-2.7.0/config/server.properties

[bigdata@slave3 bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-server-start.sh ./kafka_2.12-2.7.0/config/server.properties

Note: started this way, Kafka runs in the foreground. To keep it running in the background, start it with nohup.

[bigdata@master bigdata]$ nohup ./kafka_2.12-2.7.0/bin/kafka-server-start.sh ./kafka_2.12-2.7.0/config/server.properties &

5. Check the Kafka status with jps: a Kafka process means the broker started successfully.
6. Create a topic.

[bigdata@master bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-topics.sh --create --zookeeper master:2181,slave1:2181,slave2:2181,slave3:2181 --replication-factor 1 --partitions 1 --topic test

7. Produce messages. This command opens a producer console; type a line and press Enter to send a message.

[bigdata@master bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-console-producer.sh --broker-list master:9092,slave1:9092,slave2:9092,slave3:9092 --topic test

8. Consume messages.

[bigdata@master bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-console-consumer.sh --bootstrap-server master:9092 --topic test --from-beginning

9. Inspect topic information.

[bigdata@master bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-topics.sh --list --zookeeper master:2181

[bigdata@master bigdata]$ ./kafka_2.12-2.7.0/bin/kafka-topics.sh --zookeeper master:2181 --describe --topic test

At this point, the environment is essentially complete.
Whenever you start the cluster, mind the startup order:

hadoop -> zookeeper -> hbase -> spark -> kafka
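
If you like, the order can be captured in a small helper script run on master as the bigdata user (a rough sketch based on the paths used above):

#!/usr/bin/env bash
# start-cluster.sh - start the components in the order above (sketch)
/bigdata/hadoop-3.3.0/sbin/start-all.sh                      # HDFS + YARN
for h in master slave1 slave2 slave3; do                     # ZooKeeper ensemble
    ssh bigdata@$h /bigdata/zookeeper-3.7.0-bin/bin/zkServer.sh start
done
/bigdata/hbase-2.4.2/bin/start-hbase.sh                      # HBase
/bigdata/spark-3.1.1-bin-hadoop3.2/sbin/start-all.sh         # Spark standalone
for h in master slave1 slave2 slave3; do                     # Kafka brokers
    ssh bigdata@$h "nohup /bigdata/kafka_2.12-2.7.0/bin/kafka-server-start.sh \
        /bigdata/kafka_2.12-2.7.0/config/server.properties > /tmp/kafka.log 2>&1 &"
done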

Avoid forcibly powering off the virtual machines whenever possible, as doing so can cause unexpected problems.

The software listed above is bundled on Baidu Pan at https://pan.baidu.com/s/1HgDbr_ajY9LXHnM7nNqWTQ, extraction code 33tp.

When republishing, please credit the original article: https://blog.keepchen.com/a/set-up-pseudo-distributed-hadoop.html