Creating a Highly Available ShardingSphere Proxy Cluster on AWS with Terraform

SphereEx · January 31, 2023, 07:42

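Background

Terraform

Terraform[1] is an open-source infrastructure automation and orchestration tool from HashiCorp[2]. It applies the IaC (Infrastructure as Code) approach to managing infrastructure changes, is supported by public cloud vendors such as AWS, GCP, and Azure, and has a wide range of community-provided providers. It has become one of the most popular ways to practice Infrastructure as Code.

Terraform has the following advantages:

Multi-cloud deployment

Terraform works for multi-cloud scenarios: similar infrastructure can be deployed to Alibaba Cloud, other cloud providers, or on-premises data centers. Developers can use the same tool and similar configuration files to manage resources across different cloud providers at the same time.

Automated infrastructure management

Terraform lets you create reusable modules, which reduces deployment and management errors caused by human factors.

Infrastructure as code

Resources can be managed and maintained with code. Terraform stores the state of the infrastructure, so users can track changes made to the different components of the system and share these configurations with others.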

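ShardingSphere-Proxy

Apache ShardingSphere is a distributed database ecosystem that can turn any database into a distributed database and enhance it with capabilities such as data sharding, elastic scaling, and encryption.

The design philosophy of Apache ShardingSphere is Database Plus, which aims to build standards and an ecosystem on top of heterogeneous databases. It focuses on how to make full and reasonable use of the compute and storage capabilities of databases rather than on building an entirely new database. Taking a view from above the database layer, it cares more about collaboration between databases than about the databases themselves.

ShardingSphere-Proxy is positioned as a transparent database proxy. In theory it supports any client that speaks the MySQL, PostgreSQL, or openGauss protocol, which makes it friendlier to heterogeneous languages and operations scenarios.

ShardingSphere-Proxy is non-intrusive to application code: users only need to change the database connection string to get features such as data sharding and read/write splitting. Because it is part of the data infrastructure, its own high availability is very important.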

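Deploying with Terraform

We hope you will deploy and manage your ShardingSphere Proxy cluster the IaC way and enjoy the benefits IaC brings.

With that in mind, we plan to use Terraform to create a highly available ShardingSphere-Proxy cluster that spans multiple availability zones.

Before writing the Terraform configuration, we first need to understand the basic architecture of a ShardingSphere-Proxy cluster.

In this architecture, we use ZooKeeper as the Governance Center.

ShardingSphere-Proxy itself is a stateless application. In practice it only needs a load balancer in front of it, which distributes traffic elastically across the instances.

To ensure high availability for both the ZooKeeper cluster and the ShardingSphere-Proxy cluster, we spread each of them across multiple availability zones, as the configuration in the following sections shows.

ZooKeeper cluster

Defining input variables

To make the configuration reusable, we define a set of variables: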

variable "cluster_size" {
  type        = number
  description = "The cluster size, which must equal the number of availability zones"
}

variable "key_name" {
  type        = string
  description = "The ssh keypair for remote connection"
}

variable "instance_type" {
  type        = string
  description = "The EC2 instance type"
}

variable "vpc_id" {
  type        = string
  description = "The id of VPC"
}

variable "subnet_ids" {
  type        = list(string)
  description = "List of subnets sorted by availability zone in your VPC"
}

variable "security_groups" {
  type        = list(string)
  default     = []
  description = "List of security group IDs; they must allow access to ports 2181, 2888, and 3888"
}


variable "hosted_zone_name" {
  type        = string
  default     = "shardingsphere.org"
  description = "The name of the hosted private zone"
}

variable "tags" {
  type        = map(any)
  description = "A map of tags for the zk instances; the default tag is Name=zk-<index>"
  default     = {}
}

variable "zk_version" {
  type        = string
  description = "The zookeeper version"
  default     = "3.7.1"
}

variable "zk_config" {
  default = {
    client_port = 2181
    zk_heap     = 1024
  }

  description = "The default config of zookeeper server"
}

These variables can also be changed later when installing the ShardingSphere-Proxy cluster.
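The security_groups variable expects groups that already allow the ZooKeeper ports. As a minimal sketch, assuming the group is created in the root configuration (where vpc_id is also declared) rather than inside the ZooKeeper module, such a group could look like the following. The resource name and the zk_client_cidr_blocks variable are hypothetical and not part of the original code.

```hcl
# Sketch only: intended for the root configuration, not the zk module itself.
# "zk_client_cidr_blocks" is a hypothetical variable introduced for illustration.
variable "zk_client_cidr_blocks" {
  type        = list(string)
  description = "CIDR blocks allowed to reach the ZooKeeper nodes (assumption)"
}

resource "aws_security_group" "zk" {
  name_prefix = "zk-"
  vpc_id      = var.vpc_id # assumes vpc_id is declared in the same configuration

  # 2181: client connections, 2888: follower-to-leader traffic, 3888: leader election.
  dynamic "ingress" {
    for_each = [2181, 2888, 3888]
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = var.zk_client_cidr_blocks
    }
  }

  # Allow all outbound traffic so the nodes can download packages and reach each other.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The resulting ID (aws_security_group.zk.id) could then be passed in the security_groups list. If client_port in zk_config is changed from 2181, the first port in the list would need to change with it.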

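Configuring the ZooKeeper cluster

For the ZooKeeper instances we use the AWS-native amzn2-ami-hvm image.

We use the count argument to deploy ZooKeeper; it tells Terraform to create var.cluster_size ZooKeeper nodes.

When creating the ZooKeeper instances, we use the ignore_changes argument to ignore manual changes to tags, so that a later Terraform run does not try to change the instances because of them.

We use cloud-init to initialize the ZooKeeper configuration; see [3] for details.

We create a domain name for each ZooKeeper node, and applications only need to use these domain names. This avoids problems caused by IP address changes when a ZooKeeper node restarts.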

data "aws_ami" "base" {
  owners = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-ebs"]
  }

  most_recent = true
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_network_interface" "zk" {
  count           = var.cluster_size
  subnet_id       = element(var.subnet_ids, count.index)
  security_groups = var.security_groups
}

resource "aws_instance" "zk" {
  count         = var.cluster_size
  ami           = data.aws_ami.base.id
  instance_type = var.instance_type
  key_name      = var.key_name

  network_interface {
    delete_on_termination = false
    device_index          = 0
    network_interface_id  = element(aws_network_interface.zk.*.id, count.index)
  }

  tags = merge(
    var.tags,
    {
      Name = "zk-${count.index}"
    }
  )

  user_data = base64encode(templatefile("${path.module}/cloud-init.yml", {
    version     = var.zk_version
    nodes       = range(1, var.cluster_size + 1)
    domain      = var.hosted_zone_name
    index       = count.index + 1
    client_port = var.zk_config["client_port"]
    zk_heap     = var.zk_config["zk_heap"]
  }))

  lifecycle {
    ignore_changes = [
      # Ignore changes to tags.
      tags
    ]
  }
}

data "aws_route53_zone" "zone" {
  name         = "${var.hosted_zone_name}."
  private_zone = true
}

resource "aws_route53_record" "zk" {
  count   = var.cluster_size
  zone_id = data.aws_route53_zone.zone.zone_id
  name    = "zk-${count.index + 1}"
  type    = "A"
  ttl     = 60
  records = element(aws_network_interface.zk.*.private_ips, count.index)
}

Defining outputs

After terraform apply runs successfully, it outputs the IP addresses of the ZooKeeper instances and their corresponding domain names.

output "zk_node_private_ip" {
  value       = aws_instance.zk.*.private_ip
  description = "The private ips of zookeeper instances"
}

output "zk_node_domain" {
  value       = [for v in aws_route53_record.zk.*.name : format("%s.%s", v, var.hosted_zone_name)]
  description = "The private domain names of zookeeper instances for use by ShardingSphere Proxy"
}

ShardingSphere-Proxy cluster

Defining input variables

Input variables are defined here for the same reason: to make the configuration reusable.

variable "cluster_size" {
  type        = number
  description = "The cluster size, which must equal the number of availability zones"
}

variable "shardingsphere_proxy_version" {
  type        = string
  description = "The shardingsphere proxy version"
}

variable "shardingsphere_proxy_asg_desired_capacity" {
  type        = number
  default     = 3
  description = "The desired capacity is the initial capacity of the Auto Scaling group at the time of its creation and the capacity it attempts to maintain. See https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-group.html#cfn-as-group-desiredcapacitytype. The default value is 3"
}

variable "shardingsphere_proxy_asg_max_size" {
  type        = number
  default     = 6
  description = "The maximum size of the ShardingSphere Proxy Auto Scaling group. The default value is 6"
}

variable "shardingsphere_proxy_asg_healthcheck_grace_period" {
  type        = number
  default     = 120
  description = "The amount of time, in seconds, that Amazon EC2 Auto Scaling waits before checking the health status of an EC2 instance that has come into service and marking it unhealthy due to a failed health check. see https://docs.aws.amazon.com/autoscaling/ec2/userguide/health-check-grace-period.html"
}

variable "image_id" {
  type        = string
  description = "The AMI id"
}

variable "key_name" {
  type        = string
  description = "the ssh keypair for remote connection"
}

variable "instance_type" {
  type        = string
  description = "The EC2 instance type"
}

variable "vpc_id" {
  type        = string
  description = "The id of your VPC"
}

variable "subnet_ids" {
  type        = list(string)
  description = "List of subnets sorted by availability zone in your VPC"
}

variable "security_groups" {
  type        = list(string)
  default     = []
  description = "List of security group IDs"
}

variable "lb_listener_port" {
  type        = string
  description = "lb listener port"
}

variable "hosted_zone_name" {
  type        = string
  default     = "shardingsphere.org"
  description = "The name of the hosted private zone"
}

variable "zk_servers" {
  type        = list(string)
  description = "The Zookeeper servers"
}
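To show how the two sets of variables fit together, here is a minimal root-module sketch that wires the ZooKeeper outputs into the proxy module. It assumes the ZooKeeper and proxy configurations live in separate module directories; the module paths, names, IDs, instance type, and version number are all illustrative placeholders rather than values from the original code.

```hcl
# Root-module sketch. Paths, IDs, and versions below are hypothetical placeholders.
module "zk" {
  source = "./zk" # assumed module path

  cluster_size    = 3
  key_name        = "my-keypair"
  instance_type   = "t3.small"
  vpc_id          = "vpc-0123456789abcdef0"
  subnet_ids      = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]
  security_groups = ["sg-0123456789abcdef0"]
}

module "shardingsphere_proxy" {
  source = "./shardingsphere" # assumed module path

  cluster_size                 = 3
  shardingsphere_proxy_version = "5.3.1" # placeholder version
  image_id                     = "ami-0123456789abcdef0"
  key_name                     = "my-keypair"
  instance_type                = "t3.medium"
  vpc_id                       = "vpc-0123456789abcdef0"
  subnet_ids                   = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]
  security_groups              = ["sg-0123456789abcdef0"]
  lb_listener_port             = "3307"

  # The zk_node_domain output of the ZooKeeper module feeds the proxy's zk_servers input.
  zk_servers = module.zk.zk_node_domain
}
```

Referencing module.zk.zk_node_domain also gives Terraform an implicit dependency, so the ZooKeeper records are created before the proxy launch template is rendered.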

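Configuring the ShardingSphere-Proxy cluster

Configuring the AutoScalingGroup

We create an AutoScalingGroup to manage the ShardingSphere-Proxy instances. Its health check type is set to "ELB", so that when the load balancer's health check on an instance fails, the AutoScalingGroup can promptly remove the unhealthy node.

Changes to load_balancers and target_group_arns on the AutoScalingGroup are ignored, because the target group is attached through a separate aws_autoscaling_attachment resource.

We likewise use cloud-init to configure the ShardingSphere-Proxy instances; see [4] for details.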

resource "aws_launch_template" "ss" {
  name                                 = "shardingsphere-proxy-launch-template"
  image_id                             = var.image_id
  instance_initiated_shutdown_behavior = "terminate"
  instance_type                        = var.instance_type
  key_name                             = var.key_name
  iam_instance_profile {
    name = aws_iam_instance_profile.ss.name
  }

  user_data = base64encode(templatefile("${path.module}/cloud-init.yml", {
    version       = var.shardingsphere_proxy_version
    version_elems = split(".", var.shardingsphere_proxy_version)
    zk_servers    = join(",", var.zk_servers)
  }))

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 1
    instance_metadata_tags      = "enabled"
  }

  monitoring {
    enabled = true
  }

  vpc_security_group_ids = var.security_groups

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "shardingsphere-proxy"
    }
  }
}

resource "aws_autoscaling_group" "ss" {
  name                      = "shardingsphere-proxy-asg"
  availability_zones        = data.aws_availability_zones.available.names
  desired_capacity          = var.shardingsphere_proxy_asg_desired_capacity
  min_size                  = 1
  max_size                  = var.shardingsphere_proxy_asg_max_size
  health_check_grace_period = var.shardingsphere_proxy_asg_healthcheck_grace_period
  health_check_type         = "ELB"

  launch_template {
    id      = aws_launch_template.ss.id
    version = "$Latest"
  }

  lifecycle {
    ignore_changes = [load_balancers, target_group_arns]
  }
}

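Configuring the load balancer

The AutoScalingGroup created in the previous step is attached to the load balancer, and traffic passing through the load balancer is automatically routed to the ShardingSphere-Proxy instances created by the AutoScalingGroup.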

resource "aws_lb_target_group" "ss_tg" {
  name               = "shardingsphere-proxy-lb-tg"
  port               = var.lb_listener_port
  protocol           = "TCP"
  vpc_id             = var.vpc_id
  preserve_client_ip = false

  health_check {
    protocol            = "TCP"
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }

  tags = {
    Name = "shardingsphere-proxy"
  }
}

resource "aws_autoscaling_attachment" "asg_attachment_lb" {
  autoscaling_group_name = aws_autoscaling_group.ss.id
  lb_target_group_arn    = aws_lb_target_group.ss_tg.arn
}


resource "aws_lb_listener" "ss" {
  load_balancer_arn = aws_lb.ss.arn
  port              = var.lb_listener_port
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.ss_tg.arn
  }

  tags = {
    Name = "shardingsphere-proxy"
  }
}
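The listener above and the Route 53 record in the next section both reference aws_lb.ss, whose definition is not shown in this article. A minimal sketch of such a resource follows. It assumes an internal Network Load Balancer, since the listener and target group use TCP and the domain lives in a private hosted zone; the name is a hypothetical placeholder.

```hcl
# Assumed definition of the load balancer referenced as aws_lb.ss.
# TCP listeners require a Network Load Balancer; "internal" is an assumption
# based on the private hosted zone used for the proxy domain.
resource "aws_lb" "ss" {
  name               = "shardingsphere-proxy-lb" # hypothetical name
  internal           = true
  load_balancer_type = "network"
  subnets            = var.subnet_ids

  tags = {
    Name = "shardingsphere-proxy"
  }
}
```

Placing the load balancer in every subnet from subnet_ids keeps one load balancer node per availability zone, which matches the multi-AZ goal of the cluster.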

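Configuring the domain name

We create an internal domain name, proxy.shardingsphere.org by default, which points to the load balancer created in the previous step.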

data "aws_route53_zone" "zone" {
  name         = "${var.hosted_zone_name}."
  private_zone = true
}

resource "aws_route53_record" "ss" {
  zone_id = data.aws_route53_zone.zone.zone_id
  name    = "proxy"
  type    = "A"

  alias {
    name                   = aws_lb.ss.dns_name
    zone_id                = aws_lb.ss.zone_id
    evaluate_target_health = true
  }
}

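Configuring CloudWatch

We create a role with CloudWatch permissions that EC2 can assume through STS. The role is attached to the ShardingSphere-Proxy instances created by the AutoScalingGroup.

The ShardingSphere-Proxy runtime logs are collected by the CloudWatch Agent and sent to CloudWatch. By default, a log group named shardingsphere-proxy.log is created.

See [5] for the detailed CloudWatch configuration.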

resource "aws_iam_role" "sts" {
  name = "shardingsphere-proxy-sts-role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "ss" {
  name = "shardingsphere-proxy-policy"
  role = aws_iam_role.sts.id

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:PutMetricData",
        "ec2:DescribeTags",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups",
        "logs:CreateLogStream",
        "logs:CreateLogGroup"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_iam_instance_profile" "ss" {
  name = "shardingsphere-proxy-instance-profile"
  role = aws_iam_role.sts.name
}

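Deployment

Once all the Terraform configuration is in place, you can deploy the ShardingSphere-Proxy cluster. Before actually deploying, we recommend running the following command to check that the configuration will do what you expect.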

terraform plan

After reviewing the plan, you can apply it for real by running the following command:

terraform apply

The complete code is available at [6]. For more information, see our website[7].

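Testing

The goal of testing is to show that the created cluster works. We use a simple case: add two data sources and create a simple sharding rule with DistSQL, then insert data and check that queries return the correct results.

By default we create an internal domain name, proxy.shardingsphere.org, and the username and password of the ShardingSphere-Proxy cluster are both root.

Note:

DistSQL (Distributed SQL) is the operating language specific to Apache ShardingSphere. It is used in exactly the same way as standard SQL and provides SQL-level operations for incremental features; see [8] for details.

Summary

Terraform is a very useful tool for implementing IaC, and it is especially handy for iterating on a ShardingSphere-Proxy cluster. We hope this article helps anyone interested in ShardingSphere and Terraform.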

References


SphereEx · February 2, 2023, 03:15