Table of Contents
- Before You Begin
- What is container orchestration and why do I need it?
- Terminology is a barrier. Kubernetes objects explained
- Kubernetes Architecture Deep Dive
Before You Begin
Before you begin this walkthrough, ensure you are logged onto the PKS desktop by following the instructions found here.
What is container orchestration and why do I need it?
Developers aren't quite sure how to operationalize all of these disparate container workloads, but they do know that automated orchestration is the key.
What does that mean?
Containers need to be distributed across container hosts in a way that levels the use of host resources. Virtual Machine placement on vSphere hosts can be handled by the Distributed Resource Scheduler (DRS). A similar capability is needed for containers. The physical resources need isolation capability - the ability to define availability zones or regions. Affinity and anti-affinity become important as well. Some container workloads must run in close proximity to others for performance - or to provide availability, must run on separate physical hosts.
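The placement ideas above (levelling host usage and honouring anti-affinity) can be illustrated with a minimal sketch. This is not how DRS or Kubernetes implement placement; the `Host` class, the `place` function, and the abstract capacity units are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    capacity: int                              # abstract resource units (illustrative)
    used: int = 0
    apps: set = field(default_factory=set)     # names of apps already placed here

def place(hosts, app, demand, anti_affinity=False):
    """Pick the least-loaded host that fits; optionally skip hosts already running `app`."""
    candidates = [
        h for h in hosts
        if h.capacity - h.used >= demand
        and not (anti_affinity and app in h.apps)
    ]
    if not candidates:
        return None                            # nothing satisfies the constraints
    best = min(candidates, key=lambda h: h.used / h.capacity)
    best.used += demand
    best.apps.add(app)
    return best.name

hosts = [Host("host-a", 10), Host("host-b", 10)]
first = place(hosts, "web", 4, anti_affinity=True)
second = place(hosts, "web", 4, anti_affinity=True)
third = place(hosts, "web", 4, anti_affinity=True)   # no host left without a "web" copy
```

With anti-affinity enabled, the two copies land on different hosts and the third request is refused, even though capacity remains.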
The ecosystem of tools available to the operations team today tends to stop at the host operating system - without providing views into the containers themselves. Container-aware tools are becoming available, but are not yet widely adopted. Monitoring of running container applications and recovery upon failure must be addressed. Container images need to be managed. Teams need a mechanism for image isolation, such as role-based access control and signing of content. Image upgrade and rollout to running applications must be addressed.
Orchestration must also include the capability to scale the application up or down to provide for changes in resource consumption or availability requirements.
Containers are ephemeral. They are short lived and are expected to die. When they restart or are recreated, how do other applications find them? Service Discovery is critical to operationalizing containers at scale. Service Endpoints need to be redundant and support Load Balancing. They should also auto scale as workloads increase.
Kubernetes is an open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure. With Kubernetes you can:
• Deploy your applications quickly and predictably
• Scale your applications on the fly
• Seamlessly roll out new features
• Optimize use of your hardware by using only the resources you need
Terminology is a barrier. Kubernetes objects explained
Many people new to the container space and Kubernetes get hung up on all of the new terminology. Before jumping into the details of the platform, we are going to spend a little time defining some of the terms that will be used later on to describe the function of the platform.
The goal is to provide some level of depth on these topics. However, if you find that this is more than you need, skip to Module 2 and start using Kubernetes and Pivotal Container Service (PKS).
A cluster is very simply the physical or virtual machines and other infrastructure resources used by Kubernetes to run your applications. You prepare a set of machines, create networking and attach storage, then install the Kubernetes system services. Now you have a running cluster. This does not mean that there is any sort of traditional clustering technology in the infrastructure sense - nor does it align with vSphere clustering constructs. That has been a point of confusion for many VMware administrators. A cluster is simply a set of VMs, wired together, with attached local or shared storage - and running the Kubernetes System services.
A node is any of the physical machines or VMs that make up the Kubernetes cluster. Nodes are of two types: Master (sometimes called Leader) and Worker. Some Master based services can be broken out into their own set of VMs and would also be referred to as nodes (we will get to Etcd shortly). Master nodes run the kube-system services. The Worker nodes run an agent and networking proxy, but are primarily thought of as the set of nodes that run the pods.
Pods are the smallest deployable units of computing that can be created and managed in Kubernetes. The containers in a pod are always co-located and co-scheduled, and run in a shared context. A pod models an application-specific logical host - it contains one or more application containers which are relatively tightly coupled. The shared context of a pod is a set of Linux namespaces, cgroups, and potentially other facets of isolation - the same things that isolate a Docker container.
In this sample pod, there are three application containers: the Nginx webserver, along with ssh and logging daemons. In a non-container deployment, all three of these would probably run as individual processes on a single VM. Containers generally run a single process to keep them lightweight and avoid the need for init configuration. Notice in the image that there is also a Pause container. This container actually hosts the networking stack; the other three containers share its IP and listen on different ports. This allows all containers in a pod to communicate via localhost. Notice that the pod in this example has a single IP: 10.24.0.2 on a network that is generally private to the Kubernetes cluster. The pod is a logical abstraction that is managed by Kubernetes. If you log onto a Kubernetes node VM and look for pods, you won't find them through Docker. You will be able to see a set of containers, but no pods. You will find the pods through the Kubernetes CLI or UI.
A Replica Set ensures that a specified number of pod replicas are running at any given time. A replication controller process watches the current state of pods and matches that with the desired state specified in the pod declaration. If there is a difference, because a pod has exited, it attempts to make the desired state and current state consistent by starting another pod. Developers may choose to define replica sets to provide application availability and/or scalability. This definition is handled through a configuration file defined in yaml or json syntax.
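The desired-state versus current-state comparison described above can be sketched as a single pass of a control loop. This is a simplified illustration, not the actual controller code; the `reconcile` function and the `web-N` pod names are hypothetical.

```python
import itertools

_counter = itertools.count(1)

def new_pod_name():
    """Generate a fresh, never-reused pod name (pods are recreated, not restarted)."""
    return f"web-{next(_counter)}"

def reconcile(desired_replicas, running_pods):
    """Return the pod list after one control-loop pass."""
    pods = list(running_pods)
    while len(pods) < desired_replicas:    # too few: start replacement pods
        pods.append(new_pod_name())
    return pods[:desired_replicas]         # too many: drop the extras

pods = reconcile(3, [])        # initial rollout creates web-1, web-2, web-3
pods.remove("web-2")           # simulate a pod exiting
pods = reconcile(3, pods)      # the loop notices the gap and starts web-4
```

Note that the replacement pod gets a new identity rather than reusing the old one, which is why the service abstraction described next is needed.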
Kubernetes pods are ephemeral. They are created and when they die, they are recreated - not restarted. While each pod gets its own IP address, even those IP addresses cannot be relied upon to be stable over time. This leads to a problem: if some set of pods - like Redis slave (Redis is a Key/Value store with Master/Slave architecture) - provides functionality to other pods - like a frontend Webserver - inside the Kubernetes cluster, how do those frontends find and keep track of which backends are in that set?
A Kubernetes Service is an abstraction which defines a logical set of pods and a policy by which to access them - sometimes called a micro-service. The set of pods targeted by a service is (usually) determined by a label selector (explained in the Labels and Selectors section below). A service generally defines a ClusterIP and port for access and provides East/West Load Balancing across the underlying pods.
Let's look at this in the context of the diagram above. There are two Redis-slave pods - each with its own IP (10.24.0.5, 10.24.2.7). When the service is created, it is told that all pods with the label Redis-slave are part of the service. The IPs are updated in the endpoints object for the service. Now when another object references the service (through either the service ClusterIP (172.30.0.24) or its DNS entry), it can load balance the request across the set of pods.
Kubernetes includes its own DNS for internal domain lookups and each service has a record based on its name (redis-slave).
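A minimal sketch of why the service helps: clients keep using a stable name and ClusterIP while the endpoints behind it change. The `Service` class is hypothetical, round-robin selection is a simplification of the real load-balancing behaviour, and the replacement pod IP 10.24.1.9 is invented for the example; the other addresses come from the diagram discussed above.

```python
class Service:
    def __init__(self, name, cluster_ip, endpoints):
        self.name = name
        self.cluster_ip = cluster_ip        # stable address that clients use
        self.endpoints = list(endpoints)    # pod IPs, which change over time
        self._next = 0

    def route(self):
        """Pick the next backend pod IP in round-robin order."""
        ip = self.endpoints[self._next % len(self.endpoints)]
        self._next += 1
        return ip

    def replace_endpoint(self, old_ip, new_ip):
        """Record that a pod was recreated with a different IP."""
        self.endpoints[self.endpoints.index(old_ip)] = new_ip

redis = Service("redis-slave", "172.30.0.24", ["10.24.0.5", "10.24.2.7"])
dns = {redis.name: redis.cluster_ip}        # one DNS record per service name

first_two = [redis.route(), redis.route()]
redis.replace_endpoint("10.24.2.7", "10.24.1.9")   # hypothetical new pod IP
after_recreate = [redis.route(), redis.route()]
```

The frontend never needs to learn the new pod IP; it continues to resolve `redis-slave` to the same ClusterIP.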
To this point we have only talked about internal access to the service. What if the service is a web server and users must access it from outside the cluster? Remember that the IPs aren't routable outside the private cluster overlay network. In that case there are several options - Ingress Servers, North/South Load Balancing, and NodePort. In the service declaration, a specification of type NodePort means that each cluster node will be configured so that a single port is exposed for this service. So a user could get access to the frontend web service in the diagram by specifying the IP address of any node in the cluster, along with the NodePort for the frontend service. The service then provides East/West load balancing across the pods that make up the service. In our lab we are using NSX to provide the networking. NSX provides the capability to define a Load Balancer which will proxy access to the underlying Services.
Labels and Selectors
The esoteric definition is as follows:
• Key:Value pairs that can be attached to any Kubernetes object (pods, nodes, services)
• Ex: Identify releases (Beta, Prod), Environments (Dev, Prod), Tiers (Frontend, Backend)
• Selectors are the mechanism for group Filtering based on the labels
A more straightforward way to say this is that Kubernetes is architected to take action on sets of objects. The sets of objects that a particular action might occur on are defined through labels. We just saw one example of that, where a service knows the set of pods associated with it because a selector (like run:redis-slave) was defined on it and a set of pods was defined with a label of run:redis-slave. This methodology is used throughout Kubernetes to group objects.
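Label selection can be sketched as a simple filter over key:value pairs. This shows equality-based matching only; the object names and the `env` and `tier` labels are hypothetical examples in the spirit of the list above, and the `matches` and `select` helpers are not Kubernetes APIs.

```python
def matches(selector, labels):
    """True when every key:value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())

def select(selector, objects):
    """Return the names of all objects whose labels satisfy the selector."""
    return [obj["name"] for obj in objects if matches(selector, obj["labels"])]

objects = [
    {"name": "redis-slave-a", "labels": {"run": "redis-slave", "env": "prod"}},
    {"name": "redis-slave-b", "labels": {"run": "redis-slave", "env": "dev"}},
    {"name": "frontend-a",    "labels": {"tier": "frontend", "env": "prod"}},
]

redis_pods = select({"run": "redis-slave"}, objects)
prod_redis = select({"run": "redis-slave", "env": "prod"}, objects)
```

Adding a second key to the selector narrows the set, which is how the same pods can be grouped differently by release, environment, or tier.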
A deployment is a declarative object for defining your desired Kubernetes application state. It includes the number of replicas and handles the roll-out of application updates. Deployments provide declarative updates for pods and replica sets (the next-generation replication controller). You only need to describe the desired state in a deployment object, and the deployment controller will change the actual state to the desired state at a controlled rate for you. Think of it as a single object that can, among other things, define a set of pods and the number of replicas, while supporting upgrade/rollback of pod image versions.
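The phrase "at a controlled rate" can be pictured with a small sketch that replaces pods in batches. This is an illustration under simplifying assumptions, not the deployment controller's algorithm; the `rolling_update` function, its `max_unavailable` parameter, and the `ui:V1` / `ui:V2` image names are hypothetical.

```python
def rolling_update(pod_images, new_image, max_unavailable=1):
    """Yield a snapshot of the pod images after each batch is replaced."""
    pods = list(pod_images)
    for start in range(0, len(pods), max_unavailable):
        # Replace at most `max_unavailable` pods before reporting progress.
        for i in range(start, min(start + max_unavailable, len(pods))):
            pods[i] = new_image
        yield list(pods)

upgrade = list(rolling_update(["ui:V1"] * 3, "ui:V2"))
rollback = list(rolling_update(upgrade[-1], "ui:V1", max_unavailable=3))
```

A rollback is simply another update whose target is the previous image version.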
Namespaces are intended for use in environments with many users spread across multiple teams, or projects. Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces. They are a way to divide cluster resources between multiple uses. As Kubernetes continues to evolve, namespaces will provide true multi-tenancy for your cluster. They are only partially there at this point. By default, all resources in a Kubernetes cluster are created in a default namespace. A pod will run with unbounded CPU and memory requests/limits. A Kubernetes Namespace allows users to partition created resources into a logically named group. Each namespace provides:
• a unique scope for resources to avoid name collisions
• policies to ensure appropriate authority to trusted users
• ability to specify constraints for resource consumption
This allows the resources of a Kubernetes cluster to be shared by multiple groups, with different levels of QoS provided to each group. Resources created in one namespace are hidden from other namespaces. Multiple namespaces can be created, each potentially with different constraints.
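Two of the namespace properties listed above, name scoping and consumption constraints, can be sketched as follows. The `Cluster` class, its methods, and the single CPU quota are assumptions for illustration; real namespaces and quotas are Kubernetes objects and behave in richer ways.

```python
class Cluster:
    def __init__(self):
        self.resources = {}   # (namespace, name) -> requested CPU units
        self.quotas = {}      # namespace -> maximum total CPU units

    def create(self, name, cpu, namespace="default"):
        key = (namespace, name)
        if key in self.resources:
            raise ValueError(f"{name} already exists in namespace {namespace}")
        used = sum(c for (ns, _), c in self.resources.items() if ns == namespace)
        quota = self.quotas.get(namespace)     # None means unbounded
        if quota is not None and used + cpu > quota:
            raise ValueError(f"quota exceeded in namespace {namespace}")
        self.resources[key] = cpu

    def names(self, namespace="default"):
        """List only the resources visible inside one namespace."""
        return sorted(n for (ns, n) in self.resources if ns == namespace)

cluster = Cluster()
cluster.quotas["dev"] = 2
cluster.create("web", 1, namespace="dev")
cluster.create("web", 1, namespace="prod")   # same name, different scope: allowed
```

Creating a second `web` in `dev`, or a resource that pushes `dev` past its quota, raises an error in this sketch.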
You will see how namespaces are used in Module 2.
Load balancing in Kubernetes can be a bit of a confusing topic. The Kubernetes cluster section shows an image with load balancers. Those represent balancing requests to the Kubernetes control plane, specifically the API Server. But what if you deploy a set of pods and need to load balance access to them? We have previously discussed services. In addition to discovery, services also provide load balancing of requests across the set of pods that make up the service. This is known as East/West load balancing and is internal to the cluster. If there is a need for ingress to a service from an external network, and a requirement to load balance that access, this is known as North/South load balancing. There are three primary implementation options:
• Create a service with type ‘LoadBalancer’. This is platform dependent and requires that the load balancer distributing inbound traffic is created through an external load balancer service. NSX provides load balancing for clusters deployed through Pivotal Container Service (PKS).
• Statically configure an external load balancer (like F5) that sends traffic to a K8s Service over ‘NodePort’ on specific nodes. In this case, the configuration is done directly on the external load balancer after the service is created and the NodePort is known.
• Create a Kubernetes Ingress. This is a Kubernetes object that describes a North/South load balancer. The Kubernetes ingress object is ‘watched’ by an ingress controller that configures the load balancer datapath. Usually both the ingress controller and the load balancer datapath are running as pods. This requires that an ingress controller be created, but may be the most flexible solution. NSX-T provides an ingress controller.
Sample Restaurant Rating Application
This simple application captures votes for a set of restaurants, provides a running graphical tally and captures the number of page views. It contains four separate deployments: UI, Application Server, Postgres DB and Redis caching server. A deployment provides a declarative method for defining pods, replica sets and other Kubernetes constructs. The UI Deployment includes a UI pod, which runs an Nginx Webserver. It uses a replica set that maintains three running copies of the UI pod. It also defines a UI service that provides an abstraction to the underlying UI pods, including a ClusterIP and Load Balancer that can be used to access the service.
The application is using a Redis Key:Value store to capture page views and a Postgres Database to persist the votes. Let's now dig into the configuration files that would be needed to define this application.
The files for creating the deployments and their services can be in yaml or json format. Usually yaml is used because it is easier to read. Below are the yaml files used to create the UI deployment and the UI service. The other yaml files are available as part of module 4.
This file defines the deployment specification. Think of it as the desired state for the deployment. It has a name - yelb-ui. It defines a replica set that includes 1 replica. That means the desired state for this deployment is that 1 copy of the pod is running. Labels are defined for these pods. You will see below that the service definition will use these to define the pods that are covered by the service. The container in the pod will be based on the harbor.corp.local/library/restreview-ui:V1 image. Resources can be constrained for the container based on the requests key. Lastly, the container will be listening on port 80.
Remember that this is container port 80 and must be mapped to some host port in order to access it from an external network.
This file defines the UI service specification. The important pieces are the Type: LoadBalancer and the Selector. Specifying Type: LoadBalancer means that NSX will associate a Load Balancer with this service to provide external access to the application. The service will then route requests to one of the pods that has a label from the service's selector. So all pods with matching labels will be included in this service. Note: NSX actually changes the routing mechanism from what is described here, but it logically works this way.
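The two specifications described above can be sketched as Python dictionaries and serialized to JSON, one of the two formats these files may use. This is an approximation of the shape of such files, not the lab's actual manifests: the `app` label, the image placeholder, the resource request values, the service name, and the service port mapping are all assumptions, and the `apiVersion` values reflect current Kubernetes conventions rather than what the lab necessarily uses.

```python
import json

# Hypothetical values; the lab's real manifests may differ.
LABELS = {"app": "yelb-ui"}
IMAGE = "registry.example/library/ui:V1"   # placeholder, not the lab image

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "yelb-ui"},
    "spec": {
        "replicas": 1,                                  # desired state: one pod
        "selector": {"matchLabels": LABELS},
        "template": {
            "metadata": {"labels": LABELS},             # labels placed on each pod
            "spec": {
                "containers": [{
                    "name": "yelb-ui",
                    "image": IMAGE,
                    "resources": {"requests": {"cpu": "100m", "memory": "64Mi"}},
                    "ports": [{"containerPort": 80}],
                }]
            },
        },
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "yelb-ui"},
    "spec": {
        "type": "LoadBalancer",                         # request external access
        "selector": LABELS,                             # pods covered by the service
        "ports": [{"port": 80, "targetPort": 80}],
    },
}

def service_covers(service, deployment):
    """True when the service selector matches the deployment's pod labels."""
    pod_labels = deployment["spec"]["template"]["metadata"]["labels"]
    selector = service["spec"]["selector"]
    return all(pod_labels.get(k) == v for k, v in selector.items())

manifests = json.dumps([deployment, service], indent=2)
```

The `service_covers` check makes the link explicit: the service finds its pods only because its selector and the pod template labels agree.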
Kubernetes Architecture Deep Dive
At a very high level, the Kubernetes cluster contains a set of Master services that may be contained in a single VM or broken out into multiple VMs. The Master includes the Kubernetes API, which is a set of services used for all internal and external communications. Etcd is a distributed key-value store that holds all persistent metadata for the Kubernetes cluster. The scheduler is a Master service that is responsible for scheduling container workloads onto the Worker nodes. Worker nodes are VMs that are placed across ESXi hosts. Your applications run as a set of containers on the worker nodes. Kubernetes defines a container abstraction called a pod, which can include one or more containers. Worker nodes run the Kubernetes agent, called Kubelet, which proxies calls to the container runtime daemon (Docker or others) for container create/stop/start/etc. Etcd also provides an interesting capability for "Watches" to be defined on its data, so that any service that must act when metadata changes simply watches that key:value and takes its appropriate action.
Kubernetes Master Nodes
A Kubernetes cluster can have one or more master VMs and generally will have etcd deployed redundantly across three VMs.
API Server: Target for all operations to the data model. External API clients like the Kubernetes CLI client and the dashboard Web-Service, as well as all external and internal components, interact with the API Server by ‘watching’ and ‘setting’ resources.
Scheduler: Monitors container (pod) resources on the API Server, and assigns Worker nodes to run the pods based on filters.
Controller Manager: Embeds the core control loops shipped with Kubernetes. In Kubernetes, a controller is a control loop that watches the shared state of the cluster through the API Server and makes changes attempting to move the current state towards the desired state.
Etcd: Is used as the distributed key-value store of Kubernetes.
In etcd and Kubernetes everything is centered around ‘watching’ resources. Every resource can be watched on etcd through the API Server.
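The watch pattern can be sketched with a tiny in-memory key-value store that notifies registered callbacks on every write. This is a conceptual illustration only, not the etcd client API; the `WatchableStore` class and the `/pods/web` key are hypothetical.

```python
class WatchableStore:
    def __init__(self):
        self._data = {}
        self._watchers = {}    # key -> list of callbacks

    def watch(self, key, callback):
        """Register a callback to run whenever `key` is written."""
        self._watchers.setdefault(key, []).append(callback)

    def put(self, key, value):
        """Store the value, then notify every watcher of that key."""
        self._data[key] = value
        for callback in self._watchers.get(key, []):
            callback(key, value)

    def get(self, key):
        return self._data.get(key)

store = WatchableStore()
events = []
store.watch("/pods/web", lambda key, value: events.append((key, value)))

store.put("/pods/web", {"node": None})        # watcher fires
store.put("/pods/other", {"node": None})      # no watcher on this key
store.put("/pods/web", {"node": "worker-1"})  # watcher fires again
```

Components react to changes instead of polling, which is the behaviour the scheduler and Kubelet rely on in the workflow described later.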
Kubernetes Worker Nodes
The Kubelet agent on the nodes is watching for ‘PodSpecs’ to determine what it is supposed to run, and instructs container runtimes to run containers through the container runtime API interface. PodSpecs are defined through the yaml configuration files seen earlier.
Docker: Is the most used container runtime in Kubernetes. However, K8s is ‘runtime agnostic’, and the goal is to support any runtime through a standard interface, the Container Runtime Interface (CRI). CRI-O is one implementation of that interface.
Besides Docker, rkt by CoreOS is the most visible alternative, and CoreOS drives a lot of standards work such as CNI (check out https://www.cncf.io/ for more on these standards and projects).
Kube-Proxy: Is a daemon watching the K8s ‘services’ on the API Server and implements East/West load-balancing on the nodes using NAT in IPTables.
Let's look at a sample workflow. This is a high level view and may not represent the exact workflow, but is a close approximation. A user wants to create a pod through the CLI, UI or using the API through their own code. The request comes to the Kubernetes API Server. The API Server instantiates a pod object and updates etcd with the information. The scheduler is watching for pod objects that have no node associated with them. The scheduler sees the new pod object and goes through its algorithm for finding a node to place the pod (available resources, node selector criteria, etc.). The scheduler updates the pod information (through the API Server) to include the placement node. On that node, Kubelet is watching (through the API Server) for pod objects that are assigned to its node. Once it sees the new pod object, it begins to instantiate the pod. Kubelet will call the container runtime engine to instantiate the set of containers that make up the pod. Once the pod is running and has an IP address, that information is updated in etcd through the API Server so that the new Endpoint can be found.
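The workflow above can be walked through in a small simulation. Like the description itself, this is an approximation: a plain dictionary stands in for etcd behind the API Server, each component is a function called once rather than a long-running watcher, and every name here (`api_create_pod`, `scheduler_pass`, `kubelet_pass`, the node labels, and the pod IP) is hypothetical.

```python
store = {}   # stands in for etcd; components only touch it via these functions

def api_create_pod(name, cpu_request, node_selector=None):
    """API Server: persist a new pod object with no node assigned yet."""
    store[name] = {"cpu": cpu_request, "selector": node_selector or {},
                   "node": None, "phase": "Pending", "ip": None}

def scheduler_pass(nodes):
    """Scheduler: bind each unassigned pod to the first node that passes the filters."""
    for pod in store.values():
        if pod["node"] is not None:
            continue
        for node_name, node in nodes.items():
            labels_ok = all(node["labels"].get(k) == v
                            for k, v in pod["selector"].items())
            if labels_ok and node["free_cpu"] >= pod["cpu"]:
                node["free_cpu"] -= pod["cpu"]
                pod["node"] = node_name
                break

def kubelet_pass(node_name, ip_pool):
    """Kubelet: start pods bound to this node and report their IP addresses."""
    for pod in store.values():
        if pod["node"] == node_name and pod["phase"] == "Pending":
            pod["ip"] = ip_pool.pop(0)     # the container runtime would run here
            pod["phase"] = "Running"

nodes = {
    "worker-1": {"labels": {"disk": "hdd"}, "free_cpu": 4},
    "worker-2": {"labels": {"disk": "ssd"}, "free_cpu": 4},
}
api_create_pod("web", cpu_request=2, node_selector={"disk": "ssd"})
scheduler_pass(nodes)
kubelet_pass("worker-2", ["10.24.1.9"])
```

After these three steps the pod object records its node, phase, and IP, which is the information a service needs to add the pod as an endpoint.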
Now that you know a little about how Kubernetes works, move on to Module 2 and see how to deploy and manage your clusters with Pivotal Container Service (PKS).