
ICT Infrastructures - University of Pisa (Italy)

Since there is little material available for the ICT Infrastructures course, this is a recap and summary of the classes. The notes are a compilation of the course contents and focus on the topics in accordance with Prof. Antonio Cisternino's OneNote Notebook.
It is highly recommended to study with the EMC DELL slides provided under <<_Raccolta contenuto>>, which will not be uploaded here for copyright reasons. Each heading corresponds to a module. If you find any errors, please fork and submit a pull request!


Introduction

The ICT world is changing (and will keep changing beyond the last time these notes were updated) and a lot of axioms about its infrastructures are becoming outdated. A couple of examples:

Cloud Computing Reference Model [Module 2]

Since the course revolves around Cloud Computing architectures, it is important to keep the following reference model of the cloud stack in mind:

  1. Physical Layer [Module 3]: Foundation layer of the cloud infrastructure.
    The physical infrastructure supporting the operation of the cloud
  2. Virtual Layer [Module 4]: Abstracts physical resources and makes them appear as virtual resources. e.g. a physical server is partitioned into many virtual ones to use the hardware better. The High Performance Computing model bypasses the virtual layer for performance reasons.
  3. Control Layer [Module 5]: Dynamic Resource configuration and allocation.
  4. Orchestration Layer [Module 6]: workflows for task automation.
  5. Service Layer [Module 6]: self-service portal/interface and service catalog. Allows cloud users to obtain the resources they need without knowing where they are allocated.
  6. Service Management [Module 9]: operates at the operational and business level.
  7. Business Continuity [Module 7]: ensures the availability of services in line with SLAs.
    e.g. Backups vs Replicas: doing a backup of 1 PB may be a problem.
    Fault Tolerance: it should be possible to power off a server without anyone noticing it.
    Live migration: upgrading the software or the firmware while the system is running.
  8. Security [Module 8]: Governance, Risk and Compliance. Also things like GDPR, phishing, antivirus, firewalls and DoS attacks.

Data centers

We start the course with datacenter design, seeing how a datacenter is built to support current and future design considerations, scalability, etc.

A data center is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup components and infrastructure for power supply, data communications connections, environmental controls (e.g. air conditioning, fire suppression) and various security devices. A large data center is an industrial-scale operation using as much electricity as a small town.

On average, only 6 people manage 1 million servers.
Prefabricated groups of racks, already cabled and cooled, are inserted into the datacenter as a unit (POD - Point Of Delivery). If something in the prefabricated unit is not working, the specific server is shut down. If more than 70% of it is not working, the POD producer simply replaces the entire unit.

The datacenter is a place where we concentrate IT systems in order to reduce costs. Servers are demanding in terms of current, cooling and security.

Design and Architectures

Cooling

Today cooling is air based; liquid cooling is just getting started.
The air pushed through the server gets a 10-15 degree temperature increase.

CRAC: Computer Room Air Conditioner

Popular in the '90s (3-5 kW/rack), but not very efficient in terms of energy consumption.
There is a raised floor, under which all the cabling runs and the cooling is performed. The air goes up by thermal convection, where it gets caught, cooled and re-introduced.

Drawbacks are density (if we want to go dense this approach fails) and the absence of locality. No one uses this technique today.

Hot/Cold aisles

The building blocks of this architecture are hot and cold aisles, with servers placed front-to-front and back-to-back; this optimizes cooling efficiency.

Workload balancing may be a problem: one rack can be hotter than another depending on the workload, so it is difficult to modulate the amount of hot and cold air. In the CRAC model the solution is pumping enough cold air for the biggest consumer, but it is not possible to act only where needed, which wastes energy. This problem is not present with the in-row cooling technology.

In-Row cooling

In-row cooling technology is a type of air conditioning system commonly used in data centers (15-60 kW/rack) in which the cooling unit is placed between the server cabinets in a row for offering cool air to the server equipment more effectively.

In-row cooling systems use a horizontal airflow pattern utilizing hot aisle/cold aisle configurations and they only occupy one-half rack of row space without any additional side clearance space. Typically, each unit is about 12 inches wide by 42 inches deep.

These units may be a supplement to raised-floor cooling (creating a plenum to distribute conditioned air) or may be the primary cooling source on a slab floor.

The in-row cooling unit draws warm exhaust air directly from the hot aisle, cools it and distributes it to the cold aisle. This ensures that inlet temperatures are steady for precise operation. Coupling the air conditioning with the heat source produces an efficient direct return air path; this is called close coupled cooling, which also lowers the fan energy required. In-row cooling also prevents the mixing of hot and cold air, thus increasing efficiency.

It's possible to give more cooling to a single rack by modulating the air provided. In front of the rack there are temperature and humidity sensors. Humidity should be avoided because it can condense due to temperature differences and therefore conduct electricity.
There are systems collecting data from the sensors and adjusting the fans. The racks are covered to separate cool air from hot air. It's also possible to optimize the datacenter cooling according to the temperature changes of the region where the datacenter is located, applying "static analysis" to the datacenter location in order to optimize resource consumption. Programs are available to simulate airflows in the datacenter in order to optimize the fans.

Usually every 2 racks (70 cm each) there should be a cooling row (30 cm).

Liquid cooling

Also called CoolIT, it consists of making water flow directly onto the CPUs.
Having water in a data center is a risky business, but this solution lowers the temperature by ~40%. One way of chilling the water could be pushing it down into the ground. There is a Water Distribution System, like the Power Distribution System.

A lot of research has lately been invested in oil-cooled computers, particularly in the context of High Performance Computing. This is a safer solution because mineral oil is not a conductor, which allows immersing everything in the oil to maximize the effectiveness of the cooling. The problem with this technique is that the cables slowly pump the oil out (by capillary action).

Other ideas

A typical approach to cooling the air is to place chillers outside the building, or to try geocooling, which revolves around using the cold available deep underground. The main idea is to dig a deep hole in the ground and run the cooling circuit through it.

Current

A 32 kW datacenter is small (even though it consumes as much current as 10 apartments).

For efficiency reasons, datacenters use direct current (DC) instead of alternating current (AC): a DC power architecture contains fewer components, which means less heat production and hence less energy loss. However, nowadays current is transported as AC, so a conversion to DC is required using Direct Current Transformers. Industrial current is delivered at 380 Volts in 3 phases. Also, note that direct current is more dangerous than alternating current.

Here cos(φ) ("cosfi") is the factor accounting for the heat dissipated when converting AC into DC, and it is a number <= 1.
It gives the efficiency of the transformation and generally changes according to the amount of current needed (idle vs under load). For example, an idle server with 2 CPUs (14 cores each) consumes 140 Watts.

Power Distribution

The amount of current allowed in a data center is expressed as the amperes available on the PDU (Power Distribution Unit).

There are one or more power lines (for reliability and fault-tolerance reasons) coming from different generators into the datacenter (e.g. each line carries 80 kW, roughly 200 A; that can feed 6 racks at 32 A per rack. Since a rack will likely not use the whole 32 A, more racks can be attached).

The lines are attached to a UPS (Uninterruptible Power Supply/Source). It is a rack or half a rack of batteries (not enough to keep the servers on for long) that in some cases can power the DC for ~20 minutes. They are also used to smooth out current oscillations. There are a control panel and a generator: when the power lines fail, the UPS stays active between their failure and the start of the generator, ensuring a smooth transition between energy sources. The energy that arrives at the UPS is then divided among the servers and the switches.

The UPS is attached to the PDU (Power Distribution Unit), which is linked to the servers. For redundancy reasons, a server is powered by a pair of lines, usually attached to two different PDUs. The server uses both lines, so that there is continuity in case one line fails. On the rack PDU the power plugs are arranged in a row and can be monitored via a web server running on the PDU itself.

Example of a rack PDU: 2 banks, 12 plugs each, 16 A per bank, 15 kW per rack, 42 servers per rack.

Power factor


Alternating current (AC) supplies our buildings and equipment. AC is more efficient for power companies to deliver, but when it hits the equipment's transformers, it exhibits a characteristic known as reactance.

Reactance reduces the useful power (watts) available from the apparent power (volt-amperes). The ratio of these two numbers is called the power factor (PF). Therefore, the actual power formula for AC circuits is watts = volts x amps x power factor. Unfortunately, the PF is rarely stated for most equipment, but it is always a number of 1.0 or less, and about the only thing with a 1.0 PF is a light bulb.

For years, large UPS systems were designed based on a PF of 0.8, which meant that a 100 kVA UPS would only support 80 kW of real power load.

The majority of large, commercial UPS systems are now designed with a PF of 0.9. This recognizes that most of today's computing technology presents a PF of between 0.95 and 0.98 to the UPS. Some UPS systems are even designed with PFs of 1.0, which means the kVA and kW ratings are identical (100 kVA = 100 kW). However, since the IT load never presents a 1.0 PF, for these UPS systems, the actual load limit will be the kVA rating.

Use the hardware manufacturers' online configurators if possible. As a last resort, use the server's power supply rating -- a server with a 300-Watt power supply can never draw 800 Watts. Size the power systems based on real demand loads.

Dual-corded equipment adds redundancy to IT hardware, and the lines share power load. If a dual-corded server has two 300-Watt power supplies, it can still draw no more than 300 Watts in your power design, because each power supply has to be able to handle the server's full load (not including power supply efficiency calculations).

The other way to estimate total server power consumption is to use industry norms. Unless you're hosting high performance computing, you can probably figure groupings in three levels of density: Low density cabinets run 3.5 to 5 kW; medium density run 5 to 10 kW; high density run 10 to 15 kW. The amount of each rack type to allocate depends on your operation. Generally, data centers operate with about 50% low density cabinets, 35% medium and 15% high density.

If your projected average is more than 1.5 times your existing average, take a closer look at the numbers. This result is fine if you expect a significant density increase, due to new business requirements or increased virtualization onto blade servers. But if there's no apparent reason for such a density growth, re-examine your assumptions.
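To tie the formulas in this section together, here is a minimal Python sketch of the sizing math (the 230 V line voltage in the last example is an assumption for illustration, not a figure from the notes):

```python
# Hypothetical sizing helpers, following "watts = volts x amps x power factor"
# and the kVA-vs-kW UPS ratings discussed above.

def real_power_watts(volts: float, amps: float, power_factor: float) -> float:
    """Real (useful) power available from an AC supply."""
    assert 0.0 < power_factor <= 1.0, "PF is always 1.0 or less"
    return volts * amps * power_factor

def ups_real_load_limit_kw(kva_rating: float, ups_pf: float) -> float:
    """kW a UPS can actually support, given its design power factor."""
    return kva_rating * ups_pf

# A 100 kVA UPS designed for PF 0.8 supports only 80 kW of real load:
print(ups_real_load_limit_kw(100, 0.8))        # 80.0
# A modern PF 0.9 design supports 90 kW from the same 100 kVA:
print(ups_real_load_limit_kw(100, 0.9))        # 90.0
# A 16 A PDU bank at 230 V feeding loads with PF 0.95 delivers ~3.5 kW:
print(round(real_power_watts(230, 16, 0.95)))  # 3496
```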

PUE: Power Usage Effectiveness

PUE is a ratio that describes how efficiently a computer data center uses energy; specifically, how much energy is used by the computing equipment (in contrast to cooling and other overhead).

PUE is the ratio of total amount of energy used by a computer data center facility to the energy delivered to computing equipment. PUE is the inverse of data center infrastructure efficiency (DCIE).

As an example, consider that the PUE of the university's datacenter during 2018 was less than 1.2, while the average Italian data center's PUE is around 2-2.5.

A PUE equal to 2 means that for each watt used for computing, one watt is used for cooling and other overhead.
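In code, the definition and its inverse (DCIE) are one-liners; the sample numbers below just reproduce the figures mentioned above:

```python
# Minimal PUE/DCIE calculator based on the definitions above.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT power."""
    return total_facility_kw / it_equipment_kw

def dcie(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Data Center infrastructure Efficiency: the inverse of PUE."""
    return it_equipment_kw / total_facility_kw

print(pue(200, 100))             # 2.0 -> one watt of overhead per IT watt
print(pue(120, 100))             # 1.2 -> only 0.2 W of overhead per IT watt
print(round(dcie(120, 100), 2))  # 0.83
```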

Fabric

The fabric is the interconnection between nodes inside a datacenter. We can think of this level as a bunch of switches and wires.

We refer to North-South traffic for the traffic outgoing from and incoming to the datacenter (internet), while East-West refers to the internal traffic between servers.

Ethernet

The connection can be realized with various technologies; the most famous is Ethernet, commonly used in Local Area Networks (LAN) and Wide Area Networks (WAN). Ethernet uses twisted-pair and optical-fiber links. Ethernet has some famous features, such as the 48-bit MAC address and the Ethernet frame format, that influenced other networking protocols.

The MTU (Maximum Transmission Unit) goes up to 9 KB with the so-called Jumbo Frames.
On top of Ethernet run the TCP/IP protocols (the de facto standard); they introduce about 70-100 microseconds of latency.

The disadvantage of Ethernet is its low reliability.

Infiniband

Even if Ethernet is the most famous, there are other communication standards. InfiniBand (IB), by Mellanox, is another standard used in high-performance computing (HPC) that features very high throughput and very low latency (about 2 microseconds). InfiniBand is both a protocol and a physical infrastructure, and it can send messages of up to 2 GB with 16 priority levels.
RFC 4391 specifies a method for encapsulating and transmitting IPv4/IPv6 and Address Resolution Protocol (ARP) packets over InfiniBand (IB).

InfiniBand transmits data in packets of up to 4 KB. A message can be:

Pros:

RDMA: Remote Direct Memory Access

RDMA is a direct memory access (really!) from one computer into that of another, without involving either one's OS and bypassing the CPU. This permits high-throughput and low-latency networking. RDMA gains these features because it is not a protocol but an API, hence there is no overhead.

RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system, and by bypassing TCP/IP. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer. The main use case is distributed storage.

Omni-Path

Another communication architecture that exists and is interesting to see is Omni-Path. This architecture is owned by Intel and targets high-performance communication (Omni-Path Wikipedia).
The interesting part is that Intel plans to develop technology based on it that will serve as the on-ramp to exascale computing (a computing system capable of at least one exaFLOPS).

Connectors & plugs

Now we analyse the problem from the connector point of view. The fastest wire technology available is the optical fiber. It can be divided into two categories:

They also have different transceivers. There are two kinds of connectors:

There can be a cable with an LC connector on one side and an SC connector on the other.

Of course, a wire is a wire, and we need something to connect it to somewhere (a transceiver):

(Pictured: RJ45, SFP+ and QSFP+ transceiver modules, LC connector.)

Nowadays we have:

The transceiver module can serve copper or optical fiber; it has a microchip inside and is not cheap.

Software Defined Approach

The Software Defined approach, whether applied to Networking (SDN) or Storage (SDS), is a novel approach to cloud computing.

The software-defined approach abstracts all the infrastructure components (compute, storage, and network) and pools them into aggregated capacity. It separates the control and management functions from the underlying components, moving them to external software, which takes over the control operations to manage the multi-vendor infrastructure components centrally.
This decoupling enables centralizing all data provisioning and management tasks through software external to the infrastructure components.
The software runs on a centralized compute system or a standalone device, called the software-defined controller.

Benefits of software-defined approach:

SDN: Software Defined Networking

SDN is an architecture that aims to be dynamic, manageable and cost-effective (SDN Wikipedia). This type of software creates a virtual network in order to manage the network more simply.

The main concepts are the following:

There is a flow table in the switches that remembers the connections; the routing policies are applied according to this table.
Deep packet inspection is made by a level-7 firewall. The firewall validates the flow, and if it knows that the flow needs bandwidth, it allows the flow to bypass the redirection (through the firewall).
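As a rough illustration of the flow-table idea (a conceptual sketch, not the actual OpenFlow API; field names and actions are invented):

```python
# Conceptual match-action flow table as used in SDN switches.
# On a table miss the data plane asks the (remote) control plane,
# which installs a rule; subsequent packets of the flow hit the cache.

FLOW_TABLE = {
    # (src_ip, dst_ip, dst_port) -> action decided by the controller
    ("10.0.0.5", "10.0.0.9", 443): "forward:port2",
    ("10.0.0.5", "10.0.0.9", 80):  "redirect:firewall",
}

def handle_packet(src_ip, dst_ip, dst_port, controller):
    """Data plane: look up the flow; on a miss, ask the control plane."""
    key = (src_ip, dst_ip, dst_port)
    action = FLOW_TABLE.get(key)
    if action is None:
        action = controller(key)   # controller decides the policy
        FLOW_TABLE[key] = action   # install the rule in the switch
    return action

print(handle_packet("10.0.0.5", "10.0.0.9", 443, controller=lambda k: "drop"))
# -> "forward:port2" (table hit, no controller round-trip needed)
```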

Software-defined data center

Software-defined data center is a sort of upgrade of the previous term, and indicates applying a series of virtualization concepts such as abstraction, pooling and automation to all data center resources and services to achieve IT as a service.

Hyper-convergence

So we virtualize the networking, the storage, the data center... and the cloud! Some tools, such as Nutanix, build the hyper-converged infrastructure (HCI) technology.

Hyper-converged infrastructure combines common datacenter hardware using locally attached storage resources with intelligent software to create flexible building blocks that replace legacy infrastructure consisting of separate servers, storage networks, and storage arrays.

Network topologies

A network topology is a way of cabling that allows multiple computers to communicate. It's not necessarily a complete graph, but for reliability purposes it is often realized as a set of well-connected nodes. At least 10% of the nodes should be randomly connected in order to guarantee sufficient reliability (small-world theory).

At layer 2 there is no routing table (a single broadcast domain), even if there are some caching mechanisms. The topology is more like a tree than a graph, because some edges can be cut while preserving reachability and lowering the costs. In a layer 2 topology computers talk to each other directly; for that reason there is no scalability.
The layer 2 topology is widely used for broadcasting.

At layer 3 there are routing tables; they are kept updated by a third party, the router. The L3 topology is mainly used for point-to-point communication.

Switches do keep routing tables, but they are used just as caches; switches work even without them.

Introduction

Small-world theory

This theory, formulated by Watts and Strogatz, claims that 6 hops connect us with every person in the world.
According to their studies, given two strangers x and y, x can send a message to y just by asking their acquaintances to pass the message to someone closer to y. Hop by hop, the message reaches y going only through friends of friends. On average, this operation needs only 6 steps.

For this reason, a good network topology should take 6 hops on average to connect 2 machines.
Actually, topologically we get more than 6 hops, but by adding 10% of random links across the graph the hop count easily collapses to 6.

Spanning Tree Protocol (STP)

First of all it is necessary to understand the loop problem. A loop is a cycle of links between nodes which creates a "DDoS-like" situation by flooding the network.
The Spanning Tree Protocol is a network protocol that builds a logical loop-free topology for Ethernet networks. Taking a node as root, it builds a spanning tree from the existing topology graph and disables all the edges that are not part of it. The graph is thus totally converted into a tree.

In networking, the spanning tree is built by exchanging Bridge Protocol Data Unit (BPDU) packets.
In 2001 the IEEE introduced Rapid Spanning Tree Protocol (RSTP) that provides significantly faster spanning tree convergence after a topology change.

The advantage of the Spanning Tree Protocol is that when a link is unplugged the network fixes itself in less than a minute, rebuilding a new tree with the edges previously discarded. However, nowadays it is used only in campus networks and not in datacenters: its convergence latency (up to 10-15 seconds to activate a backup line) is not sufficient for an always-on system.
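A minimal sketch of what the protocol computes, with a plain BFS standing in for the real root election and BPDU exchange:

```python
# Sketch: starting from a root switch, keep a tree of the topology
# and disable every other link, so no loop can flood the network.
from collections import deque

def spanning_tree(links, root):
    """links: set of frozenset({a, b}) edges. Returns (active, disabled)."""
    adjacency = {}
    for link in links:
        a, b = tuple(link)
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    visited, active = {root}, set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in adjacency.get(node, []):
            if peer not in visited:
                visited.add(peer)
                active.add(frozenset({node, peer}))
                queue.append(peer)
    return active, links - active   # disabled edges are the loop-breakers

# A triangle of switches has one loop: exactly one link gets disabled.
triangle = {frozenset({"A", "B"}), frozenset({"B", "C"}), frozenset({"A", "C"})}
active, disabled = spanning_tree(triangle, "A")
print(len(active), len(disabled))   # 2 1
```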

Network Chassis

The network chassis is a sort of big, modular, resilient switch. At the bottom it has a pair of power plugs; then it is made of modular line cards (with some kind of ports) and a pair of RPMs (Routing Processor Modules, for redundancy) that make the line cards work. The chassis can be over-provisioned to resist aging, but it has a limit.

Pros

Cons

The chassis is connected with the rack's ToR and BoR (Top/Bottom of Rack) switches via a double link.

Stacking

Some network switches have the ability to be connected to other switches and operate together as a single unit. These configurations are called stacks, and are useful for quickly increasing the capacity of a network.

It's cheaper than the chassis, but there is less redundancy and it cannot be upgraded without losing connectivity.

Three-tier design

Simple architecture consisting of core, aggregation and access switches connected in a hierarchy through pathways. Possible loops in those paths are prevented using the Spanning Tree Protocol, which also provides active-passive redundancy: indeed, the STP tree keeps only a set of active nodes.

However, this type of redundancy leads to inefficient east-west traffic, because devices connected to the same switch port may contend for bandwidth. Moreover, server-to-server communication might require long crossings between layers, causing latency and traffic bottlenecks. Hence, the three-tier design is not good for virtualization, where VMs should be able to move freely between servers without compromises.

Spine and leaf Architecture

Architecture suitable for large datacenters and cloud networks due to its scalability, reliability and better performance. It consists of two layers: the spine layer, which is made of switches that perform routing and that work as the backbone of the network, and the leaf layer, which is made of switches that connect to endpoints such as servers, storage devices, firewalls, load balancers and edge routers. Every leaf switch is interconnected to every spine switch of the network fabric. Using this topology, any server can communicate with any other server with no more than one interconnection switch path between any two leaf switches.

It is highly scalable: if the bandwidth is not enough, simply add an additional spine switch and connect it to all the leaf switches (this also reduces oversubscription, described in the next section); if the ports are not enough, simply add a new leaf switch and connect it to all the spine switches.

Loops are prevented using the Link Aggregation Control Protocol (LACP): it aggregates two different physical links between two devices into a logical point-to-point link. That means that both links can be used to communicate, increasing the bandwidth and gaining active-active redundancy in case of failure of a link (ensuring no loops because each link is a single channel). Hence, the leaf-spine design provides a more stable and reliable network performance.

LACP also provides a method to control the bundling of several physical ports together to form a single logical channel. The first two ports of every switch are reserved to create a link with its twin switch (a loop is created, but the OS is aware of it and avoids it). The next ports are used to create links with leaf nodes. The bandwidth is aggregated (e.g. 2x25 Gbps), but a single flow is still capped at 25 Gbps because traffic goes only one way at a time.

Usually in a spine-and-leaf architecture the NS traffic, which connects the datacenter to the Internet, is slow, while the EW traffic, which is server-to-server and rack-to-rack, is very intensive.

Characteristics:

With this architecture it's possible to turn off one switch, upgrade it and reboot it without compromising the network. Half of the bandwidth is lost in the process, but the twin switch keeps the connection alive.

A typical configuration of the ports and bandwidth of the leaves is:

Just a small remark: with spine and leaf we introduce more hops, and so more latency, than the chassis approach. The solution to this problem is using a huge switch (256 ports) as the base of the spine, which actually acts as a chassis, in order to reduce the number of hops and the latency.

Oversubscription

It is the practice of connecting multiple devices to the same switch port to optimize use. For example, it is particularly useful to connect multiple slower devices to a single port to take advantage of the unused capacity of the port and improve its utilization. However, devices and applications that require high bandwidth should generally connect with a switch port 1-on-1, because multiple devices connected to the same switch port may contend for that port's bandwidth, resulting in poor response time. Hence, significant increases in the use of multi-core CPUs, server virtualization, flash storage, Big Data and cloud computing have driven the requirement for modern networks to have lower oversubscription. For this reason, it is important to keep the oversubscription ratio in mind when designing your fabric.

In a leaf-spine design, this oversubscription is measured as the ratio of downlink ports (to servers/storage) to uplink ports (to spine switches). Modern network designs have oversubscription ratios of 3:1 or less. For example, if you have 20 servers each connected with 10 Gbps downlinks (leaf switches - servers) and 4 x 10 Gbps uplinks (leaf switches - spine switches), you have a 5:1 oversubscription ratio (200 Gbps / 40 Gbps).

Is it possible to achieve a degree of oversubscription equal to 1?
Yes: just link half the ports upwards and half downwards. This is the basis of the full fat tree.
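To make the ratio concrete, a tiny calculator reproducing the example above:

```python
# Oversubscription ratio of a leaf switch: downlink bandwidth (to
# servers/storage) divided by uplink bandwidth (to spine switches).

def oversubscription(downlinks, downlink_gbps, uplinks, uplink_gbps):
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# The example above: 20 servers on 10 Gbps vs 4 x 10 Gbps uplinks -> 5:1.
print(oversubscription(20, 10, 4, 10))   # 5.0  (200 Gbps / 40 Gbps)
# A full fat tree dedicates half of the ports to uplinks -> ratio 1:1.
print(oversubscription(24, 10, 24, 10))  # 1.0
```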

Some considerations about numbers

Let's think about the real world. We have servers with 1 Gbps links (not such a high speed: it's what you can reach with your laptop attached to a classroom cable at the university). We have to connect these servers to each other using switches (each of them has 48 ports). We have lots of servers... and the computation follows.

As we see, we need a lot of bandwidth to manage a lot of services. Even if the north-south traffic (the traffic that goes outside our datacenter) can be relatively small (the university connection exits to the world with 40 Gbps), the east-west traffic (the traffic inside the datacenter) can reach a very large number of Gbps. The Aruba datacenter (called IT1) reaches a bandwidth of 82 Gbps of Internet connection with another Aruba datacenter (IT2).

Full Fat Tree

In this network topology, the links nearer the top of the hierarchy are "fatter" (thicker, meaning higher-bandwidth) than the links further down the hierarchy. It is used only in high performance computing, where performance has priority over budget.

The full fat tree resolves the problem of oversubscription. With spine and leaf there is the risk that the links closer to the spines can't sustain the traffic coming from all the links going from the servers to the leaves. The full fat tree is a way to build a tree so that the capacity is never less than the incoming traffic. It's quite expensive, and for this reason some oversubscription can be accepted.

VLAN

Now, the problem is that every switch can be connected to every other, so there is no more LAN separation in the datacenter: every packet can go wherever it wants, and some problems may appear. VLANs solve this problem by partitioning the broadcast domain and creating isolated computer networks.

A virtual LAN (VLAN) is a virtual network consisting of virtual and/or physical switches, which divides a LAN into smaller logical segments. A VLAN groups the nodes with a common set of functional requirements, independent of the physical location of the nodes. In a multi-tenant cloud environment, the provider typically creates and assigns a separate VLAN to each consumer. This provides a private network and IP address space to a consumer, and ensures isolation from the network traffic of other consumers.

It works by applying tags (from 1 to 4094) to network packets (in the Ethernet frame) and handling these tags in the networking systems.

A switch can be configured to accept some tags on some ports and some other tags on some other ports.

VLANs are useful to manage access control to resources (and to prevent access to a subnetwork from another subnetwork). Different VLANs are usually used for different purposes.
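To make the tagging concrete, here is a sketch of how an IEEE 802.1Q tag is built and inserted into an Ethernet frame (simplified: real frames also need valid MAC addresses, EtherType and checksum):

```python
import struct

# IEEE 802.1Q: a 4-byte tag placed right after the source MAC address.
# The TPID is 0x8100 and the TCI carries priority bits plus a 12-bit
# VLAN ID (valid range 1-4094, matching the tag range described above).

def tag_frame(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    assert 1 <= vlan_id <= 4094
    tci = (priority << 13) | vlan_id          # PCP (3 bits) + DEI + VID
    tag = struct.pack("!HH", 0x8100, tci)
    return frame[:12] + tag + frame[12:]      # after dst MAC + src MAC

untagged = bytes(60)                          # dummy minimal-size frame
tagged = tag_frame(untagged, vlan_id=42)
print(len(untagged), len(tagged))             # 60 64: the tag adds 4 bytes
```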

Switch Anatomy

A switch is an ASIC (Application-Specific Integrated Circuit), with either a proprietary or a non-proprietary architecture. There are two types of switches: Layer 2 and Layer 3. The main difference is the routing function: a Layer 2 switch only deals with MAC addresses, while a Layer 3 switch also cares about IP addresses and manages VLAN and intra-VLAN communications. In both layers there is no loop problem.

Datacenter switches are usually non-blocking: these switches have enough forwarding capacity to support all ports concurrently at full capacity.

Now some standards are trying to impose a common structure on the network elements (switches included) to facilitate the creation of standard orchestration and automation tools.

The internals are made of a control plane, which is configurable, and a data plane, where the ports are and where the actual switching is done. The control plane evolved over the years; now switches run an OS on Intel CPUs. Through a CLI (Command Line Interface) it's possible to configure the control plane. Some example commands are:

Some protocols in the switch (bold ones are important):

ONIE (Open Network Install Environment) boot loader
The switch has a firmware and two slots for OS images. When updating, the old OS image is kept in the first slot and the new one is stored in the second.

NFV, Network Functions Virtualization (5G is mostly NFV-based)
The data plane is connected to a VM in the datacenter which acts as the control plane.

Network topology with firewalls

A firewall can only perform security checks on a flow; it cannot manage the flow itself. Furthermore, it is not possible to route the entire traffic through the firewall, because it would be a bottleneck. For that reason, after the security checks the firewall diverts the flow directly to routers and switches thanks to the OpenFlow API.

Disks and Storage

IOPS: Input/output operations per second is an input/output performance measurement used to characterize computer storage devices (associated with an access pattern: random or sequential).

System Bus Interfaces

Redundancy

RAID stands for Redundant Array of Independent Disks. RAID is done by the disk controller or by the OS.
The most common RAID configurations are:
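Whatever the exact levels listed, the usable-capacity trade-off between them can be sketched in a few lines (a simplification that ignores controller metadata and hot spares; the level semantics are the standard RAID definitions):

```python
# Usable capacity of an array of n identical disks for common RAID levels.

def usable_capacity(n_disks: int, disk_tb: float, level: int) -> float:
    if level == 0:               # striping, no redundancy
        return n_disks * disk_tb
    if level == 1:               # mirroring: half the raw capacity
        return n_disks * disk_tb / 2
    if level == 5:               # striping + 1 disk's worth of parity
        return (n_disks - 1) * disk_tb
    if level == 6:               # striping + 2 disks' worth of parity
        return (n_disks - 2) * disk_tb
    raise ValueError("unsupported RAID level")

print(usable_capacity(8, 4.0, 5))   # 28.0 TB usable out of 32 TB raw
```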

Memory Hierarchy

Tiering is a technology that categorizes data to choose different types of storage media, reducing the total storage cost. Tiered storage policies place the most frequently accessed data on the highest-performing storage. Rarely accessed data goes on low-performance, cheaper storage.

Caches:

Memory tiering:

Storage tiering:

NVMe

It's a protocol on the PCI Express bus and it's totally controller-less. From the software side it is simpler to talk with the disk this way, because the driver is directly attached to the PCI bus; there is no controller, hence lower latency.

A bus is a component to which different devices can be attached. It has a clock and some lanes (16 in PCIe, ~15 GB/s in total because each lane carries slightly less than 1 GB/s). Four drives are enough to exhaust a full PCIe v3 bus. They are also capable of saturating a 100 Gbps link, since an NVMe SSD has a bandwidth of 3.5 GB/s (3.5 x 4 = 14 GB/s, which almost fills the 15 GB/s of the PCIe bus).

NVMe has now almost totally replaced SATA, since the latter uses 2 PCIe lanes and therefore represents the bottleneck given current SSD speeds.
Furthermore, NVMe is often used as the lower tier of the RAM: its speed is only one order of magnitude less than RAM, but it can be very large without any problem. For that reason it represents a valid super-fast cache level for the RAM, and the two started being combined into one single level to implement a big RAM tier, in a way that is totally transparent to the system.

Since the software latency in disk I/O is 5 microseconds more or less, while the TCP/IP software stack introduces a latency of 70-80 microseconds, the disk is no longer the problem. Indeed, the problem is now the network, not only for the latency but also for the bandwidth: 4 NVMe drives totally saturate a 100 Gbps network.
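A quick back-of-the-envelope check of these claims, using only unit conversions (1 GB/s taken as 8 Gbps):

```python
# Four NVMe drives at 3.5 GB/s nearly fill a PCIe v3 x16 bus (~15 GB/s)
# and more than saturate a 100 Gbps network link.

NVME_GBPS_PER_DRIVE = 3.5 * 8          # 3.5 GB/s -> 28 Gbps
PCIE3_X16_GBPS = 15 * 8                # ~15 GB/s -> 120 Gbps
NETWORK_GBPS = 100

drives = 4
total = drives * NVME_GBPS_PER_DRIVE
print(total)                           # 112.0 Gbps of disk bandwidth
print(total > NETWORK_GBPS)            # True: the network is the bottleneck
print(total <= PCIE3_X16_GBPS)         # True: the bus can (barely) keep up
```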

nvDIMM

nvDIMM (non-volatile Dual Inline Memory Module) is used to save energy. It allows changing the amount of current given to each line, which is as much as an SSD needs to write.

Memory power consumption is a problem, because memory usually consumes more current than the CPU; moreover, for the RAM to persist across a reboot it needs to be battery-powered, which is very expensive.
With the advent of SSD and NVMe things changed, since we reach high speed with persistent memory: non-volatile memory does not need power except when performing I/O operations; moreover, data does not need to be refreshed periodically to avoid data loss.

nvDIMM allows putting SSDs on the memory bus, as for the RAM, instead of on the PCIe bus, as for storage.

Misc

Storage aggregation

Actually, the hard drive problem is not the speed but the latency. Having large bandwidth, HDDs are fast on contiguous data, but they have a high latency on sparse data, on which they are very slow.

Latency is due to:

These problems are solved with the storage aggregation technique, a strategy for accessing drives in parallel instead of sequentially.
It is the concept of splitting data between various disks and then picturing the whole system as a single huge drive (the concept of resource pooling in cloud computing).
The strategy for accessing the drives makes the difference.
Fibre Channel is the kind of fabric dedicated to storage. The link coming from the storage ends up in the Host Bus Adapter in the server.

Storage system architectures

Storage system architectures are based on data access methods whose common variants are:

iSCSI: Internet Small Computer Systems Interface, an IP-based storage networking standard for linking data storage facilities. It provides block-level access to storage devices by carrying SCSI commands over a TCP/IP network.

Network Area/Attached Storage (NAS)

NAS is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS systems are networked appliances which contain one or more storage drives, often arranged into logical, redundant storage containers or RAID. They typically provide access to files over the network using file-sharing protocols such as NFS, SMB/CIFS, or AFP.

Basically the whole storage is exposed as a file system. When using a network file system protocol, you are using a NAS.

Storage Area Network (SAN)

A network of compute systems and storage systems is called a Storage Area Network (SAN). A SAN enables the compute systems to access and share storage systems. Sharing improves the utilization of the storage systems. Using a SAN facilitates centralizing storage management, which in turn simplifies and potentially standardizes the management effort.
SANs are classified based on the protocols they support. Common SAN deployment types are Fibre Channel SAN (FC SAN), Internet Protocol SAN (IP SAN), Fibre Channel over Ethernet SAN (FCoE SAN), ATA over Ethernet (AoE) and HyperSCSI. A SAN can be implemented as some controllers attached to some JBODs (Just a Bunch of Disks).

While NAS provides both storage and a file system, SAN provides only block-based storage and leaves file system concerns on the "client" side. However, note that a NAS can be part of a SAN network.

The SAN can be divided into different Logical Unit Numbers (LUNs). A LUN abstracts the identity and internal functions of the storage systems and appears as physical storage to the compute system.

If the drive is seen as physically attached to the machine and a block transmission protocol is adopted, it means that you are using a SAN. The optical fiber has become the bottleneck (just four drives are enough to saturate a link).

With a SAN the server has the impression that the LUN is attached directly to it, locally; with NAS there isn't this kind of abstraction.

HCI - Hyperconvergent Systems

This kind of software is expensive (Nutanix HCI is fully software-defined, so you do not depend on the vendor's hardware).

The main idea is not to design three different systems (compute, networking, storage) and then connect them, but it's better to have a bit of them in each server I deploy. "Adding servers adds capacity".

The software works through the cooperation of controller VMs, one in each node (server). The controller VM implements the storage abstraction across the node, and it also implements the logical movement of data. Every write keeps a copy on the local server storage, exploiting the PCI bus and avoiding the network cap; a copy of the data is also given to the controller of another node. Reads are performed locally, gaining high performance. The VM is aware that there are two copies of the data, so it can exploit this fact. When a drive fails, its copy is used to make another copy of the data. The write operation is a little slower, since the controller needs to wait for the "ack" of the other controllers in order to keep replicas of the written data on other nodes (synchronous replication).
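The write path just described can be sketched as follows (class and method names are invented for illustration; this is not the Nutanix API):

```python
# Conceptual HCI node: write locally (PCI bus, no network cap), then
# wait for the remote controller's ack before confirming the write
# (synchronous replication). Reads are always served locally.

class HciController:
    def __init__(self, local_store, peer=None):
        self.local_store = local_store     # dict standing in for local disks
        self.peer = peer                   # controller VM on another node

    def write(self, key, value):
        self.local_store[key] = value      # local copy first
        if self.peer is not None:
            self.peer.replicate(key, value)  # sync replica: wait for ack
        return "ack"                       # slower than a plain local write

    def replicate(self, key, value):
        self.local_store[key] = value      # remote copy for fault tolerance

    def read(self, key):
        return self.local_store[key]       # reads never touch the network

node_b = HciController({})
node_a = HciController({}, peer=node_b)
node_a.write("block42", b"data")
print(node_b.read("block42"))              # the replica survives node A
```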

SDS - Software Defined Storage

Software-defined storage is a term for computer data storage software for policy-based provisioning and management of data storage, independent of the underlying hardware. This type of software includes storage virtualization to separate the storage hardware from the software that manages it.
It's used to build a distributed system that provides storage services. It uses an object-based storage architecture (objectID, metadata, binary data).

Non-RAID drive architectures

Other architectures also exist and are used when RAID is too expensive or not required.

Some considerations about Flash Drives

The bottleneck in new drives is the connector. The SATA connector is too slow to use an SSD at its maximum speed. Some results can be seen here.

The solution? Remove the connector and attach the drive to PCIe. So a new specification is used, NVMe, an open logical-device interface specification for accessing non-volatile storage media attached via a PCI Express bus.

Storage in the future


As we can see in the image, it had been decades since the last mainstream memory technology update. In fact, SSDs became popular only in recent years due to their cost, but they have existed since 1989.

A new technology was introduced in 2015, 3D XPoint. Does this improvement take the ICT world into a new phase? If yesterday our problem was the disk latency, so we designed all algorithms to reduce I/O operations, now the disk is almost as fast as the DRAM, as shown in the following image:

Servers

Servers are really different from desktops; the only common part is the CPU instruction set.
For instance, servers have ECC memory, with an Error Correction Code built in.

Racks are divided in Units: 1U is the minimal size you can allocate in a rack. Generally a 2-meter rack has 42 Units.

Types of compute systems

Form-factors

In a standard 1U server (aka Pizza Box), the bottom part is composed of:

while the front (top) part contains:

Typically the maximum number of CPUs is four, and they are close to the memory modules.

This differs from desktop systems.

Misc

Trade-off in CPU design: high frequency vs. many cores. It all depends on the application running: it can benefit from high frequency or not (big data systems are more about capacity than latency).

Latency is slightly higher when accessing a RAM bank of another socket, because the request has to go over the bus that interconnects the sockets (UPI in Intel CPUs).

Inside the core there are some functional units such as the branch-misprediction unit and FMA (Fused Multiply-Add). Each core has a dedicated cache at L1 and a shared cache at L2.

SMART technology in drives: a predictive system in the drive that gives the probability that the drive will fail in the next hours. Used by the drive vendor for statistics and usage patterns.

Cloud

Cloud is a business model. The cloud is someone else's computer that you can use (paying) to execute your application with more reliability features than your laptop (e.g. paying to test your app on the cloud infrastructure because you need more resources). The interaction to obtain cloud resources should be "self-service" as much as possible.
When you program for the cloud, you don't know where your process will be executed or where your data will be stored.

Cloud is a collection of network-accessible IT resources:

One of the main concepts of cloud computing is pooling: a set of heterogeneous resources can be viewed as one big resource, providing reassignment capability and location independence (the client cannot control where his data is, except maybe for the geographical area). Another important concept is resource measurement: the cloud computing business model revolves around pricing and resource consumption, so the system must be able to monitor it.

Cloud computing benefits are:

There is a trade-off between centralization (the bottleneck is the storage) and distribution (the bottleneck is the network).

Rapid Elasticity: consumers can adapt to variations in workload and maintain the required performance levels. This also permits reducing costs by avoiding overprovisioning.

High Availability: the cloud provides high availability. This feature is achieved with redundancy of resources to avoid system failures. A load balancer is used to distribute requests among all the resources, avoiding failures due to resource saturation on some machines.

The cloud infrastructure can be public, if it is provisioned for open use by the general public, or private, if it is provisioned for exclusive use by a single organization comprising multiple consumers.

Cloud computing Layers

The cloud infrastructure can be seen as a layered infrastructure.

Cross functional layers

In the cloud computing reference model there are some silos of cross-layer functionalities; they mainly revolve around:

Physical Layer

The physical layer comprises compute, storage and network resources, and executes both provider and consumer software. A compute system can be shared between consumers or dedicated; typically providers use virtualisation and offer compute systems in the form of virtual machines. There are several software components deployed on compute systems:

A compute system may physically be a tower, rack or blade system.

The storage system is the repository for saving and retrieving data. Cloud storage provides massive scalability and long-term data retention. Virtualisation is used by cloud providers to create storage pools shared among consumers.

Independently of the storage devices, there are several data access methods, as seen previously:

The network system must be reliable and secure, it enables data transfer and sharing of IT resources between nodes across geographic regions. The network system enables several kinds of communications:

Virtual Layer

Deployed on the physical layer, it abstracts physical resources, including storage and network, and makes them appear as virtual resources, and it executes the requests generated by the control layer. It permits a better use of the hardware when services underuse it. With VMs there is a ~10% performance loss, but we gain flexibility and security.

Benefits of virtualization:

This allows a multi-tenant environment, since multiple organizations' VMs can run on the same server.

The key concept is virtualisation: it enables a single hardware resource to support multiple concurrent systems, or vice versa. It is composed of 3 entities: virtualisation software, resource pool and virtual resources. Virtualisation applies to all the resources provided by the physical layer, hence we have compute virtualisation, storage virtualisation and network virtualisation. Hypervisors enable compute virtualisation: a hypervisor is software that enables a physical compute system to run multiple OSs concurrently. For every OS there is a virtual machine manager on top of the hypervisor kernel.

The network virtualisation software enables the creation of virtual LANs, virtual SANs and virtual switches. The storage virtualisation software allows the creation of virtual volumes, virtual disk files and virtual arrays.

The resource pool is an abstraction of aggregated computing resources, such as processing power, memory capacity and network bandwidth. Cloud services obtain computing resources from resource pools. LUNs in a SAN are an example of storage pooling. The combination of virtualisation software and resource pool makes a virtual machine, which, like a physical system, runs an OS and applications.

VM files are managed using the hypervisor's native file system or a shared file system that enables a VM to use NAS devices, for example. The VM console is an interface used to manage and monitor a VM, and it can be local or remote. It can also be used for configuration, reboot, etc. A VM template is a "standard first version" of a VM that can be specialised; every VM can be converted into a template.

VM Network components

VM networks comprise virtual switches, virtual NICs, and uplink NICs that are created on a physical compute system running a hypervisor.


Virtual networks can be:

There exists a mapping between VSANs and VLANs to determine which VLAN carries a VSAN's traffic.

As anticipated, LUNs are logical components of a SAN (storage pool), and their capacity can be dynamically changed. LUNs created from a storage pool can be of 2 categories:

VM components

The hypervisor is responsible for running multiple VMs. Since we want to execute the x86 ISA on an x86 server, there is no need to translate the code. A hypervisor permits overbooking physical resources, allocating more resources than actually exist, and it also creates a virtual switch to distribute networking across all VMs.

Hypervisor types:

Types of virtualization

Types of virtualization:

Virtual Machine (VM)

Each virtual machine is a set of discrete configuration files containing the values answering the questions: how much memory, how much disk, where is the disk file, how many CPU cores. Examples of those files are:

The disk is virtualized using a file, while for the network there is a vNIC (virtual Network Interface Card) connected to a vSwitch, communicating with the physical NIC. The vNIC is used also by the host OS, because its physical NIC is busy serving the vSwitch.

The virtual disk is a file of fixed size or dynamically expanding. The vOS can be shared among the VMs and stored elsewhere than in the vdisk file. Each write goes to the vdisk (all write operations can be undone), while each read first looks in the "file" where the vOS is, then in the vdisk file if the previous check wasn't successful. The virtual disk can also be frozen, and the file extended with the software to be added, which also makes rollback possible. This file abstraction of the disk also makes a copy-on-write mechanism possible: the same portion of file can be used to store an operating system, and then only one virtual disk file is created containing the differences between each virtual machine and the original disk (more or less like image layering in Docker).
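A minimal sketch of this copy-on-write idea, with dicts standing in for the shared base image and the per-VM diff file:

```python
# Copy-on-write virtual disk: reads fall through to a shared, frozen
# base image unless the block was overwritten; writes only touch the
# per-VM diff, so rollback means discarding the diff.

class CowDisk:
    def __init__(self, base_image: dict):
        self.base = base_image    # shared OS image, never modified
        self.diff = {}            # per-VM overlay of written blocks

    def write(self, block: int, data: bytes):
        self.diff[block] = data   # the base stays untouched

    def read(self, block: int) -> bytes:
        if block in self.diff:    # overwritten block: serve the overlay
            return self.diff[block]
        return self.base.get(block, b"\x00")   # otherwise the base image

    def rollback(self):
        self.diff.clear()         # undo all writes since the snapshot

base = {0: b"bootloader", 1: b"kernel"}
vm1, vm2 = CowDisk(base), CowDisk(base)   # two VMs, one shared image
vm1.write(1, b"patched-kernel")
print(vm1.read(1), vm2.read(1))           # b'patched-kernel' b'kernel'
```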

The virtual CPU masks the features of a CPU to a VM. vCPUs can be overbooked, up to twice the number of physical cores. The CPU has several rings of protection (user ... nested vOS, vOS, OS).

vRAM ballooning

It's not allowed to use virtual memory as vRAM, because the sum of the vRAM should be less than or equal to the actual RAM. Fragmentation could be a problem if there is a lot of unused reserved memory. To achieve this, a technique called ballooning has been introduced.
The VM is told: "Look, you have 1 TB of RAM, but most of it is occupied." In this way we have dynamically expanding blocks of RAM: if the OS needs memory, the balloon can be deflated by moving the occupancy threshold.

Docker

It exploits Linux's control groups (cgroups). The processes in a container can see only a part of the OS. Containers have to share the networking. Docker separates different software stacks on a single node.

Control Layer

This layer includes software and tools for managing and controlling the underlying infrastructure. The control layer can be deployed on top of the virtual or the physical layer, and it receives requests from the service and orchestration layers: the control layer provisions the required resources to fulfil these requests. Together with the virtualisation layer, the control layer provides a unified view of all the resources of the cloud infrastructure, and enables resource pooling and dynamic allocation of resources.

There exist two types of control layer:

Another (newer) approach to abstract the underlying infrastructure components is the software-defined approach. With this approach it is possible to have an aggregated view of resources; it enables rapid provisioning and provides a mechanism to apply policies across the infrastructure. A software-defined approach, like the unified manager, offers APIs to controllers, enabling them to request and access resources as if they were services. An approach like the one described improves business agility, brings lower CAPEX (specialized hardware is not necessary) and provides a scale-out architecture.

The main duty of the control layer is resource management, which can be relative or absolute. In the first case, the resource allocation for a service is defined proportionally relative to the resources allocated to other service instances; in the second case, the allocation is defined on the basis of quantitative (lower and upper) bounds.

Every component has its resource management techniques.

Summarizing, control software:

Key phases for provisioning resources

Thin provisioning

This is a virtualization technology that gives the appearance of having more physical resources than are actually available. Thin provisioning allows space to be easily allocated to servers, on a just-enough and just-in-time basis. Thin provisioning is called "sparse volumes" in some contexts.
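A sketch of the allocate-on-first-write behaviour behind thin provisioning (names are invented for illustration; real systems track extents and reclaim freed space, which is omitted here):

```python
# Thin provisioning: a volume advertises more capacity than is
# physically backed; real blocks are taken from a shared pool only
# on first write, on a just-enough and just-in-time basis.

class ThinVolume:
    def __init__(self, advertised_gb: int, physical_pool: list):
        self.advertised_gb = advertised_gb   # what the server sees
        self.pool = physical_pool            # shared free physical blocks
        self.mapping = {}                    # virtual block -> physical block

    def write(self, virtual_block: int, data: bytes):
        if virtual_block not in self.mapping:
            if not self.pool:                # the risk of overcommitting
                raise RuntimeError("physical pool exhausted")
            self.mapping[virtual_block] = self.pool.pop()  # just in time
        # ... write `data` to the mapped physical block ...

pool = list(range(100))                      # 100 real blocks available
vol = ThinVolume(advertised_gb=1000, physical_pool=pool)  # looks 10x bigger
vol.write(7, b"hello")
print(len(vol.mapping), len(pool))           # 1 99: allocated on demand
```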

OpenStack

A free and open-source software platform for cloud computing, mostly deployed as infrastructure-as-a-service (IaaS), whereby virtual servers and other resources are made available to customers.

A good idea but a bad implementation: various open-source pieces of software, difficult to deploy, lots of dead code, bad security implementation. It has a small form of orchestration, but it's not a service orchestrator (i.e. no distribution of the workload, no scaling).

Service layer

What is a cloud service?
Cloud services are IT resources that are packaged by the service providers and offered to the consumers. Once the constituent IT resources are provisioned and configured, a service is instantiated.

Service layer

The service layer has three key functions which are as follows:

The service catalogue is no more than a menu of services, listing the services available along with their attributes, costs, and terms and conditions of use. For each service the following are specified: name, description, features and options (billing, subscription terms, etc.), price, and a reference to the SLA. Usually a client chooses a service from the catalogue, fills in a form, agrees to the terms of use and gets the service. The consumer can interact with the rented services using cloud interfaces: with the management interface he can control his use of a rented service, and with the functional interface he can use the services. The functions provided by the two interfaces change according to the rented service (IaaS, PaaS, SaaS).

The service catalogue and cloud interfaces are exposed to consumers through a cloud portal. The cloud portal is also used by cloud administrators to manage the whole infrastructure and the lifecycle of cloud services. The portal also acts as an intermediary with the orchestration layer: it routes service requests to the orchestration layer and shows responses to the consumers.

THERE IS A MISSING PART ABOUT STANDARDS FOR THE COMMUNICATION BETWEEN SERVICES: TOSCA, REST, SOAP ETC... IT CAN BE FOUND INTO THE SLIDES. TODO: SUMMARIZE AND ADD THOSE PARTS TO THIS NOTES

Orchestration layer

Automated arrangement, coordination, and management of various system or component functions in a cloud infrastructure to provide and manage cloud services.

Cloud portal

A cloud portal is an access (usually web-based) point to a cloud, which provides access to the service catalog, and facilitates self-service provisioning and ongoing access to the cloud interfaces. A cloud portal is also accessed by the cloud administrators to manage cloud infrastructure and the lifecycle of cloud services.

Once a service provisioning or management request is placed in the cloud portal, the portal routes the request to the orchestration layer where appropriate workflows are triggered to fulfill the request. The orchestration layer is the automation engine of the cloud infrastructure, which defines standardized workflows for process automation. The workflows help orchestrating the execution of various system functions across the cloud infrastructure to fulfill the request.

Orchestration types

Two different types of orchestration:

Orchestration APIs

APIs are used to perform activities such as:

Example of orchestration workflows

(Two example workflows: a DB2 instance request and a CRM instance request.)

Service orchestration

Service orchestration provides several benefits:

Although some manual steps (performed by cloud administrators) may be required while processing the service provisioning and management functions, service providers are looking to automate these functions as much as possible.

Cloud service providers typically deploy a purpose-designed orchestration software or orchestrator that orchestrates the execution of various system functions. The orchestrator programmatically integrates and sequences various system functions into automated workflows for executing higher-level service provisioning and management functions provided by the cloud portal. The orchestration workflows are not only meant for fulfilling requests from consumers but also for administering cloud infrastructure, such as adding resources to a resource pool, handling service-related issues, scheduling a backup for a service, billing, and reporting.

Business Continuity layer

Business continuity is a set of processes that includes all activities that a business must perform to mitigate the impact of service outage. Business continuity entails preparing for, responding to, and recovering from a system outage that adversely affects business operations. It describes the processes and procedures a service provider establishes to ensure that essential functions can continue during and after a disaster. Business continuity prevents interruption of mission-critical services, and reestablishes the impacted services as swiftly and smoothly as possible by using an automated process. Business continuity involves proactive measures, such as business impact analysis, risk assessment, building resilient IT infrastructure, deploying data protection solutions (backup and replication). It also involves reactive countermeasures, such as disaster recovery, to be invoked in the event of a service failure. Disaster recovery (DR) is the coordinated process of restoring IT infrastructure, including data that is required to support ongoing cloud services, after a natural or human-induced disaster occurs.

Single point of failure

A single point of failure refers to any individual component or aspect of an infrastructure whose failure can make the entire system or service unavailable. Single points of failure may occur at the infrastructure component level and at the site (data center) level.

Methods to avoid Single Points of Failure:

Redundancy

Redundancy is a technique used to avoid single points of failure. N+1 redundancy is a common fault tolerance mechanism that ensures service availability in the event of a component failure: a set of N components has at least one standby component. This is typically implemented as an active/passive arrangement, since the additional component does not actively participate in service operations; the standby component becomes active only if one of the active components fails. N+1 redundancy with an active/active configuration is also available: in that case the "standby" component takes part in service operations even when all other components are fully functional. For example, if an active/active configuration is implemented at the site level, a cloud service is fully deployed in both sites and the load is balanced between them. If one of the sites goes down, the surviving site manages the service operations and the workload.

Be careful with active/passive configurations: when the active system fails, the "passive" part may fail immediately as well, because no checks have ever been executed on it.

Key techniques to protect compute:

Key techniques to protect network connectivity:

Key techniques to protect storage:

Service Availability Zones

A service availability zone is a location with its own set of resources, isolated from other zones so that a failure in one zone does not impact other zones. A zone can be part of a data center or even comprise a whole data center. This provides redundant cloud computing facilities on which applications or services can be deployed.

Service providers typically deploy multiple zones within a data center (to run multiple instances of a service), so that if one of the zones incurs an outage, the service can be failed over to another zone. They also deploy multiple zones across geographically dispersed data centers, so that the service can survive even a data-center-level failure. There should also be a mechanism that allows seamless (automated) failover of services running in one zone to another.

Live Migration of a VM

Moving a VM from server A to server B (from hypervisor A to hypervisor B) while it's running. The user may experience a degradation of the service, but not a disruption.

In a VM live migration the entire active state of a VM is moved from one hypervisor to another. The state information includes memory contents and all other information that identifies the VM. This method involves copying the contents of VM memory from the source hypervisor to the target and then transferring the control of the VM’s disk files to the target hypervisor. Next, the VM is suspended on the source hypervisor, and the VM is resumed on the target hypervisor. Because the virtual disks of the VMs are not migrated, this technique requires that both source and target hypervisors have access to the same storage. Performing VM live migration requires a high speed network connection. It is important to ensure that even after the migration, the VM network identity and network connections are preserved.

Live migration summary:

The whole process is a bit easier if both hypervisors use shared storage.
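A minimal sketch of the iterative pre-copy scheme described above, as a toy Python simulation; the page model and the `page_dirtier` hook are invented for illustration and are not a real hypervisor API:

```python
import random

# Toy simulation of iterative pre-copy live migration: pages are copied while
# the "VM" keeps dirtying some of them; once the dirty set is small enough the
# VM is briefly suspended for the final copy.

def live_migrate(memory_src, page_dirtier, max_rounds=10, threshold=4):
    memory_dst = {}
    dirty = set(memory_src)                      # round 1: copy every page
    rounds = 0
    while len(dirty) > threshold and rounds < max_rounds:
        for page in dirty:
            memory_dst[page] = memory_src[page]  # copy while the VM runs
        dirty = page_dirtier(memory_src)         # pages re-dirtied meanwhile
        rounds += 1
    # Stop-and-copy: the VM is suspended here, the few remaining dirty pages
    # and the CPU/device state are moved, then it resumes on the target
    # (virtual disks stay on the shared storage, so they are not copied).
    for page in dirty:
        memory_dst[page] = memory_src[page]
    return memory_dst

mem = {i: f"page-{i}" for i in range(64)}
dirtier = lambda m: {random.randrange(64) for _ in range(8)}  # ~8 writes/round
assert live_migrate(mem, dirtier) == mem
```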

Server Setup Checklist

Share user identities so they are not replicated on each server:

Backups

A backup is an additional copy of data, created and retained for the sole purpose of recovering lost data. Backup and recovery operations need to be automated due to their large workload. To back up data we need:

Backups, together with replicas, are data protection solutions. This task becomes more challenging with the growth of data, reduced IT budgets, and less time available for taking backups. Moreover, service providers need fast backup and recovery of data to meet their service level agreements. The amount of data loss and downtime that a business can endure, expressed as RPO and RTO, are the primary considerations in selecting and implementing a specific backup strategy.

RPO (Recovery Point Objective) and RTO (Recovery Time Objective) are the key metrics when selecting and implementing a backup strategy: e.g. an RPO of 1 hour means at most 1 hour of data may be lost, while an RTO of 2 hours means the service must be restored within 2 hours.

The network is the first problem when making a backup, because the backup size can be bigger than what the network bandwidth can move in the available time. Sometimes it's simply impossible to make a backup.

Backup types:

Backup as a Service: service providers offer backup as a service that enables an organization to reduce its backup management overhead. It also enables the individual consumer to perform backup and recovery anytime, from anywhere, using a network connection.

Backup window: beware of the horizon effect: you decide a retention window, but the data you need will always turn out to be in the part that has already been deleted.

Data Deduplication: the process of detecting and identifying the unique data segments (chunks) within a given set of data to eliminate redundancy. Deduplication significantly reduces the amount of data to be backed up in a cloud environment, where typically a large number of VMs are deployed. Take the hashes of two identical files: store only one of the two files plus both hashes. If the same file is required in two contexts, it is stored once and served to both.
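A toy sketch of hash-based, chunk-level deduplication; `CHUNK_SIZE`, the in-memory `store` and the "recipe" format are assumptions for illustration, not how any specific product works:

```python
import hashlib

# Toy chunk-level deduplication: each unique chunk is stored once, keyed by
# its hash; a backup is just the list of hashes needed to rebuild the data.

CHUNK_SIZE = 4096
store = {}                                # hash -> chunk bytes, stored once

def backup(data: bytes) -> list:
    """Return the 'recipe' (chunk hashes) that rebuilds the data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)        # identical chunks stored only once
        recipe.append(h)
    return recipe

def restore(recipe) -> bytes:
    return b"".join(store[h] for h in recipe)

data = b"the same file" * 1000
r1, r2 = backup(data), backup(data)       # second backup adds no new chunks
assert restore(r1) == data
```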

Replication: the process of creating an exact copy of data with the aim of ensuring availability of services. Replication can be local or remote.

There are also advanced replication solutions such as CDP (Continuous Data Protection): data changes are continuously captured and stored in a separate location, which allows restoring data to any previous point in time. CDP can be remote or local.

CDP is made of:

TODO: INSERT IMAGE OF CDP, CAN BE TAKEN FROM SLIDES

There are providers that offer DRaaS (Disaster Recovery as a Service): in case of a disaster in the consumer's production data center, its VMs are brought up in the DRaaS provider's data center.

We have talked about infrastructure resiliency, but resiliency is a desirable property at the application level too. At the application level there are the following techniques:

Security layer (TODO: complete)

The fundamental requirements of information security and compliance pertain to both non-cloud and cloud infrastructure management. In both the environments, there are some common security requirements. However, in a cloud environment there are important additional factors, which a service provider must consider, that arise from information ownership, responsibility and accountability for information security, and the cloud infrastructure’s multi-tenancy characteristic. Therefore, providing secure multi-tenancy is a key requirement for building a cloud infrastructure.

There are some well-known threats in the cloud environment. They are classified as follows:

Countermeasures include firewalls, antivirus software, and standard procedures that direct the safe execution of operations.

Levels of security

Above is a list of common threats that consumers and providers must consider when handling cloud services. More specifically, security mechanisms must be deployed for each component of a cloud infrastructure:

Security as a Service: delegate the security enhancement of your system to a third party. It allows consumers to reduce CAPEX and the security management burden.

Another important concept in (cloud) security is GRC: Governance, Risk and Compliance. Governance determines who/what has the authority to define policies and strategies. Risk management is the process of identifying, assessing, mitigating and monitoring risks. Compliance is the act of adhering, and demonstrating adherence, to laws, regulations and internal policies.

Auditing is a monitoring mechanism used to evaluate the effectiveness of policies and of their enforcement mechanisms.

Access Control Lists are difficult to manage with lots of users.
PAM (Linux, Pluggable Authentication Modules): few systems use ACLs via PAM.

Auditing: the activity of checking that system security is working properly. Keep monitoring user interactions with a resource; get an alert when something suspicious occurs.

LEAST PRIVILEGE PRINCIPLE: every user must be able to access only the information and resources that are necessary for their legitimate purpose.

right != privilege
The first is given to you by someone; the second is possessed by you simply because of who you are.
In Windows you (the admin) can take ownership, but you can't give it. No one logs in as SYSTEM (the Windows counterpart of the Linux root). A SID in Windows is unique for the entire system. (See sysprep, Sysinternals, Process Explorer.)

Firewall

Service Management layer

Cloud service management has a service-based focus, meaning that the management functions are linked to the service requirements and the service level agreement (SLA). Be aware of regulations and legal constraints that define how to run a system. Is this system behaving according to the regulations? Recap that information processors (cloud providers) are responsible for the data they process.

SLA (Service Level Agreement): the legal contract that you, as a customer, sign with the provider, defining what you are paying for.

Service availability = 1 - (downtime / agreed service time)
Uptime is difficult to define and to test, because the reachability of the cloud also depends on the network service providers.
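As a worked example (assuming a 30-day month as the agreed service time): an SLA of 99.9% availability allows downtime = (1 - 0.999) x 30 x 24 x 60 ≈ 43 minutes per month, while 99.99% allows only about 4.3 minutes.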

Service Operation management

Service operation management is crucial: it keeps the whole thing running.
It maintains the cloud infrastructure and the deployed services, ensuring that services and service levels are delivered as committed. Ideally, service operation management should be automated:

Service levels cover not only functional requirements.

Ensure charge-back (pay per use) and show-back ("I exhausted the resources, so I need more"): make good use of the money spent on hardware and people, and measure how efficiently you are spending it.

TCO (Total Cost of Ownership): estimates the full lifecycle cost of owning service assets. The cost includes capital expenditure (CAPEX), such as procurement and deployment costs of hardware, and ongoing operational expenditure (OPEX), such as power, cooling, facility, and administration costs.

ROI (Return On Investment): reducing risk is a kind of ROI

CAPEX (CAPital EXpenses): buy something. It's a one-time cost, e.g. procurement and deployment costs of hardware.

OPEX (OPerational EXpenses): use something. It's a recurrent cost, e.g. power, cooling, facility, and administration costs.

Capacity Planning/Management

Capacity Planning/Management: make forecasts to find out when we will exhaust the resources and how many resources we will really need. Ensure that the cloud infrastructure is able to meet the required capacity demands for cloud services in a cost-effective and timely manner.

Common Methods to Maximize Capacity Utilization:

Monitoring: collecting data (in a respectful way). Availability Monitoring, Capacity Monitoring, Performance Monitoring, Security Monitoring.

Monitoring benefits:

Examples of Performance-related Changes:

Keep track of things, processes, servers, configurations so that you can roll back.

Incident/Problem Management

Identify the impact of a failure on all the other services.
Return cloud services to consumers as quickly as possible when unplanned events, called "incidents", cause an interruption to services or degrade service quality.

Prevent incidents that share common symptoms or, more importantly, root causes from reoccurring, and minimize the adverse impact of incidents that cannot be prevented.

Examples

Examples of Business Continuity Solutions

GDPR (General Data Protection Regulation)

About the protection of personal data. What is personal data? E.g. student ID (matricola), email, phone number: anything that uniquely identifies you.

GDPR applies to both digital and non-digital information.

If you, as an individual, are damaged by a bad use of your personal data, you can complain to the data controller and get compensated.

Vendor Lock-in

The cloud introduces some problems; one of them is vendor lock-in. It appears when I write software against a vendor API that does not respect any standard. If I then want to change cloud provider, I need to modify the code (good luck!).

Even in open source there is vendor lock-in, due to the difficulty of moving from a dependency on one piece of software to another. To mitigate vendor lock-in you should rely on multiple software stacks and vendors.

Standardization-Portability

It's rare that a leading vendor defines a common standard. Standardization is important but often not feasible, and it only partly avoids lock-in. "The only thing that can be standardized is the VM." Every platform tends to have its own API. REST is the de facto standard working in the cloud today.

Misc

Greenfield installation: format and configure everything from scratch, as opposed to a brownfield installation, where the network already exists (routers, hosts, ...) and I have to maintain support for legacy stuff while integrating the new technology. A greenfield installation is typically used when no infrastructure exists yet and an organization has to build the cloud infrastructure starting from the physical layer.

Software licenses: a bound on the number of VMs that can run that software.

It's acceptable that some users experience performance issues during an upgrade.

Procedures are really important: knowing the procedure and applying it can avoid loss of data, users, and money.

Erasure coding: like RAID 5 (XOR parity), with n drives of data and k drives of parity information, as sketched below.
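A minimal sketch of the XOR idea behind RAID-5-style parity; real erasure coding (e.g. Reed-Solomon) generalizes this to k parity drives, while this toy only handles a single lost block:

```python
from functools import reduce

# Toy RAID-5-style parity: one XOR parity block over n data blocks lets you
# rebuild any single lost block.

def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]     # n = 3 data blocks
parity = xor_blocks(data)              # the "RAID 5" parity block

# Lose data[1] and rebuild it from the survivors plus the parity:
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```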

In class exercises

1) Spine and leaf VS traditional architecture

Question

Discuss the difference between a spine-and-leaf fabric and the more traditional fabric architecture based on larger chassis. How are bandwidth and latency affected?

Solution

Spine and Leaf

Non-modular, fixed switches are interconnected with some MLAG (Multi-chassis Link Aggregation). It is a loosely coupled form of aggregation: the two switches are independent and share some form of aggregation. Each leaf is connected to all the spines (if the leaf has 6 upward ports, 2 are used to connect the two coupled switches in the leaf, the others are used for the connection to the spines). At least 2 spines are needed for redundancy. The spines are not connected to each other.
LACP: protocol allowing multiple links to be bound into a single conceptual link (link aggregation, active-active).
Over-subscription: the links to the spine should be able to sustain the traffic coming from all the links below. This is not a problem for E/W traffic between servers attached to the same switch (because the link to the spine is not involved).
Pros:

It became popular after 10 Gbps Ethernet; before that it was difficult to use with 4/8/16 ports per server. Different VLANs are used.

Traditional Chassis
Typically two modular chassis connected by two links (STP) in active-passive (the second chassis comes up only when the first isn't working). The ratio between the number of ports and the bandwidth is completely different from spine and leaf. Link aggregation is possible but not convenient.
Pros:

Today it is not used much, because it's difficult to design a backplane offering terabits.

Latency
With spine and leaf we introduce more hops, hence more latency, than with the chassis approach. The solution to this problem is using a huge switch (256 ports) as the base of the spine, which effectively acts as a chassis, reducing the number of hops and the latency.

Bandwidth
To enlarge the bandwidth in a spine-and-leaf architecture we only need to add a new spine and connect it to all the leaves. With the chassis approach we can add bandwidth by adding new line cards (new switches) to the chassis, provided there are free slots.
In the spine-and-leaf architecture we can upgrade a spine at the cost of temporarily reduced bandwidth, but still without disrupting connectivity. In the traditional chassis an upgrade degrades the bandwidth => TODO: verify.

2) Orchestration layer

Question

What actions can the orchestration layer of a cloud system take, and based on what information, in order to decide how many web server instances should be used to serve a Web system?

Solution

Assume the DB is distributed and has infinite capacity, because typically the bottleneck is the web server.

An orchestrator can:

Based on:
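A minimal sketch of such a scaling decision; the metric names, thresholds and clamping bounds below are invented for illustration, and real orchestrators expose such hooks differently:

```python
# Hypothetical sketch of an orchestrator's scaling decision for a web tier.

def desired_web_instances(current, avg_cpu, avg_latency_ms,
                          cpu_high=0.75, cpu_low=0.25, latency_slo_ms=200,
                          min_n=2, max_n=20):
    """Return how many web server instances the orchestrator should run."""
    if avg_cpu > cpu_high or avg_latency_ms > latency_slo_ms:
        current += 1                                   # scale out: SLO at risk
    elif avg_cpu < cpu_low and avg_latency_ms < latency_slo_ms / 2:
        current -= 1                                   # scale in: resources wasted
    return max(min_n, min(max_n, current))

print(desired_web_instances(4, avg_cpu=0.82, avg_latency_ms=250))  # -> 5
```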

3) Datacenter architecture

Question

Discuss a datacenter architecture made of 10 racks, assuming a power distribution of 15 kW/rack.

Solution

Use an in-row cooling approach, trying to reduce the number of rows to be cooled. Do not forget to mention the PDUs and the UPS (2 plugs per rack, 32 A each).

Some calculations:

  1. Calculate the amount of current per rack:
    • 15000 W / 380 V = ~40 A per rack
  2. Each rack has 40 A, so assuming that it contains 42 servers we have:
    • 380 V * (40 A / 42) = ~360 W per server (slightly less than 300 W is required for the CPUs alone)
  3. Calculate the amount of current on the PDU:
    • 40 A * 10 = 400 A for the racks
    • assuming a PUE of 1.2, the total is 400 A * 1.2 = 480 A (including cooling and conversion losses)
  4. Dimension the UPS:
    • Assuming that in case of PDU issues you want to keep alive only half of the racks, you can buy a UPS capable of delivering 240 A

NB. We have not considered the power factor, which is a number equal to 1.0 or less. Reactance, introduced when converting AC to DC, reduces the useful power (watts) available from the apparent power; the ratio of these two numbers is called the power factor (PF).

Solution 2

t1 and t2 are AC/DC transformers
ATS: Automatic Transfer Switch

15 kW/rack x 10 racks = 150 kW to deliver towards our DC

150 kW / 380 V = ~400 A (at least)

Assuming a PUE of 1.2, we need:

400 Ampere x 1.2 = 480 Ampere (including cooling and AC/DC power loss)

So, every UPS/PDU pair must manage 480 A at full load.

According to Facebook, by eliminating centralized UPSs/PDUs you can bring total power-distribution losses down to about 7%.
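A tiny script reproducing the sizing arithmetic above (single-phase approximation, power factor ignored as in the notes; the notes round 395 A and 474 A up to 400 A and 480 A):

```python
racks = 10
kw_per_rack = 15
volts = 380
pue = 1.2

total_kw = racks * kw_per_rack                # 150 kW towards the DC
it_amps = total_kw * 1000 / volts             # ~395 A for the IT load alone
total_amps = it_amps * pue                    # ~474 A incl. cooling and losses

print(f"{total_kw} kW -> {it_amps:.0f} A IT, {total_amps:.0f} A with PUE {pue}")
```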

4) SAN VS Hyperconvergent architecture

Question

A service requires a sustained throughput towards the storage of 15 GB/s. Would you recommend using a SAN architecture or a hyperconvergent one?

Solution

SAN (Storage Area Network) recap
iSCSI: SCSI over an IP network (while Fibre Channel carries SCSI over fiber); it allows mounting blocks/disks remotely.
Block-based access: you mount a chunk of bytes seen as a drive.
LUNs (Logical UNits) can be replicated, compressed, and overbooked.
Servers and drives are separated; drives are pooled together.

SAN today is struggling because the bandwidth of the drives saturates the links, making it impossible to pool a lot of drives.

NAS
It instead uses a network file system protocol to access the pooled resources (CIFS, NFS). We access files, not blocks.
NAS gives you the file system; with SAN I decide the FS.
Security in a SAN is bound to the compute OS, which decides the authentication domain.
NAS, instead, has the responsibility for both the security and the filesystem abstraction (Active Directory and NFS security).

Both SAN and NAS separate the storage from the compute: configure one system for all the storage needs (backup, compression, ...) and look at it as blocks or files.
This architecture is failing because the throughput of the drives (very fast) saturates the link.

HCI (hyperconvergent)
Before, we talked about three independent units: compute, storage and network. With HCI, instead, we have boxes (servers) each with a little bit of network, storage and compute.

It's not true that compute and storage are completely unrelated and can be fully separated: CPUs too have their own limits in data processing, even if large (risk of wasting resources).

HCI, e.g. by Nutanix, lets you add a bit of storage, a bit of compute and a bit of network simply by buying a server. You pay as you grow and lower the risk, since you don't have to do capacity planning up front.

Discussion
The choice also depends on the kind of data I expect to process (assume at least one: sensor data, bank financial data, ...). For example, HCI is not convenient if I want to do archiving, because I would pay for extra unused CPU.

It's not enough to say "I take 5 big drives", because their bandwidth can be a bottleneck.

SAN could be the good solution because it's cheaper. SAN can be used with tiering: in the first tier I keep SSD "buffers", in the second tier mechanical drives. If I keep a buffer of 1 TB I'll have about 6 minutes to copy the buffered data down to the mechanical drives.
Assuming 24 Gbps of incoming bandwidth and 1 TB of SSD buffer:
24 Gbps = 3 GB/s --> 1000 GB / 3 GB/s ≈ 330 s to saturate the buffer.
Next to the buffer there are mechanical drives (130-150 MB/s).
I write to the SSD at 3000 MB/s but drain to the drive (assuming just 1) at 150 MB/s, so the net incoming bandwidth into the buffer is 3000 - 150 = 2850 MB/s.
With one mechanical drive I will saturate the buffer in 1000 GB / 2.85 GB/s ≈ 350 s ≈ 6 min.
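The same tiering arithmetic as a small sketch (the 3 GB/s ingress and the single 150 MB/s mechanical drive are the exercise's assumptions):

```python
# Time until a 1 TB SSD buffer fills, given 3 GB/s of ingress and a single
# 150 MB/s mechanical drive draining it.

buffer_gb = 1000            # 1 TB SSD tier
ingress_gb_s = 24 / 8       # 24 Gbps = 3 GB/s
drain_gb_s = 0.15           # one mechanical drive at 150 MB/s

print(buffer_gb / ingress_gb_s)                 # ~333 s with no draining at all
print(buffer_gb / (ingress_gb_s - drain_gb_s))  # ~351 s ~= 6 min with one drive
```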

If I consider the text of the exercise, in particular "towards" in the sense of "only writing", imagining having to almost exclusively archive data and read only from time to time, I can actually consider SAN, because if I go hyperconvergent I am also paying for CPU that might stay unused. If instead I have a balance between reads and writes and want good throughput for both, or I have a peak followed by a flatter period with little activity, then going hyperconvergent may be the better choice.

What should I look for..

5) Dimension a hyperconvergent system

Question

A service requires a sustained throughput towards the storage of 15 GB/s. How would you dimension a hyperconvergent system to ensure it works properly?

Solution

Look first at the network (the fabric is the glue of the infrastructure). You can't have 100 Gbps straight to a single server in a spine-and-leaf fabric, so I have to consider distributing the load (hyperconvergent).

Recap that:

First I have to choose the Ethernet bandwidth among 10, 25, 50, 100 and 400 Gbps, considering that 400 Gbps is achievable only on the spine, not on the leaves.
Better 10 Gbps or 25 Gbps, depending on CAPEX.
With spine and leaf I get 50 Gbps per node, because each link is doubled (active-active): 2 x 25 Gbps.

Some calculations
We have 15 GB/s incoming bandwidth --> 15 * 8 = 120 Gbps
We first dimension a spine and leaf architecture to sustain this bandwidth value.
We have a couple of options:

To cover 120 Gbps we need at least 5 nodes --> 5*25 = 125 Gbps

We could also add more (up to 8-10) nodes to have redundancy and efficiency, but we will consider 5 in the calculations

Every HCI node will have some SSD (as buffer) and some mechanical drives.

Since we have 120 Gbps in total, each node will receive 120 / 5 = 24 Gbps of storage bandwidth.
This is fine, since the link to each node is 25 Gbps (and with the active-active configuration the actual bandwidth is 50 Gbps).

Now we must consider the number of drives in each node. The drive throughput must sustain the incoming bandwidth of 24 Gbps to avoid data loss: 24 Gbps / 8 = 3 GB/s per node. Since SSD drives have a bandwidth of about 500 MB/s (half a GB/s), each node needs 3 / 0.5 = 6 SSDs.

Remember that links are never fully utilized because of some overhead (e.g. to connect two spine nodes together).
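A short sketch that reproduces the dimensioning above (25 Gbps links and ~500 MB/s SSDs, as assumed in the notes):

```python
import math

# Nodes needed to absorb the ingress bandwidth, then SSDs per node to
# sustain each node's share.

ingress_gbps = 15 * 8                         # 15 GB/s -> 120 Gbps
link_gbps = 25                                # per-node link (doubled by active-active)
ssd_gb_s = 0.5                                # ~500 MB/s per SSD

nodes = math.ceil(ingress_gbps / link_gbps)   # 5 nodes
node_gb_s = ingress_gbps / nodes / 8          # 24 Gbps -> 3 GB/s per node
ssds = math.ceil(node_gb_s / ssd_gb_s)        # 6 SSDs per node
print(nodes, node_gb_s, ssds)
```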

Other questions

@megantosh

My exam was a variation of the above questions:

Design a data center that should contain 40 Racks, consuming 15 kW each. Discuss all necessary considerations, e.g. Power Distribution, Cabling, Cooling.

Drawing a scheme with the components (PDU etc.) was appreciated, as was discussing firefighting and cooling using natural resources (water from the ocean etc.). Show that you can do the math: 15000 W = 380 V * A * cos(phi), where cos(phi) is the power factor, accounting for the losses introduced when converting AC into DC current.

1024 servers need to be connected with any of the following switch options: 48 x 25 Gbps for East/West traffic with 6 x 100 Gbps for North/South (and two other configs to choose from). The oversubscription level can be up to 1/6. Which network topology would you choose? Discuss its pros and cons.

Solution 1

I gave a couple, but he was more interested in seeing one discussed in thorough detail: not reciting theory, but being able to apply the knowledge from the section above.
For 1024 servers, 48 x 25 Gbps, oversubscription 1/6, I went for the spine/leaf model, and he wanted to know how many switches would be required in that case (do not forget that redundancy causes doubling, and that two of the 6 uplinks are gone for connecting the two paired switches together).

Solution 2

E/W : 48 x 25 = 1200 Gbps

N/S : 6 x 100 = 600 Gbps

1200 / 600 = 2:1 Oversubscription (Good)

How many switches should I buy?

1024 / 48 = 21.33 → at least 22 switches (leaves)

I don't have enough space to put every switch in a single rack (ToR); can I re-organize the infrastructure in order to save space?

Yes, you can "merge" two switches physically by switch aggregation and put them in one rack (ToR). To aggregate two switches, we need to connect them with 2 cables (for resiliency). This means we lose 2 ports on each side, so a new oversubscription ratio must be computed.

If we imagine two of the given switches put together, we get:

E/W : (2 x 48 ports) x 25 Gbps = 96 ports x 25 Gbps = 2400 Gbps

N/S : [2 x (6 - 2) ports] x 100 Gbps = 8 ports x 100 Gbps = 800 Gbps

2400 / 800 = 3:1 Oversubscription (Still good)
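A small sketch reproducing these switch-count and oversubscription figures (the 2-port MLAG peer link per switch is the assumption made above):

```python
import math

# Leaves for 1024 servers with 48x25G down and 6x100G up, and oversubscription
# before/after pairing switches via MLAG (2 uplink ports lost per pair member).

servers = 1024
down_ports, down_gbps = 48, 25
up_ports, up_gbps = 6, 100

leaves = math.ceil(servers / down_ports)                  # 22 leaf switches
ratio = (down_ports * down_gbps) / (up_ports * up_gbps)   # 1200/600 = 2:1

paired = (2 * down_ports * down_gbps) / (2 * (up_ports - 2) * up_gbps)  # 2400/800
print(leaves, f"{ratio:g}:1", f"{paired:g}:1")            # 22, 2:1, 3:1
```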

How does live migration of a VM happen and would you prefer to do it over HCI or SAN?

As long as no detail is provided, you are allowed to make your own assumptions.
I said both should be effectively the same, given that the bandwidth is not saturated and that we are using the latest/fastest technology. A crucial part of VM migration (apart from copying config files, moving virtual registers and halting the old machine for a millisecond) is that copying the files happens in the background. If a file is required to complete a process and it has still not been migrated, the new VM goes and fetches it first from the old VM before continuing to copy any other file*

Discuss the role (functions) of the orchestration layer. Give an example workflow. Where does it lie in the cloud stack?

Check the slides for sure, they are very helpful! The example I gave was provisioning an Alexa skill on AWS, which requires building a Lambda function (an AWS service) and a skill controller. I think he was happy to have a real-life example. Draw the workflow like a business process, from the point of provisioning to the billing etc.

@giacomodeliberali

What is a virtual firewall? Where will you put it? And how many?

TODO: answer

About numbers

Current

Fabric

Disk and Storage

Real Use Cases

Open Source

In 2011 Facebook announced the Open Compute Project (OCP), an organization that shares designs of data center products among companies, including Facebook, IBM, Intel, Nokia, Google, Microsoft and many others.

Their mission is to design and enable the delivery of the most efficient server, storage and data center hardware designs for scalable computing.

Books & Guides

References

Contributors

Giacomo De Liberali 📖
Frioli Leonardo 📖
Alessandro Pagiaro 📖
LorenzoBellomo 📖
Mohamed Megahed 📖
Aldo D'Aquino 📖
Andrea Bruno 📖
bongi23 📖

This project follows the all-contributors specification. Contributions of any kind welcome!