The goal of resource virtualization is to create a layer of abstraction between the actual physical hardware providing resources (power, disk space, CPU cycles, RAM, connectivity, etc.) and the logical or semantic activities which consume those resources (web service, e-mail, ssh, etc.). This abstraction could make it easier to offer more reliable and powerful services because of advantages in redundancy, flexibility, and service isolation.
There are a handful of different technologies available to us to attempt to put together something that makes sense. The only problem is that these technologies are all over the place, occurring on dizzying layers of the network and application stack. What are these technologies, what do they do, and how do they compare with each other? Which ones are mature enough to use in a production environment?
Here we will try to lay out the different technologies, explain what they provide, explore their viability, and compare their relative strengths and weaknesses to hopefully settle on a set of implementations that make our lives better through resource virtualization.
Some of us got together in person to set up a lab to implement and explore these technologies using a handful of networked PCs. Starting from a minimal stock install of Debian lenny (testing), we experimented with configuring these machines as servers and clients to test various resource virtualization strategies. Here you can find detailed reports of these findings.
Networking Disk Devices
AOE stands for ATA over Ethernet; it's a way to replace your IDE cable with an Ethernet network, effectively decoupling the storage from the computer and gaining the flexibility of Ethernet. AOE is a low-level network protocol (registered Ethernet type 0x88A2), lower-level than TCP/IP. In fact this is a distinguishing feature compared with iSCSI, which transports its I/O over the higher-level TCP/IP and as a result takes on the overhead of that protocol stack (the computer has to work harder to ensure reliable transmission of the data over the network). Although iSCSI lets you avoid expensive Fibre Channel equipment, people often find themselves needing to purchase TCP offload engine (TOE) cards to take the burden of doing TCP/IP off the machines themselves. This is a waste, because iSCSI isn't typically done over the Internet, so what happens if we get rid of the whole TCP/IP overhead? (It is also worth noting that NBD runs over TCP/IP.) We get AOE!
So a benefit of AOE is its very minimal overhead, especially when coupled with a quality Ethernet switch that can maximize throughput and minimize collisions through integrity checking and packet ordering. If you want your block devices available over the Internet, or you need to route your devices, you will likely not want to use AOE, but should look into iSCSI (or NBD) instead. Although it is possible to encapsulate AOE in various ways to transmit and route its packets, it is probably better to use something designed for that purpose rather than adding a layer of complexity on top of AOE to achieve what something else does natively. AOE packets are not routable, the devices themselves don't have IP addresses, and there is no way to "firewall" off devices; instead, with AOE you need to put the devices on isolated networks. Using port-based VLANs is a good way to partition off the broadcast domain.
AOE is also relatively inexpensive: all you typically need are some dual-port Gigabit Ethernet cards, a Gigabit Ethernet switch, and your disks.
AOE basically just embeds in Ethernet frames the ATA commands for a particular drive, or the responses from the drive itself. In Linux, AOE is available as a module which does the translation to make the remote disks available as actual local block devices (e.g. /dev/etherd/e0.0) and handles any retransmissions that are necessary. AOE also has a device discovery mechanism to query the network for available exported devices.
AOE is a remarkably simple protocol: its specification is only 8 pages long!
Terminology: AOE has two terms, shelf and blade. A "shelf" is a piece of hardware that contains blades; for example, a 3U shelf might contain a set of 8 blades. The blades themselves are small computers that speak AOE to get their ATA disks onto the Ethernet. Typically you would stripe data over the blades in a shelf, yielding a single device with throughput comparable to a local ATA disk.
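As a concrete illustration, here is a minimal sketch of exporting and attaching an AOE device using the vblade and aoetools packages. The disk, interface, and shelf/slot numbers are placeholders, not a recommendation:

```shell
# On the exporting machine (the "blade"): serve local disk /dev/sdb
# on interface eth0 as shelf 0, slot 0 (vblade package).
vbladed 0 0 eth0 /dev/sdb

# On the client: load the aoe driver and look for exported devices.
modprobe aoe
aoe-discover          # broadcast a query for AOE targets (aoetools package)
aoe-stat              # list discovered devices, e.g. e0.0

# The remote disk now appears as an ordinary local block device.
mkfs.ext3 /dev/etherd/e0.0
mount /dev/etherd/e0.0 /mnt
```

Note that both machines must share the same Ethernet broadcast domain, since AOE frames are not routed.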
NBD stands for Network Block Device; it's a kernel module that can use a remote server as one of its block devices. On the local system there is something like a /dev/nd0 which, when accessed, sends a request to the remote server over TCP, which replies with the data.
How is this different from NFS? The primary difference is that NBD operates at the block device level (it exports block device semantics over the network), while NFS operates at the filesystem level (it exports filesystem semantics over the network). Unlike NFS, it is possible to put any filesystem on an NBD device. Also, if someone has mounted an NBD device read/write, you must make sure that nobody else has it mounted; NBD will not do this for you.
What are the limitations of NBD? It requires user-land programs to access it, although it has been put into an initrd so it can be used as a root filesystem.
The difference between NBD and AOE is that AOE is a network protocol that runs directly over Ethernet, while NBD is not a network protocol of its own: it is analogous to the aoe driver (though not to the AOE protocol), and it uses TCP/IP as the transport for its information and data.
Status: NBD was included in the Linux kernel in 2.1.101.
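To make this concrete, here is a hedged sketch of a basic NBD export and attach using the nbd-server and nbd-client userland tools. The hostname, port, and paths are placeholders, and the old port-number invocation style is assumed:

```shell
# On the server: create a backing file and export it on TCP port 2000.
dd if=/dev/zero of=/export/nbd0.img bs=1M count=1024
nbd-server 2000 /export/nbd0.img

# On the client: load the driver and attach the remote export.
modprobe nbd
nbd-client server.example.org 2000 /dev/nbd0

# Use it like any local block device -- but remember NBD will not
# stop a second client from mounting it read/write at the same time.
mkfs.ext3 /dev/nbd0
mount /dev/nbd0 /mnt

# Detach when done.
umount /mnt
nbd-client -d /dev/nbd0
```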
ENBD is the Enhanced Network Block Device. It was an industry-funded academic project to make NBD better. ENBD has a few features over NBD. According to the website, the chief differences are that ENBD uses block-journaled multichannel communications; there is internal failover and automatic balancing between the channels; the client and server daemons restart, authenticate, and reconnect after dying or loss of contact; and the code can be compiled to run over SSL channels:
- user-space networking, combined with a new multichannel self-balancing asynchronous architecture in the kernel driver, and
- automatic restart, authentication, reconnect and recovery by the user space daemons, and now (in enbd post-2.4.27) …
- support for remote ioctls, and …
- support for removable media such as CD-ROMs and floppies as the remote resource, and …
- support for partitioning on NBD devices, and …
- support for intelligent embedded RAID1 mirroring
The Debian package indicates that ENBD is an alternative to kernel NBD, designed for RAID over NBD: "it is especially designed to provide an efficient and robust device, which enables you to easily run RAID over NBD"
Need details on this.
DRBD (Distributed Replicated Block Device) is a network block device designed to be mirrored and redundant: writes to the local device are replicated over the network to a peer host, typically for high-availability setups.
Need details on this.
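Pending those details, here is a rough sketch of what a two-node DRBD resource looks like in /etc/drbd.conf. The hostnames, disks, and addresses are invented, and DRBD 8.x syntax is assumed:

```
resource r0 {
  protocol C;                     # synchronous replication: a write completes
                                  # only after both nodes have it
  on alpha {
    device    /dev/drbd0;         # the mirrored device applications use
    disk      /dev/sda7;          # local backing storage
    address   192.168.1.10:7788;  # replication link
    meta-disk internal;
  }
  on bravo {
    device    /dev/drbd0;
    disk      /dev/sda7;
    address   192.168.1.11:7788;
    meta-disk internal;
  }
}
```

Applications then use /dev/drbd0 on whichever node is primary, while DRBD keeps the peer's copy in sync.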
Networking Disk Hardware
- Fibre Channel
iSCSI runs its I/O over TCP/IP. This saves money because you can avoid purchasing Fibre Channel devices and interfaces, and instead use relatively inexpensive and flexible Ethernet to transport the data. Unfortunately, because iSCSI uses TCP/IP it requires a lot of processing by the host machine's CPUs, so many people end up purchasing expensive TCP offload engines to do the TCP/IP work on a dedicated hardware card. The savings gained from not having to buy Fibre Channel are effectively lost on the TOE cards. The benefit of iSCSI is that you can do it over the Internet, but this cuts both ways: if you just need the packets to go to a machine connected via a switch, it's total overkill and a waste of resources.
iSCSI is a complicated protocol: its specification is 257 pages long! It has a lot of extra features, such as encryption, routing, ACLs, etc.
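For a sense of what the initiator side looks like in practice, here is a sketch using the open-iscsi tools. The portal address and target IQN are illustrative placeholders:

```shell
# Ask a portal what targets it offers (open-iscsi package).
iscsiadm -m discovery -t sendtargets -p 192.168.1.20

# Log in to a discovered target; the IQN below is made up.
iscsiadm -m node -T iqn.2008-01.com.example:storage.disk1 \
         -p 192.168.1.20 --login

# The LUN then shows up as a normal SCSI disk (e.g. /dev/sdb),
# ready for partitioning and mkfs like any local drive.
```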
AFS, OpenAFS, Coda, Intermezzo, Lustre
It seems as if the lineage of these filesystems followed this path: AFS → Coda → InterMezzo → Lustre → GlusterFS. Need details on these.
OpenAFS appears to still be under active development, as new versions make it into Debian regularly. OpenAFS has several interesting properties:
- global namespace (/afs/cellname/ is accessible from any AFS-capable client)
- krb5-based authentication and privacy
- possible to have redundant/distributed read-only volumes
- string-based user identities (as opposed to numeric UIDs)
The codebase seems to be huge, with a lot of legacy parts in it. It's not clear how well audited it is.
Though Coda and InterMezzo both ostensibly derive from AFS, neither of them seems to have attracted a following proportional to AFS's.
dkg has set up demonstration OpenAFS cells, but doesn’t use them regularly.
GlusterFS is a giant, growable, redundant filesystem.
Not really sure what this is, but it is probably only a locking/sync mechanism; the website is jargony: "most notably new write-back, read-ahead, and stat-prefetch translators taking advantage of a new asynchronous communication framework". Huh? Maybe it depends on an NFS'd drive?
PVFS is sponsored by NASA.
Cluster aware FS, need info
Sun’s filesystem, need info
NFSv3 — tried and true, but with an atrocious security model. The server basically trusts the client machine to correctly report the identity of the user.
NFSv4 — very new; claims to be able to use krb5-based authentication, privacy, and integrity checks. Like OpenAFS, it also offers string-based user identities, though they're mapped locally to numeric UIDs and GIDs. Has anyone set up an NFSv4 domain?
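For reference, a minimal sketch of what an NFSv4 export with krb5 security looks like. The paths and hostname are placeholders, and a working Kerberos realm plus krb5-capable nfs-utils are assumed:

```
# /etc/exports on the server: fsid=0 marks the NFSv4 pseudo-root;
# sec=krb5p requests authentication, integrity, and privacy.
/export        *(rw,sync,fsid=0,sec=krb5p)
/export/home   *(rw,sync,sec=krb5p)

# On the client: mount the pseudo-root with matching security.
#   mount -t nfs4 -o sec=krb5p server.example.org:/ /mnt
```

With sec=krb5p, users authenticate via Kerberos tickets rather than the client machine's word, addressing the NFSv3 security model complained about above.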
S3 is Amazon’s supposedly unlimited storage service, typically paired with their Elastic Compute Cloud (EC2). If you read the fine print, they do not have unlimited machines, and they request that you get hold of them if you need more than 10 machines. Maybe when you get that big, Amazon is no longer cost-effective and you should be getting a rack. The connection between EC2 and S3 is apparently only 10 Mbit/sec! On an EC2 machine, you get 160 GB of local disk and pay $0.10 per hour of uptime per host, which is about $72/month… not that cheap for a virtual server (and that doesn’t include bandwidth, which is $0.20/GB, i.e. roughly $35/Mbit/sec at the 95th percentile… I think we get $20/Mbit/sec). The machines are basically a single 1.7 GHz x86 CPU, 1.75 GB of RAM, 160 GB of disk, and 250 Mbit/sec of network bandwidth.
Google Bigtable / Google File System (GFS)
Google’s filesystem is what they use internally for all of their data services. It’s not available publicly, and likely won’t ever be, as it has been designed for their specific environment. It’s interesting nonetheless, especially the immutable on-disk table structures and multi-dimensional time slices.
Networking other resources
RDMA is Remote Direct Memory Access; it allows data to move directly from the memory of one computer into that of another without involving either one’s operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. RDMA relies on a special approach to DMA. I think you need special RDMA interconnects (i.e. InfiniBand and other hardware; it’s a NIC feature: unbuffered, with no CPU, cache, or context switches involved). Multicasting is somehow involved.
- Mad scientist lab experiment number 1: ATA-over-Ethernet (AOE)