Tahoe

This document describes some experiments with Tahoe, an interesting distributed filesystem.

Overview of how Tahoe works

Tahoe is known as the “Least Authority File System”, hey that sounds pretty good! What does that mean? Well, Tahoe is essentially a secure, decentralized, fault-tolerant filesystem. This filesystem is encrypted and spread over multiple peers in such a way that it remains available even when some of the peers are unavailable, malfunctioning, or malicious. The one-page summary explains the unique properties of this system. You may also be interested in a PDF which provides a more in-depth description of how the system works.

There is one ‘introducer’, and then there are nodes that contact the introducer to find out about other nodes. Nodes reach the introducer through a secret URL stored in a file called introducer.furl; once your node has that information it will contact the introducer, which will then make it part of our private Tahoe grid. The introducer is not a storage node, unless you set it up separately to be one.

There are 10 shares generated per file that you upload, and with the default encoding any 3 of them are enough to reconstruct the file. Ideally, each storage server should get an equal portion of those ten.
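These encoding parameters are configurable per client. A minimal sketch of the relevant tahoe.cfg section, showing what I understand to be the defaults:

[client]
shares.needed = 3   # how many shares are required to rebuild a file
shares.happy = 7    # how many shares must be placed for an upload to count as successful
shares.total = 10   # how many shares are generated per file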

Known Limitations

There are currently some known limitations to Tahoe, but the biggest issues have been resolved in version 1.4.

Another interesting thing to note is that any file put onto the grid before your node joined will not be redistributed to your node once it comes online. In other words, new nodes don't get existing shares pushed out to them. So we have to make sure the grid has enough nodes online for the fault tolerance we want before putting up files we need to rely on. Ideally there would be a way to replicate shares to a node when it comes online; such things have been discussed a lot, and previous projects that are ancestors of Tahoe tried to implement it in various ways. The upstream Tahoe developers currently feel that this needs to be handled by the user, or by a layer of automation above Tahoe itself, since Tahoe has no way of knowing how reliable each server is.

So, assuming a user or some higher layer of automation decides it is time to refresh the share distribution of a certain file, there is an easy (if crude) way to do it: download the file and re-upload it. A more efficient way to accomplish the same thing is in the works and is due to be released in Tahoe 1.3.0.
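For example, a crude way to do that refresh from the command line (a sketch; the capability after ‘tahoe get’ is a placeholder for a real one):

tahoe get URI:CHK:... /tmp/refresh-me
tahoe put /tmp/refresh-me   # note the capability this prints and use it from now on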

It is also inadvisable to run a node and an introducer out of the same directory, because each of them stores its public/private keypair there and they would conflict. This only matters if you are running an introducer.
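If you do run your own introducer, give it a directory of its own, for example (a sketch with a made-up path):

tahoe create-introducer /srv/tahoe-introducer
tahoe start /srv/tahoe-introducer

The introducer.furl to hand out to the other nodes appears inside that directory once it has started.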

To get started

There’s no need to edit your sources.list, since tahoe-lafs is available in Debian stable (wheezy). Just:

apt-get install tahoe-lafs

Now create the directory you want to use to provide storage space to the grid, and create the Tahoe client in it:

dir=~/.tahoe   # the rest of this page assumes ~/.tahoe
mkdir "$dir" &&
tahoe create-client "$dir"

Now edit .tahoe/tahoe.cfg to put your nickname in the file, and set the advertised addresses (advertised_ip_addresses, or tub.location in newer releases) to what you want other nodes to use to reach you.

Now download the attached introducer.furl and put it in that directory.

NOTE: This introducer.furl is just a test grid, and will likely get destroyed at some point as we figure this out more. There are a couple nodes on this test grid which are only accessible over a riseup VPN link, and at least one public node. It is anticipated that our grid will be completely transported over the intra-collective VPN.
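For reference, after those edits the relevant parts of tahoe.cfg might look roughly like the sketch below. The nickname, addresses, port, and fURL are made-up placeholders; depending on the Tahoe version, the introducer fURL lives either in the separate introducer.furl file described above or in the [client] section, and the advertised address may be called tub.location rather than advertised_ip_addresses.

[node]
nickname = my-collective-node
tub.port = 8098
tub.location = 203.0.113.5:8098

[client]
introducer.furl = pb://...@203.0.113.1:58086/introducer

[storage]
enabled = true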

Now start it up:

tahoe start ~/.tahoe

If this Tahoe node is on your local machine, you can point your web browser at the Tahoe WUI (web UI). If you are running a Mozilla-based browser you may get an error:

Port Restricted for Security Reasons

You can get around that in about:config, typically by adding the WUI’s port number to network.security.ports.banned.override.

Now you should be able to see the current state of the grid, how many storage nodes exist, etc.

Usage

Need to write more, but here are a few things you can experiment with (try creating a file in the WUI and in the CLI, for example):

allmydata.org/source/tahoe/trunk/docs/u...
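For example, a few CLI commands to play with (a sketch; demo.txt is just a placeholder file):

tahoe create-alias tahoe                    # create a private root directory and name it "tahoe:"
tahoe mkdir tahoe:backups                   # make a subdirectory
tahoe put demo.txt tahoe:backups/demo.txt   # upload a file into it
tahoe ls tahoe:backups                      # list the directory
tahoe get tahoe:backups/demo.txt /tmp/demo.txt   # fetch the file back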

Application to backups

There is an experimental plugin for duplicity which has been sent to the “tahoe-dev” mailing list. With this, and garbage collection implemented in version 1.4, Tahoe really looks like the proper tool for our backups.
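I haven’t tried it, but assuming the plugin registers a tahoe:// backend for duplicity (the alias and paths below are made up), using it would presumably look something like:

tahoe create-alias backups
duplicity full /etc tahoe://backups/etc
duplicity restore tahoe://backups/etc /tmp/etc-restored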

 

i’d like to experiment with this with a few other autonomous sites/groups.
who?

 
 

tachanka.org? we should talk about this….

 
 

we’d be very much interested in experimenting with this as well. sounds very interesting.

cybrigade.org

 
 

I’ve been working with upstream on getting these packages into debian/ubuntu, hopefully soon! I also hope to update this info with more up-to-date stuff as I go through things again.

 
 

While you’re at it, could you update the text about refreshing the shares of a given file? You say it is to be released in 1.3.0, but now there is 1.4.0. I don’t know if it’s implemented, or if there’s another way in the meantime?

And besides that, when using a tahoe-grid for backups, it seems to be relatively slow. But I wonder if someone could confirm or deny this.

 
 

I came across XTreemFS (http://www.xtreemfs.org) and that looks promising too. At first sight one advantage is that you have a FUSE mountpoint and the handling looks easy.
A good CLI and the possibility to have a filesystem mountpoint would be important for me.
Personally I find Tahoe a bit complicated; the web UI is not very intuitive.
Btw: how did we come to Tahoe? Was there a decision between different solutions? That would be good to know, because I am searching for a similar solution, and if someone has sorted out other approaches, I could save some time :)

So, maybe we could check this too? I will give it a try.

 
 

tostado: the people in #tahoe-lafs on irc.freenode.org are very helpful and constantly look at how other similar solutions handle their problems etc. So if you’re giving it a try, don’t hesitate to ask politely there.

 
 

thanks, i will give this a try

 
 

tostado: I don’t think we came to tahoe through any decision at all. we were just spending some time looking for backup solutions that we could share, and this one came up, so we started looking at it. i also find the web UI very complicated and not intuitive, however the rest seems very simple… just a directory that you mount and that is it.

i’m very interested in comparisons between different possibilities, for sure. i’m also interested in revisiting this, as I have put it on hold for some time.

 
 

Autistici.org has a test Tahoe cluster that we were thinking of using as a sort of distributed live filesystem (a few minor patches were enough to get their FUSE client up and running as well).

I just had a doubt about the cluster being ‘public’ anyway, i.e.: knowing the introducer url (easily findable from the web interface), even if I can’t access the existing contents, I can cause resource exhaustion by simply uploading new contents until storage is full. I hope I’m wrong on this…

 
 

Has anyone benchmarked this thing to see if it’s usable at all?

 
 

i’d like to get back into this… I think it could be an interesting way for all of us tech collectives to share backup space in a distributed/encrypted way. i started to work on the packaging for debian, but then I burnt out. i wonder if we can all work together on something?

 
 

Micah: the reason why I was looking at a slightly different use case (exploiting the rest api over http, as a storage service) instead of backups is the sheer amount of data required: a full set of our backups is around 100-200G, and with the default Tahoe redundancy values (n=10) that’s really a lot of data. It could still be useful to backup a selected sub-set of data though (configuration, other kinds of small-but-important data).

We were instead trying to use it as an internal, highly-available filesystem, meant to reuse the bunch of virtual machines we have lying around: the encryption scheme would allow us to store sensitive data even in unsafe environments, and Tahoe cluster management is trivial. The REST API can then be used to access the storage “service” from machines that do not participate directly in the Tahoe cluster (as I wasn’t comfortable with running the whole Tahoe stack on, say, front-end servers)…
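As an illustration of that kind of use, a minimal sketch of talking to the Tahoe web API with curl (assuming a node whose WUI listens on 127.0.0.1:3456; the file name and capability are placeholders):

curl -T report.pdf http://127.0.0.1:3456/uri   # PUT a new immutable file; the response body is its capability
curl http://127.0.0.1:3456/uri/URI:CHK:...     # fetch the file back later using that capability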

I have some silly code to share at code.autistici.org/svn/tahoe-client/tru... — it’s a simple class exposing a filesystem-like interface to the Tahoe REST api (note: it’s also dns-round-robin-aware)

By the way, the “blackmatch.py” file in that directory is the only FUSE client I was able to run successfully; it has a couple of tiny fixes with respect to the version in the tahoe tar.gz …

 
 

I read more documentation, and it is indeed pretty easy to tune the storage overhead factor for backups: (shares.total / shares.needed) == 2 would be roughly what we have now anyway.
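In tahoe.cfg terms, that 2x overhead would look something like this on the uploading client (a sketch):

[client]
shares.needed = 5
shares.total = 10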

 
 

an update… due to the work of bertagaz, I was able to sponsor the upload of tahoe packages and their dependencies to debian. i’ve also got squeeze-backported pieces available in the riseup repository if people are interested in trying them.

 
 

micah, could you maybe do a squeeze package for tahoe?

 
   

I have a squeeze backport of 1.8.2-1 available in the riseup debian apt archive:

deb http://deb.riseup.net/debian squeeze main

There is a newer 1.9 version in sid now, and I expect a newer version to become available really soon. Once that is uploaded to Debian, I’ll backport a newer version.