Webdav storage

From wikidev.net

Random thoughts about HTTP PUT/WebDAV/DeltaV-based storage with the following aims:

  • symmetrical distributed storage that uses many simple machines and their disks; not even RAID is necessary
  • client-side caching, RAM/disk-based
  • transparent read and write failover, with broken machines being the norm rather than the rare case
  • robust protocol with potential to use it over WANs

The idea is to move the bulk of information (current/old revisions and images) to distributed storage without sacrificing the flexibility and performance of metadata in the DB.


Fast reading, transparent failover

Squid runs on each Apache node for read caching, load balancing and transparent read failover. Each Squid accesses three or more storage nodes per resource group; these nodes form a kind of round-robin RAID. Resource groups are split by a hash of the title or similar. The list of groups is available on all servers, and each server knows which group it is in. Squid caches data (including old versions of pages, to avoid regenerating them from diffs) mostly in RAM; large disk caches might not be useful. All Squids are configured as siblings to get one big cache without requiring much disk space per machine.
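The mapping from title to resource group could look roughly like the sketch below. The server names, the group list and the choice of MD5 are assumptions for illustration only; the notes above only say "hash of title or similar".

```python
import hashlib

# Hypothetical group list; every server would hold the same copy.
RESOURCE_GROUPS = [
    ["store1", "store2", "store3"],
    ["store4", "store5", "store6"],
]

def resource_group(title, groups=RESOURCE_GROUPS):
    """Return the storage servers responsible for a page title."""
    # Use a stable hash so every node computes the same group.
    # (Python's built-in hash() is randomized per process.)
    digest = hashlib.md5(title.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(groups)
    return groups[index]
```

Note that a plain modulo remaps most titles when the number of groups changes, which is one reason a repartitioning tool would be needed.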

Storage side

A PUT/WebDAV-enabled Apache (Apache 2, or Apache 1.3 with mod_dav) with versioning support runs on each of the Apache nodes. It could even be the same Apache that also renders the web pages. Newer Apache versions support DeltaV natively, with the possibility of adding custom storage backends and (HTTP GET) URL accessors.

Unique version IDs, one file per revision that never changes

The ID is created on the client side before saving, from a high-resolution timestamp and a hash of the content. URLs look like urlencoded_title/20050102123456-mycontenthash, where urlencoded_title is a directory on the storage server and 20050102123456-mycontenthash is a file in that directory representing that version.
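A minimal sketch of building such an ID on the client. The function names and the use of SHA-1 are assumptions; the timestamp follows the 14-digit format of the example URL, and a higher-resolution variant could append microseconds.

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import quote

def version_id(content, now=None):
    """Build a 'timestamp-contenthash' ID for one revision (bytes in)."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d%H%M%S")        # e.g. 20050102123456
    digest = hashlib.sha1(content).hexdigest()  # hash choice is an assumption
    return f"{stamp}-{digest}"

def version_url(title, content, now=None):
    """Relative URL: one directory per title, one file per version."""
    return f"{quote(title, safe='')}/{version_id(content, now)}"
```

Because the ID depends only on time and content, every storage server in the group receives the same name for the same revision.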

Write failover managed by client, sync done automatically by storage servers

The wiki software writes to all members of the resource group at the same time, using the same time+hash-based version ID. If a write fails, it immediately sends 'you need to sync file x to failed server y' to all servers where the write succeeded. These servers log the request persistently and try to sync the failed server afterwards.

Possibly do lazy diffing at night: during the day, just store full versions, one file per version under its time/hash-based version name, and convert most of them to diffs later (keeping the first ten or so in full format). Store every nth version in full to avoid slow retrieval of old versions; for the others, delete the full version and replace it with 20050102123456-mycontenthash.diff.
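The client-managed write failover could be sketched as follows. The `put` and `notify` callables are hypothetical stand-ins for the HTTP PUT and for the 'you need to sync' message; the loop is sequential for brevity, whereas the notes call for writing to all members at the same time.

```python
def write_revision(servers, path, body, put, notify):
    """Write one revision to every server in the resource group.

    put(server, path, body) is assumed to raise OSError on failure.
    notify(good, path, bad) asks a healthy server to sync path to bad.
    """
    ok, failed = [], []
    for server in servers:
        try:
            put(server, path, body)
            ok.append(server)
        except OSError:
            failed.append(server)
    # Tell every server that has the file about every server that lacks it.
    for good in ok:
        for bad in failed:
            notify(good, path, bad)
    return ok, failed
```

If `ok` comes back empty, the revision was stored nowhere and the caller has to report the save as failed.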

Old version retrieval

Hook up the 404 handler (or something similar) to call a 'create that version' function that regenerates the version by applying diffs, starting from the last full version before it. If this fails, try to get the version from the other servers in the resource group and return that result.
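A rough sketch of the 'create that version' function. It assumes each diff is taken against the immediately preceding version, and that `read_full`, `read_diff` and `apply_diff` are supplied by the storage layer; all three names are hypothetical.

```python
def rebuild_version(target, versions, read_full, read_diff, apply_diff):
    """Regenerate one old version from the last full copy before it.

    versions: list of (version_id, kind) sorted by version_id,
              where kind is "full" or "diff".
    """
    ids = [vid for vid, _ in versions]
    pos = ids.index(target)
    # Walk back to the nearest version that is stored in full.
    start = pos
    while start >= 0 and versions[start][1] != "full":
        start -= 1
    if start < 0:
        raise LookupError("no full version at or before " + target)
    text = read_full(ids[start])
    # Apply each later diff in order until the target is reached.
    for vid in ids[start + 1:pos + 1]:
        text = apply_diff(text, read_diff(vid))
    return text
```

A caller could catch the exception and then ask the other servers in the resource group for the same version ID.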

Current version pointer

This could be just an index.html symlink to the latest version, updated automatically when that version is stored. To avoid concurrency problems, the symlinking code needs to check whether a version with a higher timestamp is already available in the repository. The Squids on the clients also need to be purged for the current-version URL.
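One way the symlink update might look on a POSIX filesystem. The function name and the temporary-link naming are assumptions; the comparison relies on version names sorting by their leading timestamp.

```python
import os

def update_current(page_dir, new_version, link_name="index.html"):
    """Point the current-version symlink at new_version unless a newer one exists."""
    names = [n for n in os.listdir(page_dir)
             if n != link_name and not n.startswith(".")]
    # Version names begin with a timestamp, so string order is time order.
    if any(n > new_version for n in names):
        return False  # a newer version is already stored; leave the link alone
    # Create the link under a temporary name, then rename it into place.
    # rename() replaces the old link atomically on POSIX systems.
    tmp = os.path.join(page_dir, ".tmp-" + new_version)
    os.symlink(new_version, tmp)
    os.replace(tmp, os.path.join(page_dir, link_name))
    return True
```

A small window remains between the check and the rename, so two writers racing within that window could still leave the link on the older of the two versions; a per-directory lock would close it.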

Links

  • MogileFS (http://www.danga.com/mogilefs/) -- similar approach also using HTTP PUT, stand-alone, written in Perl. No write-many; uses replication instead. Uses a DB to track the location of files. Used for images at LiveJournal.
  • XDelta (http://www.xcf.berkeley.edu/~jmacd/xdelta.html) -- better than diff, also supports binary files; rsync uses a related but separate delta algorithm
  • YAM (http://zerowing.idsoftware.com/yam/FrontPage) -- some related ideas and benchmarks, project appears dormant
  • Python DavLib (http://www.lyra.org/greg/python/davlib.py) and some sample code (http://www.lyra.org/greg/python/davtest.py)

Other thoughts

  • Would be good not to violate HTTP PUT/GET/WebDAV semantics.
  • A failover setup needs a proxy script that uses the same storage failover/resource group algorithms to enable external WebDAV access.
  • WebDAV has rich support for versioned metadata, requests, moving files etc.
  • Need for a tool to repartition resource groups (move to different servers), should be possible online