MultiPKFile(5)

Name

MultiPKFile - Vesta-2 cache server stable cache entry file format

Contents

   Introduction
   Grouping PKFiles into MultiPKFiles
   Design Goals and Assumptions
   The Structure of MultiPKFiles
   The Structure of PKFiles
   The Structure of Cache Entries
   Syntax Summary
   Analysis
   See Also
   Author

Introduction

The Vesta-2 cache server's cache entries will be stored persistently in the repository. To make the lookup process faster, the cache entries with the same primary key (PK) are grouped together in a common file. This man page describes the layout of these files (the so-called "PKFiles") on disk.

Each PK is a 128-bit fingerprint. The Vesta-2 system is being designed with the goal of servicing up to 60 million cache entries or so. Hence, the PK space is extremely sparse. Even so, our first observation was that the expected number of cache entries is too large for us to store the entries of each PK in its own PKFile; that would require tens of millions of separate files.

Grouping PKFiles into MultiPKFiles

We plan to group the entries for several different PK's together in repository files (the so-called "MultiPKFiles"). Since the PK's are fingerprints, and hence (hopefully) random, there is no structure to the PK's. Therefore, it doesn't matter how we group them for purposes of achieving better locality. Our plan is to pick three (integer) values k, k', and m such that (m * k') + k = 128, and divide the bits of the PK like this:

    k'  k'          k'                 k
  +---+---+- ... -+---+---------------------------------+
  |   |   |       |   |                                 |
  +---+---+- ... -+---+---------------------------------+

  |<---------------------- 128 bits ------------------->|
The m groups of k' bits (the prefix bits) will be used to form the pathname of a file in the repository, with each group of k' bits corresponding to an arc of the pathname. Hence, all PK's whose prefix bits agree will be stored in the same MultiPKFile.

Thus, the important choice here is the value for k. How to divide up the remaining 128 - k bits (the prefix) into pathname arcs depends on how the repository implements its directory structures.

We don't have a good analytic way to choose a value for k, since we don't have enough (any?) data to indicate how PK's and cache entries will be distributed. Most PK's will have only a single cache entry, while some will have many. Performance will be best if the MultiPKFiles are all roughly the same size. Given our design goal of 60M entries and a worst-case estimate of only one entry per PK, using the values k = 104, k' = 8, and m = 3 (i.e., 24 prefix bits, or roughly 16.8M possible MultiPKFiles), we'd expect approximately 4 PK's per MultiPKFile. A larger value for k might be more appropriate.
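
To make the arc construction concrete, here is a minimal sketch in C++ of how the m = 3 groups of k' = 8 prefix bits might be turned into the arcs of a MultiPKFile pathname. The PK representation and the hexadecimal arc names are assumptions for illustration only; the actual encoding is up to the repository.

   #include <cstdint>
   #include <cstdio>
   #include <string>

   // Hypothetical representation of a 128-bit PK as 16 bytes,
   // most-significant byte first.
   struct PK { uint8_t byte[16]; };

   // With k' = 8 and m = 3, the 24 prefix bits are just the first three
   // bytes of the PK; each byte becomes one arc of the pathname.
   std::string MultiPKFileArcs(const PK& pk) {
       char buf[16];
       std::snprintf(buf, sizeof(buf), "%02x/%02x/%02x",
                     (unsigned)pk.byte[0], (unsigned)pk.byte[1],
                     (unsigned)pk.byte[2]);
       return std::string(buf);
   }

All PK's that produce the same three arcs would then be stored in the same MultiPKFile.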

Since our first choice for k may be sub-optimal, we need to think some more about how to make the design flexible enough that we can change the value for k in the future. One possibility would be to use part of the MultiPKFile namespace to encode the version of the k value.

Design Goals and Assumptions

The rest of this note sketches the layout of MultiPKFiles on disk. The goal of the design is to minimize the number of different disk read operations on a cache Lookup operation, since the disk operations are the largest source of latency.

We've made several assumptions in formulating our design. First, MultiPKFiles will be rewritten in their entirety by background cache server threads. Hence, we haven't focused on designing incremental structures. Second, since no time-critical path of the cache server is dependent on the writing of MultiPKFiles, we haven't ruled out designs that might require fairly large amounts of computation at MultiPKFile-write time.

The Structure of MultiPKFiles

The overall structure of a MultiPKFile is:

   <MultiPKFile>    ::= <Header> <PKFile>*
(We describe the structure of a MultiPKFile by a BNF grammar. In the grammar that follows, elements in angle brackets denote non-terminals, and * denotes Kleene closure.)

Abstractly, the MultiPKFile <Header> provides a map from PK's to the offsets of the subsequent <PKFile>'s in the file.

   <Header>         ::= version num totalLen type-code <TypedHeader>
   <TypedHeader>    ::= <SeqHeader> | <HashHeader>
The version is stored in the <MultiPKFile> for backward compatibility, in case we decide to change some aspect of the representation in the future. num is the number of entries in the <Header>, as well as the number of <PKFile> records following the <Header> in the overall <MultiPKFile>. totalLen is the total length of the <MultiPKFile>; it can be used to compute the length of the last <PKFile> in the <MultiPKFile>.

We discussed three ways to represent the <Header>. Since we may want to mix these three schemes, the <Header> has an integer type-code indicating which representation is used for its <TypedHeader>. Here are the three schemes for representing the <TypedHeader>:

  1. If the number of <PKFile>'s in a <MultiPKFile> is small, we can simply list the entries and search them sequentially:
       <SeqHeader>      ::= <HeaderEntry>*
       <HeaderEntry>    ::= <PK> offset
       <PK>             ::= fingerprint
    
    For the <PK>, we only have to store the last k bits, since all PK's in the MultiPKFile will have their first 128 - k bits in common. The offset value is an integer representing the location of the corresponding <PKFile> relative to the start of the <MultiPKFile>.

    The PKFile's corresponding to the <HeaderEntry> records in a <SeqHeader> are written out in the order in which they appear in the <SeqHeader>. As a result, the offset values are monotonically increasing, and the length of a PKFile can be computed as the difference of consecutive offset values. The length of the last PKFile can be computed using the totalLen value of the <Header>.

  2. If there are a large number of <PKFile>'s within a <MultiPKFile>, we could use the same <SeqHeader> representation as in scheme 1, but store the <HeaderEntry>'s in sorted order by pk value. Then we could do binary search on the pk value to find the offset for its corresponding <PKFile> more quickly.

  3. We could use a simple hashing scheme. The hashing scheme could be parameterized by the table size. If we wanted to get really fancy, the hash function itself could be parameterized relative to some family of hash functions and chosen so that there were no collisions for that table. By our second assumption above, it would be okay to spend a fair amount of time searching for a perfect hash function when the MultiPKFile has to be rewritten. Here is one way to represent the hash table:
       <HashHeader>     ::= tblSize hashKey <Bucket>* <BucketEntry>*
       <Bucket>         ::= 0 | offset
       <BucketEntry>    ::= bucketSize <HeaderEntry>*
    
    The tblSize is the size of the hash table (i.e., the number of following <Bucket> records), and hashKey is a value that parameterizes the hash function used for this table. The <Bucket>'s are offsets (relative to the start of the <HashHeader>) where the <BucketEntry>'s for that bucket begin, or 0 if the bucket is empty. Each <BucketEntry> is some number of <HeaderEntry>'s, prefixed by the number of such entries (if we use perfect hash functions, the bucketSize would always be 1, and could be omitted).

We expect that we will probably use scheme 1 or 2 for the MultiPKFile <Header>.
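
As an illustration of scheme 2, the following sketch (in C++) shows how the cache server might locate a <PKFile> given an in-memory copy of a <SeqHeader> whose entries are sorted by pk. The type and field names are illustrative assumptions, not the actual implementation; the offset and length rules follow the description above.

   #include <cstdint>
   #include <optional>
   #include <vector>

   // Assumed in-memory form of one <HeaderEntry>: the last k bits of the
   // PK plus the offset of its <PKFile> within the MultiPKFile.
   struct HeaderEntry {
       uint64_t pkSuffix[2];  // the stored PK bits, most-significant word first
       uint32_t offset;
   };

   struct PKFileLoc { uint32_t offset; uint32_t length; };

   // Binary search over the sorted <HeaderEntry> list.  totalLen is the
   // totalLen field of the <Header>; it supplies the length of the last
   // <PKFile>, since no later offset follows it.
   std::optional<PKFileLoc>
   FindPKFile(const std::vector<HeaderEntry>& hdr, uint32_t totalLen,
              const uint64_t want[2]) {
       int lo = 0, hi = static_cast<int>(hdr.size()) - 1;
       while (lo <= hi) {
           int mid = lo + (hi - lo) / 2;
           const HeaderEntry& e = hdr[mid];
           if (e.pkSuffix[0] == want[0] && e.pkSuffix[1] == want[1]) {
               // A PKFile's length is the difference of consecutive offsets.
               uint32_t end = (mid + 1 < static_cast<int>(hdr.size()))
                                  ? hdr[mid + 1].offset : totalLen;
               return PKFileLoc{e.offset, end - e.offset};
           }
           bool less = e.pkSuffix[0] < want[0] ||
                       (e.pkSuffix[0] == want[0] && e.pkSuffix[1] < want[1]);
           if (less) lo = mid + 1; else hi = mid - 1;
       }
       return std::nullopt;  // no <PKFile> with this PK in the MultiPKFile
   }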

The Structure of PKFiles

The overall structure of a <PKFile> is:

   <PKFile>         ::= <PKFHeader> <PKEntries>*
   <PKEntries>      ::= numEntries <PKEntry>* <PKEntryExtra>*
The entries in a <PKFile> all have the same primary key. They are also grouped on the disk by a secondary key. There is one <PKEntries> record for each secondary key, and the <PKEntry>'s listed together under a single <PKEntries> record all have the same secondary key. There is one <PKEntryExtra> record for each <PKEntry>; they contain extra values that are not required for the Lookup operation.

What goes into a <PKEntries> record? An (abstract) cache entry has the following fields:

   ci             the entry's cache index, an integer that uniquely identifies the entry
   freeVars       the names of the entry's free variables (FV's)
   fps            the fingerprints of the entry's free variables
   cfp            the combined fingerprint of the entry's common FV's
   uncommonNames  the entry's uncommon free variable names
   ucfp           the combined fingerprint of the entry's uncommon FV's
   value          the entry's result value

The fps values are not needed to perform the Lookup cache operation, but they are required to recompute the cfp and ucfp values when the set of common FV's for the <PKFile> changes (see below).

According to the cache server specification, the cache also stores certain information in the <PKFile> that applies to all its entries. This information is stored in the <PKFHeader>:

   allNames     the names of all free variables that appear in any of the
                <PKFile>'s entries
   commonNames  a bit vector, interpreted relative to allNames, identifying
                the free variables common to all of the <PKFile>'s entries
   pkEpoch      the epoch (version) of the <PKFile>
   namesEpoch   the epoch (version) of the allNames list

The freeVars field within an entry is actually not stored explicitly. Instead, the names of each entry's free variables are stored implicitly by referencing the allNames field of the <PKFile>. The common names need not be stored, since they are represented by the commonNames bit vector in the <PKFHeader>. The remaining uncommon names are stored as a bit vector in the <PKEntry>; this bit vector is also interpreted relative to the allNames field, and differs from one entry to the next.
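
The following sketch shows how an entry's full set of free-variable names could be reconstructed from the allNames list and the two bit vectors. The word size and the in-memory types are assumptions for illustration.

   #include <cstdint>
   #include <string>
   #include <vector>

   // A <BitVector> as a sequence of words; bit i is bit (i % 64) of
   // word (i / 64).  The 64-bit word size is an assumption.
   using BitVector = std::vector<uint64_t>;

   static bool TestBit(const BitVector& bv, size_t i) {
       size_t w = i / 64, b = i % 64;
       return w < bv.size() && ((bv[w] >> b) & 1u) != 0;
   }

   // An entry's free variables are the <PKFile>'s common names plus the
   // entry's own uncommon names; both bit vectors are interpreted
   // relative to the allNames list.
   std::vector<std::string>
   EntryFreeVars(const std::vector<std::string>& allNames,
                 const BitVector& commonNames,
                 const BitVector& uncommonNames) {
       std::vector<std::string> fv;
       for (size_t i = 0; i < allNames.size(); i++) {
           if (TestBit(commonNames, i) || TestBit(uncommonNames, i))
               fv.push_back(allNames[i]);
       }
       return fv;
   }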

To check for a cache hit, the cache server computes the combined fingerprint of the fingerprints of the common FV's in the current evaluation environment. It compares this computed common fingerprint to the cfp fields of the entries. For each entry that matches, it computes a combined uncommon fingerprint from the fingerprints of those free variables in the current evaluation environment that match the entry's uncommon names, and compares it to the entry's ucfp field. If both the cfp and ucfp tests succeed, the entry's value is returned.

What's important to notice is that the cfp test can be made without any per-entry computation. Also, the only per-entry data required for this test is the value of the entry's cfp field. By contrast, the ucfp test requires the server to read both the uncommonNames bit vector and the ucfp values from the entry, and to compute a combined uncommon fingerprint. So the ucfp test is more expensive.
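
To make the two-level test concrete, here is a rough sketch of the hit-detection logic for one group of entries sharing a cfp value. It makes several simplifying assumptions: fingerprints are plain 16-byte values, every name in allNames is bound in the current evaluation environment, the entries' uncommon names are given as index lists rather than bit vectors, and combineFPs is only a stand-in for the cache's real fingerprint-combining operation.

   #include <array>
   #include <cstdint>
   #include <vector>

   using FP = std::array<uint8_t, 16>;   // a 16-byte fingerprint (assumed form)

   // Stand-in for the real fingerprint-combining operation; an XOR fold
   // is used here only so the sketch is self-contained.
   FP combineFPs(const std::vector<FP>& fps) {
       FP r{};  // zero-initialized
       for (const FP& f : fps)
           for (size_t i = 0; i < r.size(); i++) r[i] ^= f[i];
       return r;
   }

   struct Entry {                      // per-entry data needed for the ucfp test
       std::vector<int> uncommonNames; // indices into allNames (simplified form
                                       // of the uncommonNames bit vector)
       FP ucfp;                        // the entry's combined uncommon fingerprint
       int ci;                         // the entry's cache index
   };

   // env[i] is the fingerprint of allNames[i] in the current evaluation
   // environment.  Returns the ci of a matching entry, or -1 on a miss.
   int Lookup(const std::vector<FP>& env,
              const std::vector<int>& commonNames,  // indices of the common FV's
              const FP& storedCFP,                  // the group's cfp value
              const std::vector<Entry>& entries) {  // entries sharing that cfp
       // The cfp test: one combined fingerprint for the whole group.
       std::vector<FP> commonFPs;
       for (int i : commonNames) commonFPs.push_back(env[i]);
       if (combineFPs(commonFPs) != storedCFP) return -1;

       // The ucfp test: per-entry work, combining the fingerprints of the
       // entry's uncommon names and comparing against its stored ucfp.
       for (const Entry& e : entries) {
           std::vector<FP> uncommonFPs;
           for (int i : e.uncommonNames) uncommonFPs.push_back(env[i]);
           if (combineFPs(uncommonFPs) == e.ucfp) return e.ci;  // hit
       }
       return -1;  // miss
   }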

We decided that the entry's cfp fields should also be stored in the <PKFHeader>. The cfp computed for the common free variables in the current evaluation environment thus serves as the secondary search key. Each of these fingerprints is 16 bytes, so the hope is that the <PKFHeader> will still be small enough that it can be read by a single disk operation.

Here is the layout of the <PKFHeader>:

   <PKFHeader>      ::= <CFPHeader> <PKFHeaderTail>
   <CFPHeader>      ::= num type-code <CFPTypedHeader>
   <PKFHeaderTail>  ::= pkEpoch namesEpoch <AllNames> <CommonNames>
   <AllNames>       ::= numNames <Name>*
   <Name>           ::= nameLen byte*
   <CommonNames>    ::= <BitVector>
   <BitVector>      ::= numWords word*
As before, there are several possibilities for the design of the <CFPTypedHeader>. Abstractly, the <CFPHeader> is a table that maps a common fingerprint value (16 bytes) to a list of cache entries with the same PK and cfp values. These designs are similar to the 3 <TypedHeader> designs above, except that the schemes are generalized to accommodate lists of entries.
   <CFPTypedHeader> ::= <CFPSeqHeader> | <CFPHashHeader>
  1. If the number of entries in a PKFile is small, we can simply search them sequentially, using the <SeqHeader> structure described above.
       <CFPSeqHeader>   ::= <CFPHeaderEntry>*
       <CFPHeaderEntry> ::= <CFP> offset
       <CFP>            ::= fingerprint
    
    A <CFPSeqHeader> consists of num <CFPHeaderEntry> records. The offset of a <CFPHeaderEntry> is the offset where its corresponding <PKEntries> record begins. The offset is taken relative to the start of either the <PKFHeader> or the <CFPHeader> (whichever we think is easier to compute against).

  2. If a PKFile has several entries, we can use the <CFPSeqHeader> representation, but store the <CFPHeaderEntry>'s in sorted order by cfp. This would allow us to use binary search on the table.

  3. Another alternative when there are lots of entries with the same PK is to use a hash table, using the <HashHeader> structure described above:
       <CFPHashHeader>  ::= len tblSize hashKey <Bucket>* <CFPBucketEntry>*
       <Bucket>         ::= 0 | offset
       <CFPBucketEntry> ::= bucketSize <CFPHeaderEntry>*
    
    The len field in the <CFPHashHeader> is the total size of the <CFPHashHeader>, so we can seek past the <CFPHeader> to the <AllNames> record without having to inspect the bucketSize values of the individual <CFPBucketEntry>'s.

We expect that we will use scheme 2 or 3 for the <CFPHeader>.
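
Here is a sketch of the bucket lookup for scheme 3, assuming the <CFPHashHeader> has already been read and parsed into memory (on disk, each <Bucket> holds an offset relative to the start of the <CFPHashHeader>; the parsed per-bucket lists below stand in for following those offsets). The hash function shown is only a placeholder parameterized by hashKey.

   #include <array>
   #include <cstdint>
   #include <vector>

   using FP = std::array<uint8_t, 16>;

   struct CFPHeaderEntry { FP cfp; uint32_t offset; };

   struct CFPHashHeader {                       // parsed, in-memory form
       uint32_t tblSize;
       uint32_t hashKey;
       std::vector<uint32_t> bucket;            // tblSize offsets; 0 = empty bucket
       std::vector<std::vector<CFPHeaderEntry>> entries;  // one list per bucket
   };

   // Placeholder hash parameterized by hashKey; the real function (or a
   // collision-free one chosen at MultiPKFile-write time) is unspecified here.
   static uint32_t HashCFP(const FP& cfp, uint32_t hashKey, uint32_t tblSize) {
       uint32_t h = hashKey;
       for (uint8_t b : cfp) h = h * 16777619u + b;
       return h % tblSize;
   }

   // Returns the offset of the <PKEntries> record for `want`, or 0 if no
   // entries with that common fingerprint exist in this <PKFile>.
   uint32_t FindPKEntriesOffset(const CFPHashHeader& h, const FP& want) {
       uint32_t b = HashCFP(want, h.hashKey, h.tblSize);
       if (h.bucket[b] == 0) return 0;          // empty bucket
       for (const CFPHeaderEntry& e : h.entries[b])
           if (e.cfp == want) return e.offset;
       return 0;
   }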

The Structure of Cache Entries

All that's left is to describe the format of the cache entries themselves. The data in a cache entry is split between the <PKEntry> data required to perform the Lookup operation, and the <PKEntryExtra> information required when MultiPKFiles are rewritten.

Here is the structure of a <PKEntry> record:

   <PKEntry>        ::= ci <UncommonNames> <UCFP> <Value>
   <UncommonNames>  ::= <BitVector>
   <UCFP>           ::= xortag fingerprint
   <Value>          ::= numBytes byte*
The ci and numBytes fields are integers, and the xortag is a one-word tag formed by XOR'ing together the uncommon fingerprints. The <UncommonNames> bit vector is interpreted relative to the <AllNames> record of the corresponding <PKFHeader>. The <Value> is simply a sequence of numBytes bytes.
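
For concreteness, a <PKEntry> record might correspond to an in-memory structure like the one below. The integer widths are assumptions, since the grammar does not fix them.

   #include <cstdint>
   #include <vector>

   // In-memory form of one <PKEntry>, following the grammar above.
   struct PKEntry {
       uint32_t ci;                          // the entry's cache index
       std::vector<uint64_t> uncommonNames;  // <BitVector>: numWords words
       uint64_t xortag;                      // one-word XOR tag over the
                                             // uncommon fingerprints
       uint8_t ucfp[16];                     // the combined uncommon fingerprint
       std::vector<uint8_t> value;           // the numBytes bytes of <Value>
   };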

Here is the structure of a <PKEntryExtra> record:

   <PKEntryExtra>   ::= numNames index* fingerprint*
The <PKEntryExtra> encodes the fps field of an entry (the fingerprints of the entry's free variables). The numNames value is an integer, followed by two lists of numNames elements each. The index values are integers that map positions in the fingerprint list back to positions in the <PKFile> header's <AllNames> list (that is, the inverse of the mapping from <AllNames> indices to fingerprint indices). If we denote the jth element of a list list by list[j], then for all values i in the half-open interval [0, numNames), the fingerprint of the name AllNames[index[i]] is fingerprint[i].
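
A short sketch of how the two lists recover an entry's per-name fingerprints, producing a map keyed by position in the <PKFile>'s <AllNames> list. The types and the function name are illustrative.

   #include <array>
   #include <cstdint>
   #include <map>
   #include <vector>

   using FP = std::array<uint8_t, 16>;

   // Given the parsed <PKEntryExtra> lists (both of length numNames),
   // build the mapping fpOf such that fpOf[index[i]] = fingerprint[i],
   // i.e., the fingerprint of the name AllNames[index[i]].
   std::map<uint32_t, FP>
   FingerprintsByAllNamesIndex(const std::vector<uint32_t>& index,
                               const std::vector<FP>& fingerprint) {
       std::map<uint32_t, FP> fpOf;
       for (size_t i = 0; i < index.size(); i++)
           fpOf[index[i]] = fingerprint[i];
       return fpOf;
   }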

Syntax Summary

For reference, here is the complete MultiPKFile grammar:

   <MultiPKFile>    ::= <Header> <PKFile>*

   <Header>         ::= version num totalLen type-code <TypedHeader>
   <TypedHeader>    ::= <SeqHeader> | <HashHeader>
   <SeqHeader>      ::= <HeaderEntry>*
   <HashHeader>     ::= tblSize hashKey <Bucket>* <BucketEntry>*
   <Bucket>         ::= 0 | offset
   <BucketEntry>    ::= bucketSize <HeaderEntry>*
   <HeaderEntry>    ::= <PK> offset
   <PK>             ::= fingerprint

   <PKFile>         ::= <PKFHeader> <PKEntries>*

   <PKFHeader>      ::= <CFPHeader> <PKFHeaderTail>
   <CFPHeader>      ::= num type-code <CFPTypedHeader>
   <PKFHeaderTail>  ::= pkEpoch namesEpoch <AllNames> <CommonNames>
   <CFPTypedHeader> ::= <CFPSeqHeader> | <CFPHashHeader>
   <CFPSeqHeader>   ::= <CFPHeaderEntry>*
   <CFPHashHeader>  ::= len tblSize hashKey <Bucket>* <CFPBucketEntry>*
   <Bucket>         ::= 0 | offset
   <CFPBucketEntry> ::= bucketSize <CFPHeaderEntry>*
   <CFPHeaderEntry> ::= <CFP> offset
   <CFP>            ::= fingerprint
   <AllNames>       ::= numNames <Name>*
   <Name>           ::= nameLen byte*
   <CommonNames>    ::= <BitVector>
   <BitVector>      ::= numWords word*

   <PKEntries>      ::= numEntries <PKEntry>* <PKEntryExtra>*
   <PKEntry>        ::= ci <UncommonNames> <UCFP> <Value>
   <UncommonNames>  ::= <BitVector>
   <UCFP>           ::= xortag fingerprint
   <Value>          ::= numBytes byte*
   <PKEntryExtra>   ::= numNames index* fingerprint*

Analysis

To determine how well this design meets our goal of minimizing the number of different disk reads per Lookup, we must first estimate the sizes of the various data records.

First, consider the <Header> record. Most of its space is devoted to the <HeaderEntry>'s, which are at most 20 bytes (16 bytes for the pk and 4 bytes for the offset). So, the size of the <Header> is roughly "20 * n", where n is the number of <PKFile>'s in the <MultiPKFile>. If the <HashHeader> representation is used, then some more bytes are required by the <Bucket>'s. If each offset is 2 bytes and the table is only half loaded, then "4 * n" bytes are required for the <Bucket>'s.

The operating system buffers disk operations, transferring data in blocks on the order of 8K bytes in size. So we can search the <Header> with a single disk read if the MultiPKFile contains at most on the order of 340 PKFiles (8K bytes per block divided by roughly 24 bytes per <HeaderEntry>). We don't expect to have nearly that many PKFile's per MultiPKFile.

On to the <PKFHeader>. As with the <Header>, most of the space is devoted to the <CFPHeader> table mapping common fingerprints to the offsets of the corresponding <PKEntries> records. These tables have the same size as the <Header> records, about "24 * n", where n is the number of entries in the table. In this case, however, we expect n to be quite a bit larger, so multiple disk reads may be required to complete this part of the Lookup operation. By storing the epoch value first, the cache can report an FVMismatch result without doing further disk reads.

The <AllNames> record may be quite long -- perhaps on the order of 600 bytes or more, assuming 30 20-character names. We might be able to do some simple compression on the names in an <AllNames> record, since we expect that many of them will share common prefixes.

Since the <AllNames> and <CommonNames> records only have to be read when those values have not been stored in the cache's in-memory data structures, it seems a shame to interpose them between the <CFPHeader> and the <PKEntries>. We might be able to reduce the number of disk reads in some cases by storing the <AllNames> and <CommonNames> records after the <PKEntries>, at the expense of an extra field in the <PKFHeader> indicating where the <AllNames> record begins.

The <PKEntry>'s themselves are relatively small. The ci, <UncommonNames>, and <UCFP> records require about 28 bytes. The <Value> record varies from entry to entry. Most <Value> records will be pointers to derived files in the repository, requiring only 20 bytes. But a small fraction of them will be pickled Vesta values, which can be quite large. Since the <Value> records are only required in the event of a cache hit, it might be better to store all of the <Value> records together after the <PKEntry>'s at the cost of an extra integer offset in the <PKEntries> record indicating where the <Value> records begin.

In a similar vein, notice how the <PKEntryExtra>'s have been stored after the <PKEntry>'s. This is because the <PKEntryExtra>'s are not required on the Lookup path; storing their values within the <PKEntry>'s would hurt data locality and increase the number of disk reads required per Lookup.

The net result of this design is that we expect that misses in the cache can be determined in as few as 2 disk reads, but usually 3. We expect hits in the cache to require 4 or more disk reads.

See Also

PrintMPKFile(1), VCache(1)

Author

Allan Heydon (caheydon@yahoo.com)

Last modified on Thu Nov  8 12:48:54 EST 2001 by ken@xorian.net
     modified on Sun Apr  6 18:10:03 PDT 1997 by heydon
This page was generated automatically by mtex software.