Data Requirements

Currently Sequins supports two input file formats:

SequenceFiles
Sparkey files

There are also a few guidelines for producing data for Sequins.

SequenceFiles

The default input format for Sequins is SequenceFiles. These are typically generated by Hadoop Map/Reduce.

Naming

Sequins doesn't care how your SequenceFiles are named; it will ingest every file in a version directory.

Compression

You can use either BLOCK or RECORD compression for SequenceFiles, or leave them uncompressed. Both snappy and gzip are supported.

Sharding

Sequins uses a sharding algorithm in which keys are bucketed into N partitions, where the number of partitions is the number of files in the dataset. To do this, it uses a hash function based on Java's String hashCode function, chosen because it aligns coincidentally with the way Hadoop often buckets keys into reducers.

In practice, this means that if the Hadoop job that outputs the data has a reduce step, and the reduce key is a String with the same value as the output key, then your data is effectively pre-sharded when it's written out. Each individual Sequins instance can then download just the data it needs, rather than scanning everything and picking out just the keys it wants.

While this obviously isn't a hard requirement, it can make loading new data much faster.
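As a rough illustration of the bucketing described above, the sketch below maps a String key to one of N partitions. It assumes the hash is exactly String.hashCode() and that the sign bit is masked off before taking the modulo, as Hadoop's HashPartitioner does. The class and method names are hypothetical, and the real Sequins implementation may differ in detail.

```java
final class PartitionSketch {
    // Assumption: mirrors Hadoop's HashPartitioner, which masks the sign bit
    // so that a negative hashCode never yields a negative partition number.
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // i.e. the number of files in the dataset
        for (String key : new String[] {"foo", "bar", "baz"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

If the reducer key and the output key hash to the same partition, every key in a given output file falls into the same bucket, which is what lets each instance fetch only the files it is responsible for.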

Key and value serialization

Hadoop often writes SequenceFiles with individual key and value serializations; in this way, SequenceFile is generic.

Sequins supports any key and value serialization, but has some optimizations for the common ones. In particular, org.apache.hadoop.io.BytesWritable and org.apache.hadoop.io.Text are unwrapped, and the actual underlying bytes are used. For example, if you have a SequenceFile[BytesWritable, Text] and a record is saved as:

context.write(new BytesWritable("foo".getBytes("ASCII")), new Text("bar"))

Then you can query the value at /mydata/foo, as you'd expect.

However, with other serializations, Sequins doesn't do any conversion for you, and you'll need to consider how the data actually looks on disk. This can be a bit tricky. Let's say you have a SequenceFile[IntWritable, IntWritable] and you write the tuple (42, 100). Hadoop serializes an IntWritable as four big-endian bytes, in effect an unsigned int32¹, so you'd need to percent-encode those bytes in the query:

$ curl localhost:9599/mydata/%00%00%00%2A | hexdump
0000000 00 00 00 64

¹. IntWritable represents a signed int, but it's cast first, so -42 would be %FF%FF%FF%D6.
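To build such a query path programmatically, something like the following sketch could be used. It writes an int as four big-endian bytes, the same layout java.io.DataOutput.writeInt produces, and percent-encodes each byte. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

final class IntKeyEncoding {
    // Percent-encodes the 4-byte big-endian form of an int for use in a URL path.
    static String percentEncodeInt(int value) {
        // ByteBuffer is big-endian by default, matching DataOutput.writeInt.
        byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("/mydata/" + percentEncodeInt(42));  // /mydata/%00%00%00%2A
        System.out.println("/mydata/" + percentEncodeInt(-42)); // /mydata/%FF%FF%FF%D6
    }
}
```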

Sparkey files

Sequins also supports Sparkey files as an input format with faster loading.

Ingesting SequenceFiles is often CPU-limited, as Sequins has to read the whole file and convert it to its internal storage format. This is also inefficient, as it has to be done on every single Sequins node, rather than once by whatever generates the data. If you provide your data in Sparkey format, Sequins can load your data about ten times faster.

Format details

The input directory should contain pairs of Sparkey log files and compressed Sparkey index files.

The Sparkey log files, with extension .spl, contain a list of keys and values. The keys in this file should be in lexicographically-sorted order.

The compressed Sparkey index files, with extension .spi.sz, associate keys with their locations in the log file.

Naming

Each log/index file pair must have the exact same basename, but with different extensions.

The name of each file must contain at least one sequence of digits. The first such run of digits must identify which key partition the file is for. For example, the file part-00012-00034.spl is for partition number 12.

It is legal to have multiple pairs of files from the same partition, for example with names part-00012-00034.spl and part-00012-00035.spl. However, such files must contain non-overlapping key ranges.
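The naming rule above can be sketched as a small helper that takes the first run of digits in a file name as the partition number. This is only an illustration: the class and method names are hypothetical, and it expects a bare file name rather than a full path, since digits in a directory name would otherwise be picked up first.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SparkeyFileName {
    private static final Pattern FIRST_DIGITS = Pattern.compile("\\d+");

    // Returns the partition number encoded in the first run of digits,
    // or an empty result if the name contains no digits at all.
    // Very long digit runs that overflow an int would throw NumberFormatException.
    static OptionalInt partitionOf(String fileName) {
        Matcher m = FIRST_DIGITS.matcher(fileName);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group()))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(partitionOf("part-00012-00034.spl"));    // OptionalInt[12]
        System.out.println(partitionOf("part-00012-00035.spi.sz")); // OptionalInt[12]
    }
}
```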

Compression

The Sparkey log files must use Sparkey's built-in Snappy compression. The index files must be compressed with framed-Snappy compression.

Sharding

Each log/index file pair must contain keys only from a single key partition. This means that a file with partition number 12 should only contain a key K if hashCode(K) % numPartitions == 12.

Key and value serialization

Sparkey files should contain data of type BytesWritable and Text pre-unwrapped. In other words, new Text("foo") should be stored as just foo.

However, other Writable types such as IntWritable should be stored in their canonical serialization format.

Generation

While SequenceFiles are trivial to generate in Hadoop, it's slightly more complex to generate Sparkey files. We've provided a Sparkey-generating version of the Hadoop Word Count example, to help you along.

Guidelines

While not required, Sequins operates best when the number and sizes of input files are under certain thresholds.

The number of input files (and hence partitions) affects the stability of ZooKeeper. A maximum of 512 input files per version is a good threshold.

Input files that are very large are more likely to hang or fail during downloads. A good threshold here depends on your download speeds, but we prefer files below 5 GiB in size.
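If you want to check a version against these guidelines before publishing it, a sketch along the following lines may help. It takes a map from file name to size in bytes and returns a list of warnings. The thresholds are the ones suggested above; the class name, method name, and message wording are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class VersionGuidelines {
    static final int MAX_FILES = 512;           // suggested files per version
    static final long MAX_FILE_BYTES = 5L << 30; // 5 GiB

    // Returns one warning per guideline that the given set of files exceeds.
    static List<String> warnings(Map<String, Long> fileSizes) {
        List<String> out = new ArrayList<>();
        if (fileSizes.size() > MAX_FILES) {
            out.add("too many input files: " + fileSizes.size());
        }
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            if (e.getValue() >= MAX_FILE_BYTES) {
                out.add("large input file: " + e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = Map.of("part-00000", 1L << 30, "part-00001", 6L << 30);
        warnings(sizes).forEach(System.out::println);
    }
}
```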
