Currently, Sequins supports two input file formats: SequenceFiles and Sparkey files. There are also a few guidelines for producing data for Sequins.
The default input format for Sequins is SequenceFiles. These are typically generated by Hadoop Map/Reduce.
Sequins doesn't care at all how your SequenceFiles are named, and will ingest any files in a version directory.
You can use either RECORD compression for SequenceFiles, or leave them uncompressed. Both the Snappy and Gzip codecs are supported.
Sequins uses a sharding algorithm in which keys are bucketed into N partitions, where N is the number of files in the dataset. To do this, it uses a hash function based on Java's String hashCode function, chosen because it coincides with the way Hadoop often buckets keys into reducers.
In practice, this means that if the Hadoop job outputting the data has a reduce step, and the reduce key is a String with the same value as the output key, then your data is effectively pre-sharded when it's written out. Each individual Sequins instance can then download just the data it needs, rather than scanning everything and picking out the keys it wants. While this isn't a hard requirement, it can make loading new data much faster.
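That bucketing can be sketched in a few lines. The sketch below mirrors Hadoop's default HashPartitioner for String keys; Sequins' hash is only described as being based on String's hashCode, so treat this as an illustration rather than Sequins' exact implementation:

```java
// Sketch of Hadoop's default HashPartitioner logic for String keys.
// Data bucketed this way by a reduce step lines up with how Sequins
// assigns keys to partitions.
public class PartitionSketch {
    static int partition(String key, int numPartitions) {
        // Mask off the sign bit so negative hash codes still map to a
        // valid, non-negative partition number.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // "foo".hashCode() == 101574, so with 16 partitions this prints 6.
        System.out.println(partition("foo", 16));
    }
}
```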
Key and value serialization
Hadoop writes SequenceFiles with pluggable key and value serializations; in this sense, SequenceFile is a generic container format. Sequins supports any key and value serialization, but has some optimizations for the common ones. In particular, values of type org.apache.hadoop.io.BytesWritable and org.apache.hadoop.io.Text are unwrapped, and the actual underlying bytes are used. For example, if you have a SequenceFile[BytesWritable, Text] and a record is saved as:
```java
context.write(new BytesWritable("foo".getBytes("ASCII")), new Text("bar"))
```
Then you can query the value at
/mydata/foo, as you'd expect.
However, with other serializations, Sequins doesn't do any converting for you, and you'll need to consider how the data actually looks on disk. This can be a bit tricky. Let's say you have a SequenceFile[IntWritable, IntWritable] and you write the tuple (42, 100). Hadoop serializes an IntWritable as an unsigned int32¹, so you'd need to query it as such:
```shell
$ curl localhost:9599/mydata/%00%00%00%2A | hexdump
0000000 00 00 00 64
```
1. IntWritable represents a signed int, but it's cast first; so -42 would be %FF%FF%FF%D6.
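The bytes in the example come from IntWritable's on-disk form, which is the big-endian four-byte encoding written by DataOutput.writeInt; that's why 42 is queried as %00%00%00%2A. A minimal sketch using only the standard library (IntWritableBytes is a hypothetical helper, not part of Hadoop or Sequins):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class IntWritableBytes {
    // Reproduce IntWritable's serialization: DataOutput.writeInt emits
    // the four bytes of the int in big-endian order.
    static byte[] encode(int value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeInt(value);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new AssertionError(e); // an in-memory stream never throws
        }
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode(42)) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints: 00 00 00 2a
    }
}
```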
Sequins also supports Sparkey files as an input format with faster loading.
Ingesting SequenceFiles is often CPU-limited, as Sequins has to read the whole file and convert it to its internal storage format. This is also inefficient, as it has to be done on every single Sequins node, rather than once by whatever generates the data. If you provide your data in Sparkey format, Sequins can load your data about ten times faster.
The input directory should contain pairs of Sparkey log files and compressed Sparkey index files. The Sparkey log files, with extension .spl, contain a list of keys and values; the keys in each file should be in lexicographically sorted order. The compressed Sparkey index files, with extension .spi.sz, associate keys with their locations in the log file.
Each log/index file pair must have the exact same basename, but with different extensions. The name of each file must contain at least one run of digits, and the first such run identifies which key partition the file is for. For example, the file part-00012-00034.spl is for partition number 12.
It is legal to have multiple pairs of files from the same partition, for example
part-00012-00035.spl. However, such
files must contain non-overlapping key ranges.
The Sparkey log files must use Sparkey's built-in Snappy compression. The index files must be compressed with framed-Snappy compression.
Each log/index file pair must contain keys only from a single key partition.
This means that a file with partition number 12 should only contain a key K if
hashCode(K) % numPartitions == 12.
Key and value serialization
Sparkey files should contain BytesWritable and Text data pre-unwrapped. In other words, new Text("foo") should be stored as just the raw bytes foo. However, other Writable formats, such as IntWritable, should be stored in their canonical serialization format.
While SequenceFiles are trivial to generate in Hadoop, it's slightly more complex to generate Sparkey files. We've provided a Sparkey-generating version of the Hadoop Word Count example, to help you along.
While not required, Sequins operates best when the number and sizes of input files are under certain thresholds.
The number of input files (and hence partitions) affects the stability of ZooKeeper. A maximum of 512 input files per version is a good threshold.
Input files that are very large are more likely to hang or fail during downloads. A good threshold here depends on your download speeds, but we prefer files below 5 GiB in size.