Getting large collections into Lucene via SequenceFiles

I had a problem while setting up my indexing job for the ClueWeb09 collection.  I wanted to do it from a MapReduce job: take each document, break it down into indexable chunks (title, URL, metadata, text, anchors) keyed by URL, sort around to accumulate the anchors for each document, then dump to an IndexWriter writing to local storage in the reduce phase.

Great idea, except that Lucene would spend long periods juggling files in the FSDirectory, and the reduce tasks would time out.  My cluster would run 36 reduce tasks, and each one at the end (using the different process I'm about to describe) produced a 55GB index of around 14 million documents.  Maybe this was expected, maybe not.

So here's what I did instead: the indexable chunks are sorted down in the reduce phase and written out to LZO-compressed SequenceFiles.  Here's how it works... this code is too closely tied to our larger IR system to post easily to a repository, so you'll have to make do with snippets and pseudocode this time around.

The Mapper job takes each document, parses out the headers, text, anchors, bacon, and coffee, and puts each one into what I call a ParseTuple.  A ParseTuple is a (key, value) pair.  Note that this is a simple WritableComparable.

import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class ParseTuple implements WritableComparable<ParseTuple> {
    protected String label;
    protected String data;

    public ParseTuple() {
        label = null;
        data = null;
    }

    public ParseTuple(String a, String b) {
        label = a;
        data = b;
    }

    public void write(DataOutput out) throws IOException {
        Text.writeString(out, label);
        Text.writeString(out, data);
    }

    public void readFields(DataInput in) throws IOException {
        label = Text.readString(in);
        data = Text.readString(in);
    }

    public static ParseTuple read(DataInput in) throws IOException {
        ParseTuple pt = new ParseTuple();
        pt.readFields(in);
        return pt;
    }

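    public int compareTo(ParseTuple other) {
        // "ptext" sorts ahead of every other label; two "ptext" tuples
        // compare as equal so that the ordering stays consistent.
        boolean thisText = label.equals("ptext");
        boolean otherText = other.label.equals("ptext");
        if (thisText && otherText)
            return 0;
        else if (thisText)
            return -1;
        else if (otherText)
            return 1;
        else
            return label.compareTo(other.label);
    }

    public String toString() {
        return label + ":" + data;
    }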
}
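The ordering rule in compareTo ("ptext" ahead of everything else, the remaining labels in plain string order) can be tried in isolation.  Below is a minimal stand-alone sketch that applies the same rule to bare label strings, so it runs without Hadoop.  The class name LabelOrderSketch and its helper are made up for illustration and are not part of the indexing code; the sketch also treats two "ptext" labels as equal, which is what the Comparable contract asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone demo of the label ordering; not part of the job.
class LabelOrderSketch {
    // "ptext" sorts before every other label, two "ptext" labels are equal,
    // and everything else falls back to ordinary string order.
    static final Comparator<String> PTEXT_FIRST = (a, b) -> {
        boolean aText = a.equals("ptext");
        boolean bText = b.equals("ptext");
        if (aText && bText) return 0;
        if (aText) return -1;
        if (bText) return 1;
        return a.compareTo(b);
    };

    // Returns a sorted copy so the caller's list is left untouched.
    static List<String> sorted(List<String> labels) {
        List<String> copy = new ArrayList<>(labels);
        copy.sort(PTEXT_FIRST);
        return copy;
    }

    public static void main(String[] args) {
        // prints [ptext, anchor, title, url]
        System.out.println(sorted(Arrays.asList("title", "anchor", "ptext", "url")));
    }
}
```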

The Mapper class as you know emits (key, value) pairs.  In this case, the mapper maps documents to (URL, ParseTuple) pairs.  The keys inside each ParseTuple are used to identify how I want each tuple indexed - what Lucene Field, essentially, the data should go into.

The Reducer is going to compile all the ParseTuples for a given URL into what I call a DocBits.  Again, it's a simple serializable Map, so that the reducer can emit a single value for each URL.  Here's DocBits:

import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class DocBits implements Writable {
    protected HashMap<String, String> map;

    public DocBits() {
        map = new HashMap<String, String>();
    }

    public void add(String k, String v) {
        map.put(k, v);
    }

    public void add(ParseTuple p) {
        map.put(p.label, p.data);
    }

    public void write(DataOutput out) throws IOException {
        for (Map.Entry<String, String> e : map.entrySet()) {
            Text.writeString(out, e.getKey());
            Text.writeString(out, e.getValue());
        }
        Text.writeString(out, "EOD");
    }

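    public void readFields(DataInput in) throws IOException {
        // Hadoop reuses Writable instances, so drop any fields left over
        // from the previous record before reading this one.
        map.clear();
        while (true) {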
            String k = Text.readString(in);
            if (k.equals("EOD"))
                break;
            String v = Text.readString(in);
            map.put(k, v);
        }
    }

    public static DocBits read(DataInput in) throws IOException {
        DocBits db = new DocBits();
        db.readFields(in);
        return db;
    }
}
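DocBits frames its fields with a sentinel: key/value strings are written back to back, and the reader stops when it sees the key "EOD".  Here is a minimal sketch of that framing using only the JDK, so it can be run without Hadoop.  It assumes DataOutputStream.writeUTF and DataInputStream.readUTF as stand-ins for Text.writeString and Text.readString (writeUTF is limited to 64KB per string, which Text is not), and the class name SentinelFramingSketch is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone demo of sentinel-terminated key/value framing.
class SentinelFramingSketch {
    static final String END = "EOD";

    // Write every (key, value) pair, then the sentinel key.
    static byte[] write(Map<String, String> fields) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (Map.Entry<String, String> e : fields.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            out.writeUTF(END);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read pairs until the sentinel key shows up.
    static Map<String, String> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            Map<String, String> fields = new LinkedHashMap<>();
            while (true) {
                String k = in.readUTF();
                if (k.equals(END)) break;
                fields.put(k, in.readUTF());
            }
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.com/");
        doc.put("title", "Example");
        System.out.println(read(write(doc)).equals(doc)); // prints true
    }
}
```

One consequence of this design is that a field whose key is literally "EOD" would cut the record short.  Writing the number of entries first and then reading exactly that many pairs avoids the reserved key, at the cost of four extra bytes per record.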

The Reducer is very simple:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceParseTuplesToDocBits
    extends Reducer<Text, ParseTuple, Text, DocBits> {
   
    public void reduce(Text key, Iterable<ParseTuple> values,
                       Context context)
        throws IOException, InterruptedException {
       
        DocBits doc = new DocBits();
        StringBuilder anchors = new StringBuilder(65536);
        StringBuilder title = new StringBuilder(8096);
        boolean have_parsed_doc = false;

        doc.add("url", key.toString());
        context.setStatus(key.toString());

        for (ParseTuple t: values) {
            if (t.label.equals("ptext")) {
                doc.add("ptext", t.data);
                have_parsed_doc = true;

            } else if (t.label.equals("title"))
                title.append(t.data).append(" ");
           
            else if (t.label.equals("anchor") && anchors.length() < 65536)
                anchors.append(t.data).append(" ");

            else
                doc.add(t);

            context.progress();
        }

        if (have_parsed_doc) {
            if (anchors.length() > 0)
                doc.add("anchor", anchors.toString());
            if (title.length() > 0)
                doc.add("title", title.toString());

            context.write(key, doc);
        }
    }
}

The reducer has to be a little clever, because the Mapper will emit anchortext for documents that aren't in the collection.  We could actually index that if we wanted to, but we don't, since our job is only to search within the collection.
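The filtering rule (keep a URL only if its parsed text showed up, and fold all of its anchors into one field) can be tried outside Hadoop.  What follows is a small in-memory sketch of that rule, not the reducer itself: tuples are plain {url, label, data} string arrays, grouping is done with a map rather than by the shuffle, and the class name AnchorFilterSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory demo of "drop URLs that only have anchor text".
class AnchorFilterSketch {
    // Each tuple is {url, label, data}.
    static Map<String, Map<String, String>> assemble(List<String[]> tuples) {
        Map<String, Map<String, String>> docs = new LinkedHashMap<>();
        Map<String, StringBuilder> anchors = new LinkedHashMap<>();
        for (String[] t : tuples) {
            String url = t[0], label = t[1], data = t[2];
            if (label.equals("anchor")) {
                // Accumulate anchor text separately, space-delimited.
                anchors.computeIfAbsent(url, u -> new StringBuilder())
                       .append(data).append(" ");
            } else {
                docs.computeIfAbsent(url, u -> new LinkedHashMap<>()).put(label, data);
            }
        }
        Map<String, Map<String, String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : docs.entrySet()) {
            // No parsed text means the page is not in the collection: skip it.
            if (!e.getValue().containsKey("ptext")) continue;
            StringBuilder a = anchors.get(e.getKey());
            if (a != null) e.getValue().put("anchor", a.toString());
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> tuples = new ArrayList<>();
        tuples.add(new String[] {"http://a/", "ptext", "body of a"});
        tuples.add(new String[] {"http://a/", "anchor", "link to a"});
        tuples.add(new String[] {"http://b/", "anchor", "link to a page we never crawled"});
        System.out.println(assemble(tuples).keySet()); // prints [http://a/]
    }
}
```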

The main class that runs the job sets up the Mapper, Reducer, and the compressed SequenceFile:

   public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "p3l.MapReduceIndexer");
        
        job.setJarByClass(this.getClass());
        LOG.info("Jar is " + job.getJar());

        job.setMapperClass(MapWebDocToParseTuple.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(ParseTuple.class);

        job.setReducerClass(ReduceParseTuplesToDocBits.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DocBits.class);
                
        job.setInputFormatClass(ClueWebInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[1]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[2]));
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzoCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        
        // job.submit();
        return (job.waitForCompletion(true) ? 0 : 1);
    }

So at the end of this job, each Reducer has created a 26GB LZO-compressed SequenceFile ready to index:

[soboroff@node1 ~]$ hadoop fs -ls clue
Found 38 items
-rw-r--r--   3 soboroff hadoop           0 2011-02-08 14:09 /user/soboroff/clue/_SUCCESS
drwxrwxr-x   - soboroff hadoop           0 2011-02-07 15:38 /user/soboroff/clue/_logs
-rw-r--r--   3 soboroff hadoop 26327668533 2011-02-08 08:18 /user/soboroff/clue/part-r-00000
-rw-r--r--   3 soboroff hadoop 26339468375 2011-02-08 08:18 /user/soboroff/clue/part-r-00001
-rw-r--r--   3 soboroff hadoop 26330501868 2011-02-08 08:17 /user/soboroff/clue/part-r-00002
-rw-r--r--   3 soboroff hadoop 26366091781 2011-02-08 08:14 /user/soboroff/clue/part-r-00003
-rw-r--r--   3 soboroff hadoop 26321214072 2011-02-08 08:16 /user/soboroff/clue/part-r-00004
-rw-r--r--   3 soboroff hadoop 26349081494 2011-02-08 08:18 /user/soboroff/clue/part-r-00005
-rw-r--r--   3 soboroff hadoop 26343308176 2011-02-08 08:28 /user/soboroff/clue/part-r-00006
-rw-r--r--   3 soboroff hadoop 26352600382 2011-02-08 08:24 /user/soboroff/clue/part-r-00007
-rw-r--r--   3 soboroff hadoop 26321726649 2011-02-08 08:18 /user/soboroff/clue/part-r-00008
-rw-r--r--   3 soboroff hadoop 26339094476 2011-02-08 08:10 /user/soboroff/clue/part-r-00009
-rw-r--r--   3 soboroff hadoop 26321564045 2011-02-08 08:16 /user/soboroff/clue/part-r-00010
-rw-r--r--   3 soboroff hadoop 26343854112 2011-02-08 08:10 /user/soboroff/clue/part-r-00011
-rw-r--r--   3 soboroff hadoop 26315754762 2011-02-08 08:15 /user/soboroff/clue/part-r-00012
-rw-r--r--   3 soboroff hadoop 26346819081 2011-02-08 08:18 /user/soboroff/clue/part-r-00013
-rw-r--r--   3 soboroff hadoop 26364417290 2011-02-08 08:10 /user/soboroff/clue/part-r-00014
-rw-r--r--   3 soboroff hadoop 26345720864 2011-02-08 08:15 /user/soboroff/clue/part-r-00015
-rw-r--r--   3 soboroff hadoop 26325886676 2011-02-08 08:19 /user/soboroff/clue/part-r-00016
-rw-r--r--   3 soboroff hadoop 26352366823 2011-02-08 08:16 /user/soboroff/clue/part-r-00017
-rw-r--r--   3 soboroff hadoop 26363877289 2011-02-08 08:10 /user/soboroff/clue/part-r-00018
-rw-r--r--   3 soboroff hadoop 26346838673 2011-02-08 08:23 /user/soboroff/clue/part-r-00019
-rw-r--r--   3 soboroff hadoop 26334634232 2011-02-08 08:18 /user/soboroff/clue/part-r-00020
-rw-r--r--   3 soboroff hadoop 26338242486 2011-02-08 08:15 /user/soboroff/clue/part-r-00021
-rw-r--r--   3 soboroff hadoop 26333691832 2011-02-08 08:11 /user/soboroff/clue/part-r-00022
-rw-r--r--   3 soboroff hadoop 26351824723 2011-02-08 08:13 /user/soboroff/clue/part-r-00023
-rw-r--r--   3 soboroff hadoop 26340649075 2011-02-08 08:14 /user/soboroff/clue/part-r-00024
-rw-r--r--   3 soboroff hadoop 26348830107 2011-02-08 08:11 /user/soboroff/clue/part-r-00025
-rw-r--r--   3 soboroff hadoop 26341221269 2011-02-08 08:17 /user/soboroff/clue/part-r-00026
-rw-r--r--   3 soboroff hadoop 26318480301 2011-02-08 08:09 /user/soboroff/clue/part-r-00027
-rw-r--r--   3 soboroff hadoop 26314900021 2011-02-08 08:22 /user/soboroff/clue/part-r-00028
-rw-r--r--   3 soboroff hadoop 26335481459 2011-02-08 08:16 /user/soboroff/clue/part-r-00029
-rw-r--r--   3 soboroff hadoop 26329189695 2011-02-08 08:22 /user/soboroff/clue/part-r-00030
-rw-r--r--   3 soboroff hadoop 26347416304 2011-02-08 08:14 /user/soboroff/clue/part-r-00031
-rw-r--r--   3 soboroff hadoop 26340982688 2011-02-08 08:27 /user/soboroff/clue/part-r-00032
-rw-r--r--   3 soboroff hadoop 26327061702 2011-02-08 08:18 /user/soboroff/clue/part-r-00033
-rw-r--r--   3 soboroff hadoop 26331125728 2011-02-08 08:12 /user/soboroff/clue/part-r-00034
-rw-r--r--   3 soboroff hadoop 26328859269 2011-02-08 08:09 /user/soboroff/clue/part-r-00035

And lastly, I have a stand-alone application which uses a SequenceFile.Reader to read each DocBits and build it into a Lucene Document, which gets passed on to an IndexWriter.  This application doesn't run under MapReduce, so I don't worry about Lucene timeouts.  Also, I have the entire collection preparsed, so I can easily shove those bits into HBase or any other system I want to.

Benchmarking LZO compression in HBase

This is continued from my last post, Getting Clueweb Into HBase.  Comments on that post suggested trying LZO compression.  That required the code from Kevin Weil and Todd Lipcon that implements LZO compression for Hadoop and works with CDH3b3, which is what I'm running.  I won't cover configuring LZO with Hadoop and HBase, since that is well covered in the documentation on the github site.

I created a new table, 'webtable2', with the additional option COMPRESSION => 'lzo' for the 'content' column family.  That is, the webpage content will be compressed, but the mapping from document identifiers to URLs is left uncompressed.  There certainly isn't any reason not to compress the 'meta' family too, but at this point I primarily wanted to test fetching pages out by URL, and this is all in the 'content' family.

I reloaded all of ClueWeb09 into webtable2.  In contrast to my experience with the first load, loads took a consistent 3-4 hours per batch, which is probably attributable to having gone to 4GB regions, so far fewer region splits were taking place.  The result:

$ hadoop fs -du /hbase
Found 8 items
3409            hdfs://node1:9000/hbase/-ROOT-
35602480        hdfs://node1:9000/hbase/.META.
0               hdfs://node1:9000/hbase/.corrupt
3542474         hdfs://node1:9000/hbase/.logs
0               hdfs://node1:9000/hbase/.oldlogs
3               hdfs://node1:9000/hbase/hbase.version
14955092829826  hdfs://node1:9000/hbase/webtable
2808802017826   hdfs://node1:9000/hbase/webtable2

The uncompressed webtable takes up 14.9TB on HDFS to store 12.5TB of text and about 50GB of URL-id mapping (not bad overhead at all).  The LZO version, however, only takes up 2.8TB.  Already a good reason to consider using compression.  Since I have all my HDFS blocks replicated three times, this is significant storage savings!
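To put numbers on the savings, here is a small sketch that works from the two byte counts in the du listing above.  It assumes those counts are pre-replication sizes and uses the replication factor of three mentioned in the text; the class name CompressionRatioSketch is made up for the example.

```java
// Hypothetical helper: arithmetic on the two table sizes from `hadoop fs -du`.
class CompressionRatioSketch {
    static final long UNCOMPRESSED_BYTES = 14955092829826L; // /hbase/webtable
    static final long LZO_BYTES = 2808802017826L;           // /hbase/webtable2
    static final int REPLICATION = 3;                       // HDFS block replication

    // Fraction of the uncompressed size that the LZO table occupies.
    static double ratio() {
        return (double) LZO_BYTES / UNCOMPRESSED_BYTES;
    }

    // Raw disk bytes saved across the cluster once replication is counted.
    static long rawBytesSaved() {
        return (UNCOMPRESSED_BYTES - LZO_BYTES) * REPLICATION;
    }

    public static void main(String[] args) {
        System.out.printf("LZO table is %.1f%% of the uncompressed size%n", ratio() * 100);
        System.out.printf("Raw disk saved with replication: %.1f TB%n", rawBytesSaved() / 1e12);
    }
}
```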

I then wrote a very simple benchmark, where a single client process makes requests with URLs and waits to receive each page:

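import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class Bench {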
    public static void main(String args[]) throws Exception {
    if (args.length != 2) {
        System.out.println("Usage: Bench [table] [inputfile]");
        System.exit(-1);
    }
   
    Configuration config = HBaseConfiguration.create();
    HTable table = new HTable(config, args[0]);

    BufferedReader in = new BufferedReader(new FileReader(args[1]));
    String query = null;
    long c_count = 0;
    long m_count = 0;
    long start = System.currentTimeMillis();
    long last = start;

    while ((query = in.readLine()) != null) {
        try {
        if (query.startsWith("http://")) {
            query = Util.reverse_hostname(query);
        } else if (query.startsWith("clueweb")) {
            Get g = new Get(Bytes.toBytes(query));
            Result r = table.get(g);
            byte[] value = r.getValue(Bytes.toBytes("meta"),
                          Bytes.toBytes("url"));
            query = Bytes.toString(value);
            m_count++;
        }
       
        Get g = new Get(Bytes.toBytes(query));
        Result r = table.get(g);
        byte[] value = r.getValue(Bytes.toBytes("content"),
                      Bytes.toBytes("raw"));
        c_count++;

        if ((c_count % 10000) == 0) {
            long now = System.currentTimeMillis();
            double sec = (now - last) / 1000.0;
            double rate = sec / 10000.0;
            System.out.println("("+c_count+") 10,000 queries in " +
                       sec + "s (" + rate + " s/q)");
            last = System.currentTimeMillis();
        }
                      
        } catch (IOException e) {
            continue;
        }
    }

    long end = System.currentTimeMillis();
    System.out.println("Fetched " + c_count + " content records.");
    if (m_count > 0) {
        System.out.println("Fetched " + m_count + " meta records.");
    }
    System.out.println("Total time: " + (end - start) / 1000.0 + "s");
    System.out.println("Time per fetch: "
               + ((end - start) / ((c_count + m_count) * 1000.0))
               + "s");
    }
}

This class reads URLs from a file to fetch.  The benchmark isn't measuring peak query response rate, since to do that it would make more sense to have lots of clients asking simultaneously.  However, it does measure a reasonable rate of requests to see if response times are sufficient and that we don't leak memory or anything else.  (My usage scenario only has perhaps a few dozen users.)

I took a sample of 500,000 URLs from the collection to use for testing.  Before running, I shut down HBase (master and region server processes) and started them up again.  (I neglected to shut down HDFS as well; perhaps I should have.)  First, running against the uncompressed table and printing timing information after every 10k requests:

(10000) 10,000 queries in 2936.995s (0.2936995 s/q)
(20000) 10,000 queries in 2723.424s (0.2723424 s/q)
(30000) 10,000 queries in 1781.611s (0.17816110000000002 s/q)
(40000) 10,000 queries in 774.718s (0.0774718 s/q)
(50000) 10,000 queries in 844.356s (0.0844356 s/q)
(60000) 10,000 queries in 1300.254s (0.13002539999999999 s/q)
(70000) 10,000 queries in 1276.55s (0.127655 s/q)
(80000) 10,000 queries in 1261.369s (0.1261369 s/q)
(90000) 10,000 queries in 1207.213s (0.12072129999999999 s/q)
(100000) 10,000 queries in 1206.334s (0.1206334 s/q)
(110000) 10,000 queries in 1404.424s (0.1404424 s/q)
(120000) 10,000 queries in 1209.425s (0.1209425 s/q)
(130000) 10,000 queries in 1137.359s (0.11373589999999999 s/q)
(140000) 10,000 queries in 800.359s (0.08003590000000001 s/q)
(150000) 10,000 queries in 943.502s (0.0943502 s/q)
(160000) 10,000 queries in 724.696s (0.07246960000000001 s/q)
(170000) 10,000 queries in 975.958s (0.0975958 s/q)
...

The times seem to hover around 1/10s per query, which is perfectly reasonable for my usage scenario.  (If it weren't, the thing to do would be to spread the regions across more regionservers.  There is also a request bottleneck at the .META. table, which is an HBase limitation.)  There is also a lot of variation between batches, with the shortest batch shown averaging 0.072s/q and the longest 0.29s/q.  Times don't get consistently shorter through the run, so caching is not helping beyond a certain point... there are simply too many regions to cache efficiently.  System load on the cluster nodes was not an issue at all.

Now, here is timing of the same sequence of fetches (on a clean startup of HBase) against the compressed webtable:

(10000) 10,000 queries in 561.779s (0.0561779 s/q)
(20000) 10,000 queries in 484.254s (0.0484254 s/q)
(30000) 10,000 queries in 512.418s (0.051241800000000004 s/q)
(40000) 10,000 queries in 514.108s (0.05141079999999999 s/q)
(50000) 10,000 queries in 515.848s (0.05158479999999999 s/q)
(60000) 10,000 queries in 513.325s (0.0513325 s/q)
(70000) 10,000 queries in 456.037s (0.0456037 s/q)
(80000) 10,000 queries in 472.251s (0.0472251 s/q)
(90000) 10,000 queries in 459.83s (0.045982999999999996 s/q)
(100000) 10,000 queries in 488.919s (0.048891899999999995 s/q)
(110000) 10,000 queries in 485.988s (0.0485988 s/q)
(120000) 10,000 queries in 468.895s (0.0468895 s/q)
(130000) 10,000 queries in 485.07s (0.048507 s/q)
(140000) 10,000 queries in 488.118s (0.0488118 s/q)
(150000) 10,000 queries in 480.102s (0.048010199999999996 s/q)
(160000) 10,000 queries in 461.39s (0.046139 s/q)
(170000) 10,000 queries in 509.506s (0.0509506 s/q)
...

With compression, query batches complete at least twice as fast on average in the batches shown, and the times are much more consistent.  The compressed table takes about 20% of the space in HDFS (before replication) and provides much faster query response times.  Win!
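For anyone who wants to check the comparison, here is a short sketch that averages the per-batch times printed above and takes their ratio.  It only sees the 17 batches shown for each table, not the full 500,000-URL run, so treat the result as indicative.  The class and method names are invented for the example.

```java
// Hypothetical helper: compares the batch timings listed in the post.
class BatchSpeedupSketch {
    // Seconds per 10,000-query batch, uncompressed table (first 17 batches).
    static final double[] UNCOMPRESSED = {
        2936.995, 2723.424, 1781.611, 774.718, 844.356, 1300.254,
        1276.55, 1261.369, 1207.213, 1206.334, 1404.424, 1209.425,
        1137.359, 800.359, 943.502, 724.696, 975.958
    };
    // Seconds per 10,000-query batch, LZO-compressed table (first 17 batches).
    static final double[] COMPRESSED = {
        561.779, 484.254, 512.418, 514.108, 515.848, 513.325,
        456.037, 472.251, 459.83, 488.919, 485.988, 468.895,
        485.07, 488.118, 480.102, 461.39, 509.506
    };

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Range between the slowest and fastest batch, a crude consistency measure.
    static double spread(double[] xs) {
        double min = xs[0], max = xs[0];
        for (double x : xs) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        return max - min;
    }

    // How many times faster the compressed table's average batch is.
    static double speedup() {
        return mean(UNCOMPRESSED) / mean(COMPRESSED);
    }

    public static void main(String[] args) {
        System.out.printf("mean batch time: %.1fs vs %.1fs%n", mean(UNCOMPRESSED), mean(COMPRESSED));
        System.out.printf("spread: %.1fs vs %.1fs%n", spread(UNCOMPRESSED), spread(COMPRESSED));
        System.out.printf("speedup: %.2fx%n", speedup());
    }
}
```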

Getting Clueweb into HBase

I have a simple webtable in HBase to hold the Clueweb09 collection.  The English portion of Clueweb09 is around 500 million web pages or 12.5TB of data.  I recommend reading the Clueweb09 documentation for details; in TREC, we are using it in several tracks focusing on different aspects of web search.

I wanted to put Clueweb into HBase to make it easy to fetch individual web pages from the collection and show them to a user, who then analyzes the web page and determines if it is relevant to some search topic.  We have an existing method for this, but it doesn't scale to large collections.  My simple webtable has the following structure:

hbase(main):002:0> create 'webtable', {NAME => 'content', BLOCKSIZE => '1048576'}, 'meta'

That is, two column families: 'content' and 'meta'.  Content maps a url to the web page content.  Meta is for general metadata, but at the moment it just maps the internal Clueweb document identifiers to the corresponding url.  Using this structure, I can retrieve documents either by url or by document identifier, and support fetching individual documents as well as browsing within the collection.  The content family has a larger blocksize (1MB), but otherwise there are no changes from the stock table settings.

My cluster is fairly small in terms of cores and memory, but large on storage.  I have 14 physical nodes, each with 8 cores, 8GB of RAM, and 12.5TB of storage disk in seven spindles.  The first node is the NameNode, JobTracker, and HBase master, and has its storage striped into a RAID-5.  The second node mirrors the namenode storage, and also acts as the SecondaryNameNode.  The remaining twelve nodes keep the seven data disks separate, and each runs a DataNode, TaskTracker, and RegionServer.  I'm running Cloudera's CDH3 beta (737) on top of CentOS-5.

For Hadoop's configuration, I have HDFS replicating each block to three locations.  The processes on the NN get more heap, but on the workers, heap is limited to 1GB per process.  I use the FairScheduler and allow 3 mappers and 3 reducers to run on each host. 

For HBase, I started with a region filesize of 2GB.  I based this on estimating that I wanted around 500 regions per node once all the data was loaded, and 500 * 2GB * 12 equals around 12.5TB.  Later on as I was loading data, I found that I was getting more than 700 regions per node and timeouts during put calls, so I bumped the region max to 4GB, and added an hbase.client.pause of 5000 (the value is in milliseconds, so 5 seconds).
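The region estimate is simple enough to redo for other region sizes.  Here is a back-of-the-envelope sketch, assuming decimal units (1TB = 1000GB), data spread evenly over the region servers, and no storage overhead, which is why it comes out lower than the 700-plus regions per node I actually saw.  The class and method names are hypothetical.

```java
// Hypothetical helper for estimating regions per region server.
class RegionSizingSketch {
    // dataTb: total table size in TB; regionGb: max region size in GB;
    // nodes: number of region servers. Assumes 1 TB = 1000 GB and even spread.
    static long regionsPerNode(double dataTb, double regionGb, int nodes) {
        double dataGb = dataTb * 1000.0;
        return Math.round(dataGb / (regionGb * nodes));
    }

    public static void main(String[] args) {
        System.out.println("2GB regions: " + regionsPerNode(12.5, 2.0, 12) + " per node");
        System.out.println("4GB regions: " + regionsPerNode(12.5, 4.0, 12) + " per node");
    }
}
```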

I split the collection into ten pieces, so I could load it a piece at a time and start over when I needed to.  At the beginning, I loaded 1.25TB of text in 4 hours.  As more and more data was loaded, this crept up to 8 hours.  I think the time would have been kept lower overall if I'd started with 4GB regions, rather than loading nearly all the data with 2GB regions and bumping it up near the end.

Now, some code.  Below I'm including snippets; you can find the full source code at https://github.com/isoboroff/clueweb-hbase.  First is an input format for reading the ClueWeb documents.  It has some helpful machinery for traversing directories, but its basic job is to read a single WARC entry at a time and return it as a String.


public class ClueWebInputFormat extends FileInputFormat<LongWritable, Text> {

    public static final Log LOG =
        LogFactory.getLog(ClueWebInputFormat.class);

    @Override
    public boolean isSplitable(JobContext job, Path filename) {
        return false;
    }
   
    @Override
    public RecordReader<LongWritable, Text>
        createRecordReader(InputSplit split, TaskAttemptContext context)
        throws IOException {
        ClueWebRecordReader rr = new ClueWebRecordReader();
        rr.initialize(split, context);
        return rr;
    }

    public static class ClueWebRecordReader
        extends RecordReader<LongWritable, Text> {
        private long start;
        private long end;
        private long pos;
        private Path path;
        private LineRecordReader in;
        private LongWritable cur_key = null;
        private Text cur_val = null;

        public ClueWebRecordReader() {
        }
           

        public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
            cur_key = new LongWritable(0);
            try {
                if (split instanceof FileSplit) {
                    path = ((FileSplit)split).getPath();
                } else {
                    path = new Path("");
                }
                start = 0;
                pos = 0;
                end = split.getLength();
                in = new LineRecordReader();
                in.initialize(split, context);
            } catch (InterruptedException ie) {
                throw new IOException(ie);
            }
        }
   
        public LongWritable getCurrentKey() {
            return cur_key;
        }
        public Text getCurrentValue() {
            return cur_val;
        }

        private Text hold = null;
        private long last_pos = 0;
        StringBuilder buf = null;

        public synchronized boolean nextKeyValue()
            throws IOException {
            Text line = null;
            cur_val = new Text();
            boolean in_doc = false;

            if (buf == null)
                buf = new StringBuilder();

            if (hold != null) {
                buf.append(hold.toString()).append("\n");
                hold = null;
                in_doc = true;
            }
           
            try {
                while (in.nextKeyValue()) {
                    line = in.getCurrentValue();
                    int size = line.getLength();
                    last_pos = pos;
                    pos += size;

                    if (line.find("WARC/0.18") == 0) {
                        if (in_doc) {
                            in_doc = false;
                            // Copy the line: LineRecordReader reuses its Text.
                            hold = new Text(line);
                            break;
                        } else {
                            in_doc = true;
                            continue;
                        }
                    }

                    if (in_doc)
                        buf.append(line.toString()).append("\n");
                }
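            } catch (java.io.IOException e) {
                // Treat a read error as end of input and emit what is buffered.
                LOG.warn("Read error in " + path, e);
            }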

            if (buf.length() > 0) {
                cur_val.set(buf.toString());
                cur_key.set(cur_key.get() + 1);
                buf = null;
                return true;
            } else {
                LOG.info("nkv returning false");
                return false;
            }
        }

        public float getProgress() {
            return Math.min(1.0f, (pos) / (float)(end));
        }

        public synchronized void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }
}
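Stripped of the Hadoop plumbing, the record reader's job is to cut a stream of lines into records wherever a line starts with the WARC/0.18 version marker.  Here is a minimal sketch of that splitting over an in-memory list of lines.  It is a simplification rather than the reader above: it keeps the marker line at the top of every record, whereas the reader drops it for the first record in a file.  The class name WarcSplitSketch is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo of splitting lines into WARC records.
class WarcSplitSketch {
    static final String MARKER = "WARC/0.18";

    static List<String> split(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder buf = null; // null until the first marker is seen
        for (String line : lines) {
            if (line.startsWith(MARKER)) {
                // A new record starts here; flush the previous one, if any.
                if (buf != null) records.add(buf.toString());
                buf = new StringBuilder();
            }
            // Lines before the first marker are ignored.
            if (buf != null) buf.append(line).append('\n');
        }
        if (buf != null) records.add(buf.toString());
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "WARC/0.18", "WARC-Type: response", "", "first body",
            "WARC/0.18", "WARC-Type: response", "", "second body");
        System.out.println(split(lines).size()); // prints 2
    }
}
```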

Next is the MapReduce class to load the data.  This is adapted from example code that comes with HBase.


public class LoadClue {

    private static final String NAME = "LoadClue";
 
    private static String reverse_hostname(String uri) {
        URL url = null;
        try {
            url = new URL(uri);
        } catch (MalformedURLException mue) {
            return null;
        }
        String host = url.getHost();
        StringBuilder newhost = new StringBuilder();
        String[] parts = host.split("\\.", 0);
        for (int i = parts.length - 1; i > 0; i--) {
            newhost.append(parts[i]).append(".");
        }
        newhost.append(parts[0]);
        int port = url.getPort();
        if (port != -1)
            newhost.append(":").append(port);
        newhost.append(url.getFile());
        return newhost.toString();
    }

    private static HashMap<String, String> get_headers(String doc) {
        HashMap<String, String> hdr = new HashMap<String, String>(20);
        try {
            BufferedReader in = new BufferedReader(new StringReader(doc));
            int nl = 0;
            String line = null;
            while ((line = in.readLine()) != null) {
                if (line.length() == 0)
                    nl++;
                if (nl == 2)
                    break;
                int i = line.indexOf(':');
                if (i == -1)
                    continue;
                try {
                    hdr.put(line.substring(0, i), line.substring(i+2));
                } catch (Exception e) {}
            }
            StringBuilder buf = new StringBuilder();
            while ((line = in.readLine()) != null) {
                buf.append(line).append('\n');
            }
            hdr.put("document", buf.toString());
        } catch (IOException e) {}
        return hdr;
    }

    protected static String table_name = null;

    protected static void setTableName(String n) {
        table_name = n;
    }

    static class Uploader
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        private long checkpoint = 1000;
        private long count = 0;
   
        @Override
            public void map(LongWritable key, Text value, Context context)
            throws IOException {

            String raw = value.toString();
            HashMap<String, String> parse = get_headers(raw);

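            // Compare from the constant so a record with no WARC-Type header
            // is skipped instead of throwing a NullPointerException.
            if ("response".equals(parse.get("WARC-Type"))) {
                String uri = parse.get("WARC-Target-URI");
                if (uri == null) {
                    System.err.println("Doc has no target-uri");
                    return;
                }
                String keystr = reverse_hostname(uri);
                if (keystr == null) {
                    // reverse_hostname returns null for a malformed URL.
                    System.err.println("Malformed target-uri: " + uri);
                    return;
                }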

                byte[] row = Bytes.toBytes(keystr);
                byte[] family = Bytes.toBytes("content");
                byte[] qualifier = Bytes.toBytes("raw");
                byte[] val = Bytes.toBytes(parse.get("document"));

                // Create Put
                Put put = new Put(row);
                put.add(family, qualifier, val);
     
                // Disable the WAL for this Put.  This improves load
                // performance, but means you will experience data loss in
                // the case of a RegionServer crash.
                put.setWriteToWAL(false);
               
                String trecid = parse.get("WARC-TREC-ID");
                byte[] row2 = null;
                Put put2 = null;
                if (trecid != null) {
                    row2 = Bytes.toBytes(parse.get("WARC-TREC-ID"));
                    put2 = new Put(row2);
                    byte[] fam2 = Bytes.toBytes("meta");
                    byte[] qual2 = Bytes.toBytes("url");
                    byte[] val2 = row;
                    put2.add(fam2, qual2, val2);
                    put2.setWriteToWAL(false);
                }

                try {
                    context.write(new ImmutableBytesWritable(row), put);
                    if (trecid != null)
                        context.write(new ImmutableBytesWritable(row2), put2);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
               
                // Set status every checkpoint lines
                if(++count % checkpoint == 0) {
                    context.setStatus("Emitting doc " + count);
                }
            }
        }

        @Override
            public void cleanup(Context context)
            throws IOException {
            if (LoadClue.table_name == null)
                return;
            context.setStatus("Sending flush");
            HBaseAdmin admin = new HBaseAdmin(context.getConfiguration());
            admin.flush(LoadClue.table_name);
        }
    }
 
    /**
     * Job configuration.
     */
    public static Job configureJob(Configuration conf, String [] args)
        throws IOException {
        Path inputPath = new Path(args[0]);
        String tableName = args[1];
        Job job = new Job(conf, NAME + "_" + tableName);
        job.setJarByClass(Uploader.class);
        FileInputFormat.setInputPaths(job, inputPath);
        job.setInputFormatClass(ClueWebInputFormat.class);
        job.setMapperClass(Uploader.class);
        LoadClue.setTableName(tableName);

        // No reducers.  Just write straight to table.  Call initTableReducerJob
        // because it sets up the TableOutputFormat.
        TableMapReduceUtil.initTableReducerJob(tableName, null, job);
        TableMapReduceUtil.addDependencyJars(conf, TableOutputFormat.class);
        job.setNumReduceTasks(0);
        return job;
    }

    /**
     * Main entry point.
     *
     * @param args  The command line parameters.
     * @throws Exception When running the job fails.
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if(otherArgs.length != 2) {
            System.err.println("Wrong number of arguments: " + otherArgs.length);
            System.err.println("Usage: " + NAME + " <input> <tablename>");
            System.exit(-1);
        }
        Job job = configureJob(conf, otherArgs);
        boolean result = job.waitForCompletion(true);
        System.out.println("Flushing table " + otherArgs[1]);
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.flush(otherArgs[1]);
        System.exit(result ? 0 : 1);
    }
}

The ladder of user tracking and privacy

Privacy and behavior tracking by websites and web-connected products has long been a concern of privacy advocates and some in the technical sphere.  Recent events prodded me to compose this post to assemble my thoughts on the matter.  Those events include a decision by the courts that hosted email is protected under the Fourth Amendment; a story on NPR that ebook readers collect and transmit usage data; the emergence of Diaspora, a Facebook alternative, into alpha testing; and concerns over electronic voting.  These thoughts are mine and do not represent my employer in any way.

As active participants in a digital society, we are the key ingredient in online services and connected products, providing the entire revenue stream through our behavior.  However, we otherwise cannot participate in the market created by digital behavior tracking.  We have little awareness of what data is collected and no control over how that data is used.  This is because the company or entity doing the tracking owns that data.  I'd like to propose that since we ourselves create that data, we should be able to participate in the process whereby that data is used.

Currently, privacy and tracking are modulated through privacy policies.  Such a policy aims to inform you, the user, about what data is collected and how it might be used.  Privacy policies have several problems.  They are written by lawyers, and as such may be hard for users to understand.  The policy provides the user no leverage aside from deciding not to use the service.  The user cannot determine whether the policy is being honored.  The policy may be vague about what is actually collected and how it may actually be used or sold, and there is no avenue for a user to learn more.  If the policy is violated and the user learns about it, avenues for redress are few, expensive, and largely untested.

What is the cost of a privacy breach?  If it results in actual identity theft, then actual damages may be calculable.  However, there may be other breaches and other costs whose damage may be harder to assess.  On the other side, how much is your behavior worth?  Personal information has definite value both in isolation and in aggregate, to the user, to the collector, and to third parties.
 
I propose the following 8-step ladder of user tracking.  At each step of the ladder is a question for you to answer regarding an online service such as Facebook or a connected product such as a Kindle.  At the point of the ladder where the answer is "no" or "I don't know", you stop.

  1. Do you know that the site/company/product tracks your use?
  2. Can you determine when tracking is occurring?
  3. Do you know what activities are tracked?
  4. Do you know how your tracked usage is being used by the site/company/product?
  5. Do you know how your usage is used by others?
  6. Can you obtain your usage data from the site/company/product?
  7. Can you dictate whether or not your usage data is used?
  8. Can you license how your usage data is used?

Something I've specifically not included on the ladder is knowing how your tracked data is stored, and how secure that storage is.  I don't think this should even be on the ladder.  If a company holds my personal information but in an insecure environment, they should be fully liable for damage to me by their negligence.  If a company hews to best practices in information security, and is nonetheless hacked and my personal data stolen, that sounds to me like a risk that companies can insure against.  But secure storage of personal information is too fundamental to be placed on my ladder.

Level 1 is basic, and you might assume the answer is "yes", but rather than assuming, take time to see if you can figure out what's happening.  Your web browser should let you look through the cookies stored in it, so you can see if the site has placed any there.  If so, you should assume at least your presence on that site is being tracked by that site.  For devices such as a Kindle, a good approach is to connect over wifi, and use your home router to watch if it calls home, and how often.

Level 2, "Can you determine when tracking is occurring", might become apparent during investigation of level 1.  For example, if you find a cookie for a site, you should know that this cookie is sent to that host with every HTTP request to that host.  Additionally, once a site is loaded, JavaScript code in the webpage can make HTTP requests behind the scenes, so the cookie may be read much more often than you click on things.  The geeks among us can enable certain things in the operating system or on our routers to watch these accesses, but most firewall tools aimed towards "regular users" are so intrusive they're hard to leverage for this problem.

Level 3 is "Do you know what activities are tracked".  A privacy policy may give some detail on this, but you can't verify that against the actual application; usually the policy is written broadly enough that it doesn't have to be modified for every new application feature.  When a site tries to connect to your Facebook account, it makes a claim to you as to what information that application will use; for example, if I enable Posterous' FB autoposting, Posterous can not only update my status when I post to my blog, it apparently gets permission to access my FB account even at other times.  Again, the claim is not verifiable, but it's nice that FB does this.

Above level 3 we are into territory to which no current online service or connected device can lay claim.  A level 4 service would tell us how our usage and behavior data are being used.  I don't mean in a nebulous "improving the product experience" way, but actually report on when usage data is used or mined and for what purpose.

Stop and imagine that for a moment.  Imagine that you could actually know specifically how your tracked behaviors were being used.  You might decide that you approve of that usage, that you consider the outcome to be to your benefit as a customer.  Or you might not.  Right now, you just don't know, and that's the primary concern of privacy advocates.

Level 5 takes us beyond the site of concern, to other sites that our data is either shared with or sold to.  At present, we don't know if that data is used in aggregate or in a personally-identifiable way.  A privacy policy typically allows the site to share or sell usage data, but when that happens, we now have even less connection to that data.

Level 6 asks if we can obtain our tracked data from the site in question.  This idea is somewhat similar to how Facebook allows you to download your profile, posts, and friend connections, but specifically refers to tracked usage data.  This idea, to me, actually seems the most straightforward notion that might exist in the debate on online privacy -- that you, the user, should at least know what the company or site or product knows about you.

Level 7 asks, "Can you dictate whether or not your usage data is used."  Some might rather place this lower on the ladder, but I think that's counterproductive for all participants.  If you don't know what's being collected and tracked, you can't make an informed choice about whether to participate in that.  Moreover, the current implementation from the user's perspective requires high levels of vigilance and might not be reliable.  However, if we arrive at level 7 after the first 6 levels, then we can have a constructive discussion about whether or not our personal data is used.

Level 8 is where I began thinking about this problem.  Usage data exists in a very active market, but we, the creators of that data, can't participate in that market.  Imagine that you own your behavioral data; you created it, so by some notion along the lines of copyright, you should in some fashion be considered its owner.  "Wait," you say, "we enact those behaviors within a system or with a device, so really it should be joint ownership, right?"  Hey, that would be great.  If I and Facebook own my data in partnership, we now have a whole new framework for discussing how it should be used.  And that framework is licensing.

Perhaps you're an open-source kind of fellow.  You might decide to license your behavior along a Creative-Commons style agreement, which protects your ownership and still requires that you have access through all levels of the ladder to your own data.  Or perhaps like most of us, you'd like to choose who to share your information with.  With licensing, that framework exists.

Me, I'm happy to share my personal information and behavior, provided I have a share in its income.  Think micropayments.  Mechanical Turk, turned on its head.  Come on, Google, let's make a deal.

A simple gardenhose catcher

I wanted a simple script to pull tweets from Twitter's gardenhose feed (the "sample API", as it's now called) and dump them into files, but couldn't find exactly what I needed online.  Below is a tweak of the "spritz.py" example from the twitstream Python library.  Posting in case someone else finds it useful.


#!/usr/bin/env python26

# The key module provided here:
import twitstream
from time import strftime, localtime
import json

# Provide documentation:
USAGE = """%prog [options]

Show a real-time subset of all twitter statuses."""

# Define a function/callable called on every status:
class Store(object):
    def __init__(self):
        t = localtime()
        self.cur_hr = t.tm_hour
        self.cur_out = open(strftime("output.%d-%m-%y.%H", t), "w")

    def __call__(self, status):
        t = localtime()
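        if t.tm_hour != self.cur_hr:
            # The hour rolled over: close the old file and open a new one,
            # named from the same timestamp that triggered the rollover.
            if not self.cur_out.closed:
                self.cur_out.close()
            self.cur_hr = t.tm_hour
            self.cur_out = open(strftime("output.%d-%m-%y.%H", t), "w")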
       
        json.dump(status, self.cur_out)
        self.cur_out.write("\n")

if __name__ == '__main__':
    # Inherit the built in parser and use it to get credentials:
    parser = twitstream.parser
    parser.usage = USAGE
    (options, args) = parser.parse_args()
    twitstream.ensure_credentials(options)

    cb = Store()
   
    # Call a specific API method in the twitstream module:
    stream = twitstream.spritzer(options.username, options.password, cb,
                                 debug=options.debug, engine=options.engine)
   
    # Loop forever on the streaming call:
    try:
        stream.run()
    finally:
        cb.cur_out.close()
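
Reading the files back is the other half of the job.  Here is a minimal sketch, assuming the files are exactly what the script above writes (one JSON object per line, named output.DD-MM-YY.HH).  The function name read_statuses and the "text" check are my own illustration, not part of twitstream.

```python
import glob
import json


def read_statuses(pattern="output.*"):
    """Yield one decoded status per non-empty line of every matching file."""
    # Note: sorting the names is NOT chronological, because the file names
    # put the day before the month and year.
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    # Assumption: only some records carry a "text" field, so count those.
    tweets = sum(1 for status in read_statuses() if "text" in status)
    print("statuses with text: %d" % tweets)
```

Because it is a generator, this streams through the files without holding a whole hour of tweets in memory.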

xfs_repair and immense filesystems

Just a quickie here.  If you have a ginormous XFS filesystem, xfs_check may report "out of memory" and xfs_repair may churn with 100% CPU during phase 6 ("traversing filesystem").

The answer is to not use xfs_check, but instead do "xfs_repair -n".  Also, add the "-P -o bhash=1024" options.  Then keep an eye on xfs_repair's %CPU in top(1).  If it starts churning at 99% or 100%, then kill it and bump the bhash value larger.

I had one filesystem where I had to boost bhash to 10240 and ihash to 4096 before it would make it through the scan.

Caveat -- this was on a NAS box running a variant of RHEL4.  I was using xfs_repair v2.9.8, which was the latest I could find without cooking up a complete RHEL4 build environment.  It's possible the XFS folks have fixed this one in a more recent version.

Victor Wooten and JD Blair at the Birchmere, 13 October 2010

I went to this show last night with my father.  In a word, the performance was astounding.  The set opened with a jam between Victor and JD, then moved into "Me and My Bass Guitar".  Several of Victor's children (ages 7, 9, and 12) performed with him on vocals and drums for a few songs, including an India.Arie cover and an original song by his daughter.  Also appearing were bassist and local teacher extraordinaire Anthony Wellington, and his student Cole Sipe (somebody please correct my spelling if I mangled it!).  Vic played an extended solo set including Amazing Grace, Norwegian Wood, The Lesson, and a lot of incredible loop pedal work.  The closer was "U Can't Hold No Groove (If You Ain't Got No Pocket)".

Victor also shared lots of thoughts on music education, music as a language, and relationships between people of all types and backgrounds.  He is always worth listening to.  A good (and free) start is his visit to NPR's Talk of the Nation two days ago.  I am always energized and inspired when I listen to him play (especially live!).

Thanks, Victor (and all).

Storing and processing Big Data

I was asked, as part of a symposium at the Library of Congress, to speak briefly on the subject of requirements for storing and processing big data, including social media and web datasets.  This post is meant to accompany that talk as a list of references and links.  Some of the thinking here is influenced by the inaugural Dataists post on the topic by Hilary Mason and Chris Wiggins, which itself reminds me a lot of Kernighan and Plauger's book Software Tools.

(Disclaimer: in this post I will mention many software products and some companies.  These mentions do not imply any endorsement of those companies or products, but are meant to be illustrative of the state of the art and common usage in the field.)

What is "big data", anyway?  The generations of yore (like, 20 years ago) measured it in gigabytes.  In 1991, TREC brought the information retrieval community (the researchers who studied what would come to be called "search engines") from working with megabytes to a two gigabyte text corpus.  This caused tremendous engineering havoc in the research world; we might assume that industrial groups of the time already worked with data an order of magnitude or more larger.  Today's largest web collection made generally available to researchers (at cost! thanks to NSF), ClueWeb09, is 25 terabytes of raw web pages, approximately equivalent to the top tier of a commercial web search engine.  Social media datasets, such as the Twitter archive to be housed at LOC, exist on different scales; the text in such a collection might be hundreds of gigabytes or a few terabytes, but occupy a graph structure of billions of nodes.

So the answer is, "big data" is data that's bigger than what you can comfortably store and process right now.  As a comparative, if you've got it, it probably isn't big anymore ("My data is bigger than your data.")  Or, perhaps more realistically, "You thought that was big, wait to see what's coming down the pipe!"

I won't divide up into sections on storage and processing, because they depend on each other... if you store things in a relational database, that implies certain kinds of processing; if you first think of streaming over data, that implies certain kinds of storage.  My focus is also on what can be done with commodity hardware or cloud resources... I'll try to be up-front about infrastructure costs when I can.  I'm also cheap, so I tend to favor free solutions; the good news in 2010 is that free is as good as or better than anything you can pay for, and the next best thing is pretty darn cheap.

You'd be alarmed at what a modern desktop computer can handle.  A terabyte of hard disk in 2010 costs about $90, plus another $20 if you want it to fit in your laptop.  Two-core processors are now previous-generation; four and eight cores on the desktop are more and more common.  2 GB of RAM is too small for Windows these days, so 4 and 8 GB are becoming more common.  You can run Windows, Linux, or Mac OS on essentially equivalent hardware, so you can pick the platform you like best.  My experience is in the Unix world, so my perspective is going to center primarily on Linux, Mac OS, or Windows with Cygwin, but that's not a functional requirement of anything here.  This roughly plain-vanilla desktop computer will cost you around $1500-2000 and can handle a terabyte or three of data.

"Handle" in the context of this article means that you can process the data, slice, dice, and explore it, probably not at interactive speeds, but that you're willing to wait a few minutes or an hour for results.  In the commercial world, the issue is scaling up to serving results in real-time while the data keeps flooding in... I'm not going there right now.

In the Unix world, the best tools come for free; see the aforementioned Software Tools book for philosophical perspective (but not necessarily freedom).  For the neophyte I recommend Think Unix by Jon Lasser.  Using basic Unix tools, you can

  • reformat your data (sed, awk, perl, tr, and lots more)
  • search your data (grep and others)
  • cut it into fields (cut, awk)
  • transform it in any number of ways (you name it!)
  • count occurrences (uniq, nl, grep)
  • sort data (sort)

and do all kinds of exploratory data analysis.  These tools work on plain text files, which is a great simple file format.  They work in a stream fashion, so you can pipe them together to make complex operations.  (My favorite is using "sort | uniq -c | sort -n" to count repeat occurrences, then produce something like a histogram.)  On modern Unixes (i.e., anything you're likely to encounter now), those big text files get cached as you read them and so you can actually scream through data pretty quickly.  This works great for exploring data and trying out ideas without writing or buying a single piece of software, but just using the screwdrivers and wrenches that came with the operating system.
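
If you would rather do that counting step inside a script, here is a rough Python equivalent of "sort | uniq -c | sort -n".  It is only a sketch; the function names and the sample data are made up for illustration.

```python
from collections import Counter


def count_lines(lines):
    """Count how often each line occurs, like `sort | uniq -c`."""
    return Counter(line.rstrip("\n") for line in lines)


def histogram(counts):
    """Return (count, line) pairs in ascending count order, like `sort -n`."""
    return sorted((n, line) for line, n in counts.items())


if __name__ == "__main__":
    sample = ["b\n", "a\n", "b\n", "c\n", "b\n", "a\n"]
    for n, line in histogram(count_lines(sample)):
        print("%7d %s" % (n, line))
```

Unlike the pipeline, this keeps every distinct line in memory, so it suits data with many repeats better than data where nearly every line is unique.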

As an example of this usage, I help run a large evaluation in experimental web search.  As part of this, folks send me large ranked lists of documents from the ClueWeb collection, and I show those web pages to people who decide what's relevant and what's not.  When I get the lists, the documents are keyed by an identifier, but I like to show my users the URL as well since that can be a helpful criterion.  I have a master list of all 1 billion URLs in ClueWeb with their identifiers, a 50GB text file.  Using just the tools I mentioned above, I can join the identifier lists against the master URL list, and produce my URL lists for my users to review, in a few minutes on my desktop Mac.
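
The same join can be sketched in a few lines of Python.  This is not the pipeline I actually run; it assumes each master-list line looks like "identifier, whitespace, URL", and that the identifiers you want fit in memory while the big file is streamed once.  The identifiers and URLs in the demo are invented.

```python
def join_urls(wanted_ids, master_lines):
    """Yield (docid, url) for each master-list line whose docid is wanted."""
    wanted = set(wanted_ids)
    for line in master_lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2 and parts[0] in wanted:
            yield parts[0], parts[1].strip()


if __name__ == "__main__":
    master = [
        "doc-0001 http://example.com/a\n",
        "doc-0002 http://example.com/b\n",
        "doc-0003 http://example.com/c\n",
    ]
    for docid, url in join_urls(["doc-0003", "doc-0001"], master):
        print("%s %s" % (docid, url))
```

In real use master_lines would be the open 50GB file itself, so only the small set of wanted identifiers lives in memory.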

Many of you would point out that this is an obvious job for a relational database, and I might even agree except I'm not such a database person.  However, I could put both the master URL list and the ranked lists I receive into a database and get much the same result from a simple query or two.  Databases seem alarming at first but in reality they're a simple way to store and process data, provided you have a good idea about how to store it from the beginning.  The good news is that you can make bad layout decisions and the database will humor you until you have simply too much data.  As with Unix tools, there are very good free tools, such as MySQL and PostgreSQL.  There are also expensive commercial solutions, but again, remember the goal here is to explore big data and be a researcher, not to field a commercial realtime solution.  MySQL and PostgreSQL will run just fine on your standard desktop machine and won't cost you a dime.
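
For a taste of the database version without installing a server, here is a sketch using SQLite, which ships with Python.  I'm swapping it in for MySQL or PostgreSQL purely for convenience; the table names, column names, and rows are all hypothetical.

```python
import sqlite3


def load(conn, urls, ranked):
    """Create the two tables and fill them from lists of tuples."""
    conn.execute("CREATE TABLE urls (docid TEXT PRIMARY KEY, url TEXT)")
    conn.execute("CREATE TABLE ranked (run TEXT, pos INTEGER, docid TEXT)")
    conn.executemany("INSERT INTO urls VALUES (?, ?)", urls)
    conn.executemany("INSERT INTO ranked VALUES (?, ?, ?)", ranked)


def urls_for_run(conn, run):
    """Join one ranked list against the master URL table, in rank order."""
    query = """
        SELECT r.pos, r.docid, u.url
        FROM ranked AS r JOIN urls AS u ON u.docid = r.docid
        WHERE r.run = ?
        ORDER BY r.pos
    """
    return conn.execute(query, (run,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn,
         [("doc-0001", "http://example.com/a"),
          ("doc-0002", "http://example.com/b")],
         [("runA", 1, "doc-0002"), ("runA", 2, "doc-0001")])
    for row in urls_for_run(conn, "runA"):
        print(row)
```

The primary key on docid is the "good idea about how to store it" part: it gives the join an index to use instead of scanning the whole URL table.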

Another common idiom for processing big data is to use scripting languages like Python or Perl.  Scripting languages are called that because they didn't use to be compiled; nowadays everything is compiled just-in-time.  The key point is that these languages support rapidly turning a good, complicated idea into a short, fast, reusable computer program.  If writing computer programs drove you nuts 10 years ago, come back and give Python a try.  Again, these tools are free and run on any operating system.

Since I started out saying that if you can hold it in your hand, it isn't big data anymore, I will spend some time talking about how to scale up a storage and processing infrastructure.  Databases do this well for certain kinds of data, but this can be complicated and costly.  The hot idea nowadays is "cloud computing".  Without getting into a buzzword battle, let's call a bunch of equivalent, interchangeable, and anonymous computers working together a cloud.  (Just like a weather cloud is a bunch of equivalent, interchangeable, anonymous droplets of water vapor, working together).  These days you can make your own cloud or rent one.

I built my own little cloud last year, so I can talk about how this is done.  Each individual computer costs around $1500 and has 8 cores and 8GB of memory.  It also has 8 slots for disks; 8 1.5TB disks cost around $800, so the total cost per computer is $2300.  In cloud parlance, each computer is called a "node".  There is also a certain amount of infrastructure in the form of a computer rack, a network switch, and two uninterruptible power supplies.  A single rack holds 15 nodes, or 15 * 8 * 1.5 = 180 TB of raw disk.
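
The arithmetic is easy to replay and tweak for your own prices.  This sketch uses only the figures quoted above; it leaves out the rack, switch, and power supplies because I haven't given prices for them, so the per-terabyte figure is a lower bound.

```python
NODE_BASE_COST = 1500   # dollars: one computer with 8 cores and 8GB RAM
DISK_SET_COST = 800     # dollars: eight 1.5TB disks
DISKS_PER_NODE = 8
DISK_TB = 1.5
NODES_PER_RACK = 15

node_cost = NODE_BASE_COST + DISK_SET_COST            # 2300
rack_raw_tb = NODES_PER_RACK * DISKS_PER_NODE * DISK_TB  # 180.0
rack_node_cost = NODES_PER_RACK * node_cost           # 34500
cost_per_raw_tb = rack_node_cost / rack_raw_tb        # a bit under $192

print("per node: $%d" % node_cost)
print("raw disk per rack: %.0f TB" % rack_raw_tb)
print("nodes per rack: $%d, or $%.2f per raw TB" % (rack_node_cost, cost_per_raw_tb))
```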

The key point of cloud computing, in the context of this article, is to keep your data close to the computer doing the work on it.  If your agency or company has a central file store, you are probably aware that your computer has to copy files from the server before you can work on them.  In our little cloud, the idea is that each CPU core is going to work on the data that is sitting on the disks on that machine, and avoid copying things as much as possible.  This way, all the cores can work at the same time without waiting for each other.  This is why my design above has one disk per core per GB of RAM.  (A better system would have 2-4GB of RAM per core; I'm waiting for RAM prices to fall.)

The common software for computing on a cloud like this is Hadoop.  Hadoop includes two key components: a storage infrastructure for making all those disks look like a single disk, and making it reliable; and a programming paradigm called Map-Reduce.  Map-Reduce is a style of programming that maximizes letting all those cores work on a part of the problem, then merging the results together easily.  In the Map step, each core gets a piece of the data and can transform it into one or more other pieces of data.  After the Map step, all the data on all the machines gets sorted in parallel so that the data is in order.  In the Reduce step, the sorted data are merged together.  This paradigm is extremely flexible and a lot of problems are easy to structure this way.

One example is indexing a large web crawl.  If you want to search the web, you need to note down all the words in all the web pages, and put them into an index, so that when someone gives you a query with a word, finding the pages that contain that word is fast.  To think about this problem in Map Reduce, imagine that we have a series of web pages, 1 to 1 billion, and each page is a URL and its content.  We'll write this as (URL, content); geeks call that a tuple.  In the Map step, we take each (URL, content) tuple, cut the content into words, and output a series of tuples (word, URL) for each word in the page.  After mapping, these tuples get sorted by word.  The Reduce step collects all the (word, URL) tuples for a single word and outputs the index fragment (word, (URL, URL, ...)), all the URLs for that word.  These get stored someplace where we can easily find them by word.  Voila, search engine.  Well, almost.  The key point is that Map-Reduce automates splitting up the work and sorting across a cloud, and gives you a way to think about how to break down problems efficiently in that modality.
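
To make the three steps concrete, here is a toy version in plain Python that runs on one machine.  It is a sketch of the idea, not Hadoop code: an in-memory sort stands in for the distributed sort, the function names are mine, and "cutting into words" is just lowercasing and splitting on whitespace.

```python
from itertools import groupby
from operator import itemgetter


def map_page(url, content):
    """Map step: emit one (word, URL) tuple per distinct word on the page."""
    for word in sorted(set(content.lower().split())):
        yield (word, url)


def reduce_word(word, urls):
    """Reduce step: collect every URL seen for a single word."""
    return (word, sorted(set(urls)))


def build_index(pages):
    """Run map, sort, and reduce over an iterable of (URL, content) tuples."""
    pairs = [pair for url, content in pages for pair in map_page(url, content)]
    pairs.sort()  # stands in for the sort Hadoop does between map and reduce
    return dict(
        reduce_word(word, [url for _, url in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    pages = [
        ("http://a.example/", "red fish blue fish"),
        ("http://b.example/", "one fish"),
    ]
    index = build_index(pages)
    print(index["fish"])
```

On a real cluster map_page and reduce_word would be the only parts you write yourself; the framework supplies the splitting and the sort that build_index fakes here.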

Lots of graph problems, like those that come up in social media, break down the same way.  Graphs in this context are collections of "nodes" and "edges", where edges connect two or more nodes.  Nodes might be people, and edges might be friendship relations.  Edges are tuples in the Map Reduce framework.  Sometimes, a special graph database is called for.  This can help for storing a large graph structure once and automating certain kinds of processing on the graph.  A good graph database, again free, is Neo4j.
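
As a small illustration of edges as tuples, here is a sketch that counts each person's friends in the same map, sort, reduce shape.  It assumes friendship edges are undirected pairs; the names and function names are invented.

```python
from itertools import groupby
from operator import itemgetter


def map_edge(edge):
    """Map step: each friendship edge credits both of its endpoints."""
    a, b = edge
    yield (a, 1)
    yield (b, 1)


def friend_counts(edges):
    """Sort the mapped tuples by person, then sum the counts per person."""
    pairs = sorted(pair for edge in edges for pair in map_edge(edge))
    return dict(
        (person, sum(n for _, n in group))
        for person, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]
    print(friend_counts(edges))
```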

A nice Windows tool for graph analytics is NodeXL.  It runs in Excel and allows exploration of small-to-medium datasets, well, as large as Excel might let you scale.  Its author, Marc Smith, is a prominent figure in social network study.

The Hadoop ecosystem includes lots of tools for thinking about big data problems as streams, databases, flow networks, and more.  My favorite tool for statistical analysis is R.  Folks have been working on extending it to Hadoop; I would appreciate some good links on this as I'm not too familiar with this yet. 

Renting a cloud in many cases makes more sense than building one.  Amazon's Elastic Map-Reduce allows you to essentially borrow computers at Amazon.com, link them together in a Map Reduce cloud, run your experiment, and store all your data on their cloud-based storage service.  RackSpace is a company which does managed hosting and will essentially rent you a cloud.  Other companies exist; again, the above are not an endorsement but just examples of what people use.

Further reading

Blogs: (others happily included, pls comment)

Academic conferences: (lots of conferences are concerned with big data these days... the following are primarily focused on web and social media.  Again, comments appreciated)

New Bass Day: Hohner B2B headless bass

Well, it actually arrived Friday, but things have been a little busy.

Guitar Center in Avondale, AZ had this used Hohner B2B for sale at a very good price. I called and spoke with Eddie Rojas, who I am singling out for incredible customer service. He got the bass and told me about it over the phone, and emailed me several pictures of the bridge, pickups, and body, which seemed to show generally good condition with maybe some light finish scratches. I agreed to buy it and arranged for shipping.

The bass arrived in a guitar box, wrapped and very well packed. And perfect looking. There is one finish blemish on an edge, but otherwise it looks brand new. The photos really highlighted textures in the poly which just aren't there. Tuners were easy to adjust, the strings didn't look too old, generally beautiful.

I brought the strings up to tension, tuned it approximately, and left it overnight to settle in. This morning I plugged it up. Details - this is a Steinberger clone and has a licensed Steiny bridge and headpiece, but that's pretty much where the similarity ends. It has a bolt-on maple neck, rosewood fretboard, maple body with a heavy poly finish, passive P-J pickups and VVT pots. Haven't looked at the wiring but I'm not expecting any surprises.

The tone through my little amp (Acoustic B20) was very nice, punchy E string, a very light touch. I really like the zero fret... this is my first instrument with one. Open strings sound like fretted notes. Sustain is nice but will undoubtedly be better with the new strings which are on the way. Bought a nice looking string adapter from eBay so I can use my preferred strings (DR). With the wood construction and passive pickups, it sounds like a straight-up PJ, but is nice and small, easy to travel with, and very light on the shoulder. I find the strap position ok for now, but I tend to play with my bass pretty high up... if you like things low-slung, I can see you might want to build a strap hook to put a button out between frets 12-17.

All in all, this is a very nice bass at what was an unbeatable price. I've seen these on the 'bay for $500-700, and the bridge itself would grab $150 at least!

 

Fixing Firefox error -203 during add-on installation

I spent a couple days fighting with Firefox 3.6.8 on a fresh install of Snow Leopard.  The problem was that add-ons would not install.  Faced with life without Adblock Plus and Delicious Bookmarks, I was momentarily driven to Safari, but nevertheless needed Firefox for development.

Mozilla know about this problem, and even have a webpage with a solution: http://support.mozilla.com/en-US/kb/Unexpected+installation+error+-203+when+installing+add-ons.  Unfortunately, it didn't work for me.  First, I tried removing the supposedly corrupted extensions database files from my profile.  Then, I tried creating a brand new profile.  But I still had the same error.

It turned out that restarting Firefox was not sufficient.  I had to entirely reboot my machine.  It took me a day to figure this out, because it's a Mac... I don't reboot it, ever.  Well, ok, when there's a major patch and Software Update says reboot.  But that's it.  If I want an OS that needs to reboot, I know where to find it.

So the exact steps, for me, were:

  1. Create a new profile.  Do this by running /Applications/Firefox.app/Contents/MacOS/firefox-bin -P in a terminal window.  This starts the profile manager, and you can create a brand new profile and delete your old one.
  2. Quit Firefox.
  3. Reboot (sigh)

After logging in and starting Firefox, I could finally install extensions again.

(Edit, Aug 26) I found today that I couldn't install plugins again, same ^&#(*&@ error!  A reboot cleared it up.  I have no idea why this is happening... I am going to chalk it up to a bug in FF 3.6.8.