Import the knowledge of the world into Couchbase


I have spent some time experimenting with Couchbase again. The result is a tool that imports items from Wikidata: Wikidata-Couchbase-Importer (WCI), available in a GitHub repository.


Wikidata aims to create a free knowledge base about the world that can be read and edited by humans and machines alike. There are several ways to access the data from Wikidata. For example, you may access data per item.

This is the Wikidata item for Java: http://www.wikidata.org/wiki/Q251.

The same data in JSON format: https://www.wikidata.org/wiki/Special:EntityData/Q251.json
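As a rough sketch of per-item access, the following JavaScript builds the Special:EntityData URL for an item ID and picks the item out of a parsed response. The helper names `entityDataUrl` and `extractEntity` are made up for illustration, and the `entities` wrapper is an assumption about the response shape that is worth checking against the actual JSON for Q251. The sample object is hand-written, not real Wikidata content.

```javascript
// Hypothetical helper: build the per-item JSON URL for a numeric item ID.
function entityDataUrl(numericId) {
  if (!Number.isInteger(numericId) || numericId < 1) {
    throw new Error("item ID must be a positive integer: " + numericId);
  }
  return "https://www.wikidata.org/wiki/Special:EntityData/Q" + numericId + ".json";
}

// Hypothetical helper: pick one item out of a parsed response.
// Assumes the response wraps items as { entities: { "Q251": { ... } } }.
function extractEntity(response, numericId) {
  var entities = (response && response.entities) || {};
  return entities["Q" + numericId] || null;
}

// Usage with a hand-written stand-in for the real response.
var sample = {
  entities: {
    Q251: { id: "Q251", labels: { en: { language: "en", value: "Java" } } }
  }
};
console.log(entityDataUrl(251));
console.log(extractEntity(sample, 251).labels.en.value);
```

Keeping the URL building and the extraction in separate functions means the extraction can be tried on a saved JSON file without any network access.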

With WCI you can import Wikidata items in JSON format into Couchbase. Follow these steps to build and run WCI:

name@servant:~$ git clone https://github.com/murygin/wikidata-couchbase-importer.git
name@servant:~$ cd wikidata-couchbase-importer
name@servant:~$ mvn package
name@servant:~$ cd target
name@servant:~$ java -jar wci.jar \
[-u couchbase_url_1[,couchbase_url_2]] \
[-b bucket] \
[-f first_id] \
[-l last_id]

The only prerequisite for running the application is a running Couchbase Server. For Couchbase installation instructions, please consult the documentation: "Chapter 2. Installing and Upgrading".

Run this command to show a list of all options:

name@servant:~$ java -jar wci.jar -h
usage: java -jar wci.jar [-u <couchbase_url>] [-b <bucket>] [-f <first_id>] 
                         [-l <last_id>]
Wikidata couchbase importer (WCI), Copyright (c) 2014 Daniel Murygin.
 -b,--bucket <arg>    Bucket name (default: wikidata)
 -f,--first <arg>     First wikidata item id (default: 1)
 -h,--help            Show help
 -l,--last <arg>      Last wikidata item id (default: first wikidata id)
 -t,--threads <arg>   Number of parallel threads (default 5)
 -u,--urls            Couchbase URL(s), separated by ',' (default: http://127.0.0.1:8091/pools)
For more instructions, see: https://murygin.wordpress.com/2014/02/22/wikidata-couchbase-importer/

To speed up the import, WCI uses multiple threads. With 15 parallel threads, WCI imports 30 items per second on my notebook with a locally installed Couchbase server. Wikidata has 15,000,000 items at the moment, so importing all of them would take about 139 hours (almost 6 days). So far I have imported 120,000 items, and my bucket size is 309 MB. Extrapolated linearly, the bucket for all 15,000,000 items would be about 38 GB.
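The estimates above are simple linear extrapolations from the measured numbers. A minimal sketch of the arithmetic, assuming that throughput and average document size stay constant over the whole import (the function names are only for illustration):

```javascript
// Numbers measured or quoted in the text above.
var TOTAL_ITEMS = 15000000;  // items in Wikidata at the time of writing
var ITEMS_PER_SECOND = 30;   // import rate with 15 parallel threads
var IMPORTED_ITEMS = 120000; // items imported so far
var BUCKET_SIZE_MB = 309;    // bucket size for the imported items

// Hours needed to import all items at a constant rate.
function estimateImportHours(totalItems, itemsPerSecond) {
  return totalItems / itemsPerSecond / 3600;
}

// Bucket size in MB if the average document size stays the same.
function estimateBucketSizeMb(totalItems, importedItems, bucketSizeMb) {
  return bucketSizeMb * totalItems / importedItems;
}

var hours = estimateImportHours(TOTAL_ITEMS, ITEMS_PER_SECOND);
var sizeMb = estimateBucketSizeMb(TOTAL_ITEMS, IMPORTED_ITEMS, BUCKET_SIZE_MB);
console.log(hours.toFixed(1) + " hours, " + (hours / 24).toFixed(1) + " days");
console.log(sizeMb + " MB");
```

In practice, the real figures will differ, because item sizes vary and the import rate depends on the network and on the machine.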

After importing some data you can start to query the database. See this chapter from the Couchbase Developer Guide to find out more about finding data: "Finding Data with Views". In short, you can index and query JSON documents in Couchbase using views. Views are functions written in JavaScript.

This is an example of a view that lists persons by occupation. It emits one row for each combination of an item's P31 (instance of) and P106 (occupation) values:

function (doc, meta) {
  // Skip documents that aren't JSON or that have no claims
  if (meta.type == "json" && doc.claims) {
    if (doc.claims.P31 && doc.claims.P106) {
      // Declare loop variables with var so they don't leak as globals
      for (var p31Id in doc.claims.P31) {
        var p31 = doc.claims.P31[p31Id].mainsnak.datavalue.value;
        for (var p106Id in doc.claims.P106) {
          var p106 = doc.claims.P106[p106Id].mainsnak.datavalue.value;
          // Key: [instance of, occupation]; value: [item id, English label]
          emit([p31, p106], [doc.id, doc.labels.en.value]);
        }
      }
    }
  }
}
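Before publishing a view, it can help to run its map function outside Couchbase. The sketch below does that with a stand-in `emit` that collects rows in an array. The sample document is hand-written and only assumes the fields the view reads (`id`, `labels.en.value`, and `claims` with a `mainsnak.datavalue.value` per claim). It is not real Wikidata content, and the item ID `Q0` and the second occupation value are placeholders.

```javascript
var rows = [];

// Stand-in for the emit() function that Couchbase provides to views.
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// The view's map function, given a name here so it can be called directly.
function map(doc, meta) {
  if (meta.type == "json" && doc.claims) {
    if (doc.claims.P31 && doc.claims.P106) {
      for (var p31Id in doc.claims.P31) {
        var p31 = doc.claims.P31[p31Id].mainsnak.datavalue.value;
        for (var p106Id in doc.claims.P106) {
          var p106 = doc.claims.P106[p106Id].mainsnak.datavalue.value;
          emit([p31, p106], [doc.id, doc.labels.en.value]);
        }
      }
    }
  }
}

// Hypothetical helper to build one claim with an item value.
function itemClaim(numericId) {
  return { mainsnak: { datavalue: { value: { "entity-type": "item", "numeric-id": numericId } } } };
}

// Hand-written sample: one P31 value and two P106 values.
var sampleDoc = {
  id: "Q0",
  labels: { en: { language: "en", value: "Example Person" } },
  claims: { P31: [itemClaim(5)], P106: [itemClaim(33999), itemClaim(1)] }
};

map(sampleDoc, { type: "json" });
console.log(JSON.stringify(rows, null, 2));
```

With one P31 value and two P106 values the function emits two rows, one per combination.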

The key to find persons who are actors, that is humans (Q5) with the occupation actor (Q33999), is: [{"entity-type":"item","numeric-id":5},{"entity-type":"item","numeric-id":33999}].
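To look up that key, the view can be queried over Couchbase's view REST interface. Below is a minimal sketch of building the query URL. It assumes the view is published as `persons` in a design document named `wikidata`, and that the view API is reachable on port 8092 of the local machine. The host, port, design document name, view name and the helper `viewQueryUrl` are all assumptions for illustration, not something WCI sets up.

```javascript
// Hypothetical helper: build a view query URL for an exact key match.
// The key is serialized as JSON and URL-encoded.
function viewQueryUrl(baseUrl, bucket, designDoc, view, key) {
  return baseUrl + "/" + encodeURIComponent(bucket) +
    "/_design/" + encodeURIComponent(designDoc) +
    "/_view/" + encodeURIComponent(view) +
    "?key=" + encodeURIComponent(JSON.stringify(key));
}

// Key for persons who are actors, as given in the text.
var actorKey = [
  { "entity-type": "item", "numeric-id": 5 },
  { "entity-type": "item", "numeric-id": 33999 }
];

var url = viewQueryUrl("http://127.0.0.1:8092", "wikidata", "wikidata", "persons", actorKey);
console.log(url);
```

The printed URL can be opened in a browser or fetched with any HTTP client once the view has been published.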
