Sunday, November 25, 2012

Hadoop Single Node Set Up

With this post I hope to share the procedure for setting up Apache Hadoop on a single node. Hadoop is used for dealing with Big Data sets, deployed on low-cost commodity hardware. It is a map-reduce framework which maps segments of a job among the nodes of a cluster for execution. Though we will not see the real power of Hadoop running it on a single node, it is the first step towards a multi-node cluster. A single node set-up is useful for getting familiar with the operations and for debugging applications for accuracy, but the performance may be far lower than what is achievable.

I am sharing the steps to be followed on a Linux system, as Linux is supported both as a development and a production platform for Hadoop. Win32 is supported only as a development platform, and equivalent commands need to be used in place of the Linux commands given here. The official Hadoop documentation covers the set-up in brief; here I am sharing detailed guidance for the process, with the following line-up.

  • Pre-requisites
  • Hadoop Configurations
  • Running the Single Node Cluster
  • Running a Map-reduce job


Pre-requisites

Java 1.6.x

Java needs to be installed on your node in order to run Hadoop, as it is a Java based framework. We can check the Java installation of the node by,
pushpalanka@pushpalanka-laptop:~$ java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)

If this does not give the intended output, we need to install Java by running,
pushpalanka@pushpalanka-laptop:~$ sudo apt-get install sun-java6-jdk
or by running a downloaded Java installation binary. Then set the environment variable JAVA_HOME in the relevant configuration file (.bashrc for Linux Bash shell users).

Create a Hadoop User

At this step we create a user account dedicated to the Hadoop installation. This is not a "must do" step, but it is recommended for security reasons and for ease of managing the nodes. So we create a group called 'hadoop' and add a new user called 'hpuser' to the group (the names can be of our choice) with the following commands.
pushpalanka@pushpalanka-laptop:~$ sudo addgroup hadoop
pushpalanka@pushpalanka-laptop:~$ sudo adduser --ingroup hadoop hpuser
pushpalanka@pushpalanka-laptop:~$ su hpuser

From now on we work as hpuser, switched to by the last command.

Enable SSH Access

Hadoop requires SSH (Secure Shell) access to the machines it uses as nodes, to create a secured channel for exchanging data. So even in a single node set-up, localhost needs to give SSH access to hpuser in order to exchange data for Hadoop operations. Refer to the SSH documentation if you need more details on SSH.
We can install SSH with the following command.
hpuser@pushpalanka-laptop:~$ sudo apt-get install ssh

Now let's try to SSH the localhost without a pass-phrase.
hpuser@pushpalanka-laptop:~$ ssh localhost
Linux pushpalanka-laptop 2.6.32-33-generic #72-Ubuntu SMP Fri Jul 29 21:08:37 UTC 2011 i686 GNU/Linux
Ubuntu 10.04.4 LTS

Welcome to Ubuntu!

If it does not give something similar to the above, we have to enable SSH access to the localhost as follows.
Generate an RSA key pair with an empty pass-phrase. (We leave out the pass-phrase only because otherwise Hadoop would prompt for it each time it communicates with the node.)
hpuser@pushpalanka-laptop:~$ ssh-keygen -t rsa -P ""

Then we need to concatenate the generated public key onto the authorized keys list of the localhost. It is done as follows; afterwards make sure 'ssh localhost' succeeds.
hpuser@pushpalanka-laptop:~$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

Now we are ready to move on to Hadoop. :) Use the latest stable release from the Hadoop site and extract it; I used hadoop-1.0.3.

Hadoop Configurations

Set Hadoop Home

As we set JAVA_HOME at the beginning of this post, we need to set HADOOP_HOME too. Open the same file (.bashrc) and add the following two lines at the end (the PATH addition is the usual convention, so the Hadoop scripts can be run from anywhere).
export HADOOP_HOME=<absolute path to the extracted hadoop distribution>
export PATH=$PATH:$HADOOP_HOME/bin

Disable IPv6

As there is no practical need for IPv6 addressing inside a Hadoop cluster, we disable it to be less error-prone. Open the file HADOOP_HOME/conf/ and add the following line.

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Set Paths and Configuration

In the same file ( set JAVA_HOME too.
export JAVA_HOME=<path to the java installation>

Then we need to create a directory to be used as HDFS (Hadoop Distributed File System) storage. Make a directory at a place of your choice and make sure its owner is 'hpuser'. I'll refer to this as 'temp_directory'. We can do it by,
hpuser@pushpalanka-laptop:~$ sudo chown hpuser:hadoop <absolute path to temp_directory>

Now we add the following property segments in the relevant configuration files, inside the <configuration>...</configuration> tags. In conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>path to temp_directory</value>
  <description>Location for HDFS.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation.</description>
</property>

In conf/mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.</description>
</property>

In conf/hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default number of block replications.</description>
</property>

The replication value needs to be decided based on the priority among speed, space and fault tolerance factors; for a single node cluster, 1 is enough.

Format HDFS

This operation is needed every time we create a new Hadoop cluster. If we do this operation on a running cluster, all the data will be lost. What this basically does is create the Hadoop Distributed File System over the local file system of the cluster.
hpuser@pushpalanka-laptop:~$ <HADOOP_HOME>/bin/hadoop namenode -format
12/09/20 14:39:56 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = pushpalanka-laptop/
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
...
12/09/20 14:39:57 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at pushpalanka-laptop/

If the output is something like the above, HDFS is formatted successfully. Now we are almost done and ready to see the action.

Running the Single Node Cluster

hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ ./

and you will see,
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ ./
starting namenode, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-namenode-pushpalanka-laptop.out
localhost: starting datanode, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-datanode-pushpalanka-laptop.out
localhost: starting secondarynamenode, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-secondarynamenode-pushpalanka-laptop.out
starting jobtracker, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-jobtracker-pushpalanka-laptop.out
localhost: starting tasktracker, logging to /home/hpuser/hadoop-1.0.3/libexec/../logs/hadoop-hpuser-tasktracker-pushpalanka-laptop.out

which starts several nodes and trackers. We can observe these Hadoop processes using the jps tool from Java.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ jps
5767 NameNode
6619 Jps
6242 JobTracker
6440 TaskTracker
6155 SecondaryNameNode
5958 DataNode

Also we can check the ports Hadoop is configured to listen on,
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ sudo netstat -plten | grep java
[sudo] password for hpuser: 
tcp        0      0 *               LISTEN      1001       175415      8164/java       
tcp        0      0 *               LISTEN      1001       174566      7767/java       
tcp        0      0 *               LISTEN      1001       176269      7962/java

There are several web interfaces available to observe internal behavior, job completion, memory consumption etc., as follows.

  • http://localhost:50070/ – web UI of the NameNode
  • http://localhost:50030/ – web UI of the JobTracker
  • http://localhost:50060/ – web UI of the TaskTracker

At the moment these will not show much information, as we have not yet submitted any job whose progress we could observe.

We can stop the cluster at any time with the following command.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3/bin$ ./
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Running a Map-reduce Job

Let's try to get some work done using Hadoop. We can use the word count example distributed with Hadoop, as explained at the Hadoop wiki. In brief, it takes some text files as input, counts the number of occurrences of each distinct word (mapping each line to a mapper) and outputs the reduced result as a text file.
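The counting logic itself is simple. The sketch below is plain Java rather than the Hadoop mapper/reducer API, just to illustrate what the example job does: a map step emits a (word, 1) pair per word, and a reduce step sums the pairs per distinct word.

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for every word on a line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Reduce" phase: sum the counts of each distinct word.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog");
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        System.out.println(reduce(pairs)); // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In the real job the map and reduce calls run as distributed tasks over HDFS blocks; this only shows the data flow.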
First create a folder and copy into it some .txt files containing several tens of thousands of words. We are going to count the occurrences of each distinct word in them.
hpuser@pushpalanka-laptop:~/tempHadoop$ ls -l
total 1392
-rw-r--r-- 1 hpuser hadoop 1423810 2012-09-21 02:10 pg5000.txt

Then restart the cluster, using as done before.
Copy the sample input file to HDFS.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop dfs -copyFromLocal /home/hpuser/tempHadoop/ /user/hpuser/testHadoop

Let's check whether it is correctly copied to HDFS,
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hpuser/testHadoop
Found 1 items
-rw-r--r--   1 hpuser supergroup    1423810 2012-09-21 02:17 /user/hpuser/testHadoop/pg5000.txt

Now the inputs are ready. Let's run the map-reduce job. For this we use a jar distributed with Hadoop which is written to do the needful; we can later refer to it and learn how things are done.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hpuser/testHadoop /user/hpuser/testHadoop-output
12/09/21 02:24:34 INFO input.FileInputFormat: Total input paths to process : 1
12/09/21 02:24:34 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/09/21 02:24:34 WARN snappy.LoadSnappy: Snappy native library not loaded
12/09/21 02:24:34 INFO mapred.JobClient: Running job: job_201209210216_0003
12/09/21 02:24:35 INFO mapred.JobClient:  map 0% reduce 0%
12/09/21 02:24:51 INFO mapred.JobClient:  map 100% reduce 0%
12/09/21 02:25:06 INFO mapred.JobClient:  map 100% reduce 100%
12/09/21 02:25:11 INFO mapred.JobClient: Job complete: job_201209210216_0003
12/09/21 02:25:11 INFO mapred.JobClient: Counters: 29
12/09/21 02:25:11 INFO mapred.JobClient:   Job Counters 
12/09/21 02:25:11 INFO mapred.JobClient:     Launched reduce tasks=1
12/09/21 02:25:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=17930
12/09/21 02:25:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/21 02:25:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/09/21 02:25:11 INFO mapred.JobClient:     Launched map tasks=1
12/09/21 02:25:11 INFO mapred.JobClient:     Data-local map tasks=1
12/09/21 02:25:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14153
12/09/21 02:25:11 INFO mapred.JobClient:   File Output Format Counters 
12/09/21 02:25:11 INFO mapred.JobClient:     Bytes Written=337639
12/09/21 02:25:11 INFO mapred.JobClient:   FileSystemCounters
12/09/21 02:25:11 INFO mapred.JobClient:     FILE_BYTES_READ=466814
12/09/21 02:25:11 INFO mapred.JobClient:     HDFS_BYTES_READ=1423931
12/09/21 02:25:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=976811
12/09/21 02:25:11 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=337639
12/09/21 02:25:11 INFO mapred.JobClient:   File Input Format Counters 
12/09/21 02:25:11 INFO mapred.JobClient:     Bytes Read=1423810
12/09/21 02:25:11 INFO mapred.JobClient:   Map-Reduce Framework
12/09/21 02:25:11 INFO mapred.JobClient:     Map output materialized bytes=466814
12/09/21 02:25:11 INFO mapred.JobClient:     Map input records=32121
12/09/21 02:25:11 INFO mapred.JobClient:     Reduce shuffle bytes=466814
12/09/21 02:25:11 INFO mapred.JobClient:     Spilled Records=65930
12/09/21 02:25:11 INFO mapred.JobClient:     Map output bytes=2387668
12/09/21 02:25:11 INFO mapred.JobClient:     CPU time spent (ms)=9850
12/09/21 02:25:11 INFO mapred.JobClient:     Total committed heap usage (bytes)=167575552
12/09/21 02:25:11 INFO mapred.JobClient:     Combine input records=251352
12/09/21 02:25:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
12/09/21 02:25:11 INFO mapred.JobClient:     Reduce input records=32965
12/09/21 02:25:11 INFO mapred.JobClient:     Reduce input groups=32965
12/09/21 02:25:11 INFO mapred.JobClient:     Combine output records=32965
12/09/21 02:25:11 INFO mapred.JobClient:     Physical memory (bytes) snapshot=237834240
12/09/21 02:25:11 INFO mapred.JobClient:     Reduce output records=32965
12/09/21 02:25:11 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=778846208
12/09/21 02:25:11 INFO mapred.JobClient:     Map output records=251352

While the job is running, if we try out the previously mentioned web interfaces we can observe the progress and the utilization of resources in a summarized way. As given in the command, the output file that carries the word counts is written to '/user/hpuser/testHadoop-output'.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hpuser/testHadoop-output
Found 3 items
-rw-r--r--   1 hpuser supergroup          0 2012-09-21 02:25 /user/hpuser/testHadoop-output/_SUCCESS
drwxr-xr-x   - hpuser supergroup          0 2012-09-21 02:24 /user/hpuser/testHadoop-output/_logs
-rw-r--r--   1 hpuser supergroup     337639 2012-09-21 02:25 /user/hpuser/testHadoop-output/part-r-00000

To see what is inside the file, let's get it copied to the local file system.
hpuser@pushpalanka-laptop:~/hadoop-1.0.3$ bin/hadoop dfs -getmerge /user/hpuser/testHadoop-output /home/hpuser/tempHadoop/out 
12/09/21 02:38:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library

Here the getmerge option is used to merge the output if there are several files, given the HDFS location of the output folder and the desired output location in the local file system. Now you can browse to the given output folder and open the result file, which will look something like the following, according to your input file.
"1      1
"1,"    1
"35"    1
"58,"   1
"AS".   1
"Apple  1
"Abs    1
"Ah!    1

Now we have completed setting up a single node cluster with Apache Hadoop and running a map-reduce job on it. In a next post I'll share how to set up a multi-node cluster.

Original Source

Hadoop Single Node Set-up - Guchex
By Pushpalanka Jayawardhana

Tuesday, August 28, 2012

Visualizing Code in Eclipse (Using Architexa)

I am sharing here a nice tool I recently found for drawing the diagrams for my final year project at university. Managing the project is challenging because,
  • the code base of the project is pretty large
  • a team is working on the project
  • we need to maintain the quality and performance of the project
  • proper documentation is also evaluated
So we were searching for an easy-to-use, nice-looking tool that would serve these purposes without being annoying, so that we could focus more on logic and optimization than on decorating a diagram. :) That is where I found this Eclipse plugin called Architexa client, which is free to use.

Following is a sequence diagram I generated using the plugin, which didn't take more than 30s to draw: just dragging in the relevant classes and setting the relevant calls.

Following is a generated class diagram, in which you can observe the clean and smart look of the diagrams.


  1. Of the several tools I have tried, this was the fastest, and the diagram quality is great.
  2. It gives you flexibility in deciding how the diagrams need to be drawn, and handles the rest on its own.
  3. It provides sharing options that I am eager to try out.
  4. There are several options to save and share the diagrams as images or files inside the project.
Rather than calling these cons, it is more suitable to call them possible improvements, which I haven't seen addressed in any other freely available tool I tried.
  • It is a little inconvenient that generated members are not laid out in an orderly manner; they always have to be placed in order manually, as in the following diagram.
Finally, drawing diagrams has become fun, up-to-date, truly useful and effective for code development for me. You can check out whether it fits you from here.
I will share more experience with Architexa as I get more familiar with it. Cheers!

Monday, August 20, 2012

Generating Key Pairs and Importing Public Key Certificates to a Trusted Keystore

Through this post I am sharing the simplest scenario for using the Java keytool for the requirements of Apache Wookie's digital signature implementation. Even if you are just looking to learn how to generate a key pair or import a certificate into a keystore using keytool, this may still be helpful. Refer to this segment of the Java SE documentation for in-depth details.

You need a Java installation on your computer to use keytool, and that is enough. :)

Generating Key Pairs

Use the following command at the command prompt to generate a key pair with a self-signed certificate.
keytool -genkey -alias wookie -keyalg RSA -keystore wookieKeystore.jks -keysize 4096
Here,
  -alias gives the alias to be used for the keys
  -keyalg gives the algorithm to be used in key generation
  -keystore gives the name of the keystore, of type .jks (you can give a path here to store the keystore in a desired place)
  -keysize gives the length of the generated key in bits
This will look something as follows,

That's all; you have a key pair now. :)
In the context of Wookie, you can now sign widgets using this keystore. But in order to get your widgets verified and deployed in the Wookie server, you need to get your public key trusted by the server, directly or via a third party.

Generating .cer File

To insert a public key certificate into a trusted keystore, it needs to be exported as a .cer file (there are several other options too).
keytool -v -export -file keystore1.cer -keystore wookieKeystore.jks -alias wookie

Importing Public Key Certificates to a Trusted Keystore

To import a trusted certificate into a trusted keystore, the following command can be used.
keytool -import -alias keystore1 -file keystore1.cer -keystore wookieKeystore.jks
This command simply imports the public key certificate contained in the file keystore1.cer into the keystore 'wookieKeystore.jks', under the alias 'keystore1'.

Now any signature generated using the corresponding private key can be properly validated using wookieKeystore.jks.
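This private-key/public-key relationship is easy to demonstrate with the JDK's java.security API. The sketch below is only an illustration of the principle (keytool additionally wraps the public key in a certificate): data signed with the private key verifies only against the matching public key.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class SignVerifySketch {
    // Sign some data with a freshly generated private key, then
    // verify it with the matching public key, as a server would.
    public static boolean roundTrip(byte[] data) {
        try {
            // keytool -genkey does this key generation (plus a self-signed certificate)
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
            kpg.initialize(2048); // the post uses 4096 bits; 2048 keeps the sketch fast
            KeyPair pair = kpg.generateKeyPair();

            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(pair.getPrivate());
            signer.update(data);
            byte[] sig = signer.sign();

            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(pair.getPublic());
            verifier.update(data);
            return verifier.verify(sig);
        } catch (GeneralSecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("widget content".getBytes())); // prints "true"
    }
}
```

Importing the certificate into the server's trusted keystore is what lets the server obtain that public key for verification.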

Sunday, August 19, 2012

Apache Wookie W3C Widget Digital Signature Implementation - GSoC2012

I am sharing here the details of my GSoC 2012 project, which I enjoyed a lot while learning. This includes a brief introduction to the project with its design, implementation and usage guidance. The implementation has been done using the Apache Santuario 1.5.2 release to generate digital signatures. The project was mentored by Scott Wilson, who was very helpful, and supported by the community. This is Scott Wilson's post on the project.

What is Apache Wookie?

Apache Wookie is a Java server application that allows widget uploading and deployment, useful for a variety of applications, from gadgets to quizzes and games. Wookie is based on the W3C widget specification while providing flexibility for Google Wave gadgets etc. At the time of writing, it is in the Apache Incubator, getting ready to graduate as an Apache project.

Objective of the project

The objective of this project is to implement the 'W3C XML Digital Signatures for Widgets' specification in Wookie, which had been reported as a new feature request. This means automating widget upload and deployment in a secured manner, without needing an administrator to approve each widget. If a widget store or an author tries to deploy a widget, Wookie needs to be capable of determining whether it is a trusted party and allowing or rejecting the deployment as appropriate.


The best way to achieve this is to let the widgets' content be digitally signed, and to verify the signatures at Wookie before deployment. In this way the basic security needs of integrity and non-repudiation can be achieved, according to the W3C XML digital signatures for widgets specification. So a new module, digsig-client, has been introduced to Wookie to allow authors and distributors to sign the content of widgets. It provides a user-friendly Java Swing UI to collect the inputs from the signer, processes them according to the W3C spec and packages the result into a .wgt file ready to be deployed on the Wookie server. The client is developed according to the Model-View pattern, separating the signing logic and the UI with a coordinator between them.

Administrators can then set the security level of the server: whether to reject widgets that do not submit a valid signature, or to allow them with a warning.



The signature is generated according to the exact details specified in the spec:
  • Signature algorithm - RSA, using RSAwithSHA256
  • Key length - 4096 bits for RSA
  • Digest method - SHA-256
  • Canonicalization algorithm - Canonical XML Version 1.1 (omits comments)
  • Certificate format - X.509 version 3

Implementation has been done using the Apache Santuario 1.5.2 release. To achieve the highest possible security, the public key certificates of trusted signers need to be imported into the trusted keystore of Wookie. You can access the code in the current SVN of Apache Wookie; it can also be used to generate detached signatures on any other content.
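To get a feel for what signing and verifying XML amounts to, here is a minimal round-trip using the JDK's built-in javax.xml.crypto.dsig API (the JDK bundles an Apache Santuario implementation). It is only a sketch, not Wookie's code: Wookie produces detached signatures over widget files per the W3C widget digsig spec, whereas this signs a toy document with an enveloped signature, and it uses Canonical XML 1.0 rather than the 1.1 the spec requires.

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Collections;
import javax.xml.crypto.dsig.*;
import javax.xml.crypto.dsig.dom.DOMSignContext;
import javax.xml.crypto.dsig.dom.DOMValidateContext;
import javax.xml.crypto.dsig.spec.C14NMethodParameterSpec;
import javax.xml.crypto.dsig.spec.TransformParameterSpec;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlDsigSketch {
    // Sign a small document, then validate the signature with the public key.
    public static boolean signAndValidate() {
        try {
            XMLSignatureFactory fac = XMLSignatureFactory.getInstance("DOM");

            // One reference over the whole document, digested with SHA-256.
            Reference ref = fac.newReference("",
                    fac.newDigestMethod(DigestMethod.SHA256, null),
                    Collections.singletonList(
                            fac.newTransform(Transform.ENVELOPED, (TransformParameterSpec) null)),
                    null, null);

            SignedInfo si = fac.newSignedInfo(
                    fac.newCanonicalizationMethod(CanonicalizationMethod.INCLUSIVE,
                            (C14NMethodParameterSpec) null),
                    fac.newSignatureMethod("http://www.w3.org/2001/04/xmldsig-more#rsa-sha256", null),
                    Collections.singletonList(ref));

            KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
            kpg.initialize(2048);
            KeyPair kp = kpg.generateKeyPair();

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();
            Element root = doc.createElement("widget");
            doc.appendChild(root);

            // Sign: the ds:Signature element is appended under the root.
            DOMSignContext dsc = new DOMSignContext(kp.getPrivate(), root);
            fac.newXMLSignature(si, null).sign(dsc);

            // Validate against the public key, as the server side would.
            Element sigEl = (Element) doc.getElementsByTagNameNS(
                    XMLSignature.XMLNS, "Signature").item(0);
            DOMValidateContext vc = new DOMValidateContext(kp.getPublic(), sigEl);
            return fac.unmarshalXMLSignature(vc).validate(vc);
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(signAndValidate());
    }
}
```

Wookie's verifier additionally looks up the signer's certificate in its trusted keystore instead of taking the key from the caller.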


Wookie supports 2 methods of widget deployment:
  • Drop the .wgt into /deploy
  • POST with attachment to /wookie/widgets
Widgets sent in both ways are verified in the same manner, starting from the W3CWidgetFactory class. It bends the flow through the DigitalSignatureProcessor class if signature verification is turned on in the widgetserver.properties file. The processor identifies the signature files inside the widget and tries to verify each one, extracting the relevant public key certificate from the Wookie trusted store. According to the settings in widgetserver.properties, it decides whether to reject the widget or deploy it with a warning when an invalid widget is found.

In the processor it:

  • Checks for the presence of signature files
  • Categorizes them as author or distributor signatures, to understand which content should be addressed by each signature
  • Checks whether all content is signed according to the W3C widget digsig spec
  • Retrieves the trusted public key certificate from the Wookie trusted keystore
  • Throws exceptions or logs warnings when invalid signatures are encountered
  • Handles widget deployment accordingly



When the digsig-client is run, it shows a Swing UI as follows.

  • Select the role of the signer: author or distributor.
  • Point to the keystore which stores the relevant private key.
  • Give its password, followed by the private key alias.
  • If the private key password is the same as the keystore password, and the certificate alias is the same as the private key alias, you can leave those blank.
  • Once you select the widget folder from the file chooser, digsig-client shows the files selected to be signed from the widget according to the role. It automatically skips files starting with '.' and signature files, as mentioned in the W3C spec.
  • Then you can give the name of the widget.
  • Once signed, the <name>.wgt file is stored in the widget path, including the signed content and the generated signature file, ready to be sent for deployment as follows.

On the Wookie server side, administrators need to set the following properties according to the security requirements of their context. These are located in the widgetserver.properties file.
# digital signature settings
# Set this property to have Wookie check widget digital signatures when
# deploying a widget
# Set this property to determine how Wookie treats widgets with invalid
# digital signatures.
# If set to true, Wookie will not deploy any widgets with invalid
# digital signatures. If set to false, the widget will be imported
# and a warning logged.
# If set to true, Wookie will only deploy Widgets that have valid digital signatures
# AND that each signature uses a trusted certificate located in the trusted keystore,
# disregarding setting the above to false.
# Name of the trusted keystore file
# Password for the trusted keystore file
widget.deployment.trustedkeystore.password = wookie
After that, the Wookie server will handle the deployment of widgets in a secured manner, given that the trusted public certificates have been imported into the trusted keystore of the Wookie server.
Following is a sample author signature generated by digsig-client.
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="AuthorSignature">
<ds:SignedInfo>
<ds:CanonicalizationMethod Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:CanonicalizationMethod>
<ds:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"></ds:SignatureMethod>
<ds:Reference URI="build.xml">
<ds:DigestMethod Algorithm="http://www.w3.org/2001/04/xmlenc#sha256"></ds:DigestMethod>
<ds:DigestValue>...</ds:DigestValue>
</ds:Reference>
<ds:Reference URI="config.xml">
<ds:DigestMethod Algorithm="http://www.w3.org/2001/04/xmlenc#sha256"></ds:DigestMethod>
<ds:DigestValue>...</ds:DigestValue>
</ds:Reference>
<ds:Reference URI="images/background.jpg">
<ds:DigestMethod Algorithm="http://www.w3.org/2001/04/xmlenc#sha256"></ds:DigestMethod>
<ds:DigestValue>...</ds:DigestValue>
</ds:Reference>
<ds:Reference URI="#prop">
<ds:Transforms>
<ds:Transform Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:Transform>
</ds:Transforms>
<ds:DigestMethod Algorithm="http://www.w3.org/2001/04/xmlenc#sha256"></ds:DigestMethod>
<ds:DigestValue>...</ds:DigestValue>
</ds:Reference>
</ds:SignedInfo>
<ds:SignatureValue>...</ds:SignatureValue>
<ds:KeyInfo>...</ds:KeyInfo>
<ds:Object Id="prop">
<ds:SignatureProperties xmlns:dsp="http://www.w3.org/ns/widgets-digsig">
<ds:SignatureProperty Id="profile" Target="#AuthorSignature"><dsp:Profile URI="http://www.w3.org/ns/widgets-digsig#profile"></dsp:Profile></ds:SignatureProperty>
<ds:SignatureProperty Id="role" Target="#AuthorSignature"><dsp:Role URI="http://www.w3.org/ns/widgets-digsig#role-author"></dsp:Role></ds:SignatureProperty>
<ds:SignatureProperty Id="identifier" Target="#AuthorSignature"><dsp:Identifier>test</dsp:Identifier></ds:SignatureProperty>
</ds:SignatureProperties>
</ds:Object>
</ds:Signature>

The relevant steps for generating the key pairs needed for digital signatures, and the procedure to import public certificates, can be found in this post.
Enjoy Wookie!

Sunday, July 01, 2012

FYP-Kanthaka - Big Data CDR Analyzer

This is the presentation our team 'Kanthaka' did at the major progress evaluation, after a thorough literature review on the technologies to be used in developing the big data CDR analyzer. It will be beneficial for anyone looking for technologies to deal with big volumes of data under near-real-time execution speed requirements. Kanthaka - High Volume CDR Analyzer

Friday, June 29, 2012

Bulk-Loading Data to Cassandra with sstable or JMX

The 'sstableloader', introduced in Apache Cassandra 0.8.1, provides a powerful way to load huge volumes of data into a Cassandra cluster. If you are moving from a cloud cluster to a dedicated cluster or vice versa, or from a different database to Cassandra, you will be interested in this tool. In any of those cases, if you can generate 'sstables' from the data to be loaded into Cassandra, you can bulk load them into the cluster using 'sstableloader'. I have tried it with version 1.1.2.

With this post I'll share my experience creating sstables from a .csv file and loading them into a Cassandra instance running on the same machine, which acts as the cluster here.
  1. sstable generation
  2. Bulk loading Cassandra using sstableloader
  3. Using JMX

'sstable' generation

To use 'SSTableSimpleUnsortedWriter', the 'cassandra.yaml' file should be present in the class path. In IntelliJ IDEA you can set this under Run --> Edit Configurations --> Application --> Configuration --> VM Options. There you should give the path to cassandra.yaml as follows.

-Dcassandra-foreground -Dcassandra.config=file:///<path to/apache-cassandra-1.1.2/conf/cassandra.yaml> -ea -Xmx1G

Here is the simple code to generate the sstables for the context I tried, referring to the documentation at DataStax. With just a few modifications you may be able to use it. I'll try to explain the code below.

SSTableSimpleUnsortedWriter eventWriter = new SSTableSimpleUnsortedWriter(directory, partitioner, keySpace, "Events", AsciiType.instance, null, 64);
This writer does not assume any order in the rows. Instead it buffers the rows in memory and writes them out in sorted order. You can define a threshold on the amount of data to be buffered, to avoid loading the entire data set into memory. Each time the threshold is reached, one sstable is created and the buffer is reset.

directory - the directory to write the sstables to
partitioner - the strategy to distribute data over the nodes. I have used RandomPartitioner, which uses an MD5 hash value to distribute data. This blog post may help you decide between RandomPartitioner and OrderPreservingPartitioner for your context. There are two more partitioners available.
keySpace - the keyspace name
"Events" - the name of the column family
AsciiType.instance - the column family comparator
null - the subComparator, set to null as this is not a super column family
64 - the buffer size in MB; this should be tuned per context to achieve the best performance
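The buffering behaviour described above can be sketched in plain Java (this is an illustration of the idea, not the Cassandra API): rows accumulate unsorted in a buffer, and whenever the threshold is hit, the buffer is written out in sorted order as one "sstable".

```java
import java.util.*;

public class BufferedSortedWriter {
    private final int threshold;                       // rows to buffer before a flush
    private final Map<String, String> buffer = new HashMap<>();
    public final List<SortedMap<String, String>> flushedTables = new ArrayList<>();

    public BufferedSortedWriter(int threshold) {
        this.threshold = threshold;
    }

    // Add a row; flush automatically once the threshold is reached.
    public void newRow(String key, String value) {
        buffer.put(key, value);
        if (buffer.size() >= threshold) flush();
    }

    // Write the buffered rows out in sorted order, then reset the buffer.
    public void flush() {
        if (buffer.isEmpty()) return;
        flushedTables.add(new TreeMap<>(buffer));      // one sorted "sstable"
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferedSortedWriter w = new BufferedSortedWriter(2);
        w.newRow("k3", "v3");
        w.newRow("k1", "v1");   // threshold reached: {k1, k3} flushed, sorted
        w.newRow("k2", "v2");
        w.flush();              // remaining {k2} flushed
        System.out.println(w.flushedTables);
    }
}
```

The real writer of course measures the threshold in megabytes and serializes to the on-disk sstable format, but the flush-sorted-then-reset cycle is the same.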

With the following code we create the rows and add the columns of each row according to the entry read from the .csv file. As per the Cassandra wiki, one row can have up to 2 billion columns like this.

eventWriter.addColumn(bytes("sourceAdd"),bytes(entry.sourceAdd), timestamp);
eventWriter.addColumn(bytes("sourceChannelType"),bytes(entry.sourceChannelType), timestamp);

The static nested class CsvEntry is used to read just the relevant fields from the csv row.
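The post does not list the CsvEntry code, but a minimal sketch could look as follows. The field names sourceAdd and sourceChannelType come from the addColumn calls above, while the column positions in the csv row are assumptions for illustration.

```java
public class CsvParseSketch {
    // Hypothetical nested class: parses one csv row and keeps only
    // the fields the sstable writer needs.
    public static class CsvEntry {
        public final String sourceAdd;
        public final String sourceChannelType;

        private CsvEntry(String sourceAdd, String sourceChannelType) {
            this.sourceAdd = sourceAdd;
            this.sourceChannelType = sourceChannelType;
        }

        // Assumes sourceAdd is the first column and sourceChannelType the second.
        public static CsvEntry parse(String csvRow) {
            String[] fields = csvRow.split(",", -1);
            return new CsvEntry(fields[0].trim(), fields[1].trim());
        }
    }

    public static void main(String[] args) {
        CsvEntry e = CsvEntry.parse("94771234567, SMS, other, ignored, fields");
        System.out.println(e.sourceAdd + " / " + e.sourceChannelType);
    }
}
```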

Once you run the code pointing to the csv file, a directory is created at the location you specified as 'directory'. Inside it you will find the created sstables.

Bulk loading Cassandra using sstableloader

Inside the bin directory of Cassandra you can find the sstableloader tool. You can run it from the command line, pointing to the above generated sstables. Good guidance on that can be found at DataStax and in this blog. You can also use the underlying class directly in Java code to load the sstables into a Cassandra cluster.

If you are testing all this on localhost, the following steps need to be taken to try out sstableloader.

  • Get a copy of the running Cassandra instance
  • Set up another loop-back address. In Linux you can do it using,
sudo ifconfig lo:0 <new loop-back IP> netmask <netmask> up
  • Set the rpc address and listen address of the copied /conf/cassandra.yaml to the new loop-back address. Of course you can set the rpc address to if you want to listen on all interfaces.
  • Then from the copied Cassandra we run sstableloader from the command line as follows,
./sstableloader -d <path to generated sstables>
  • Note that the path should end with /keyspace_name/columnfamily_name (eg: .../CDRs/Events)

Using JMX bulk load

You can also use JMX to bulk load Cassandra from the generated sstables, using code I received on the Cassandra user mailing list from Brian Jeltema. The main method needs to be run with the path to the generated sstables, as above, as an argument.

Monday, June 11, 2012

Running Cassandra in a Multi-node Cluster

This post gathers the steps I followed in setting up a multi-node Apache Cassandra cluster. I referred to the Cassandra wiki and the DataStax documentation in setting up my cluster. The procedure is expressed below in detail, sharing my experience.
  1. Setting up first node
  2. Adding other nodes
  3. Monitoring the cluster - nodetool, jConsole, Cassandra GUI

I used Cassandra 1.1.0 and Cassandra GUI cassandra-gui-0.8.0-beta1 (as the older release had problems showing data) on Ubuntu.

Setting up first node

Open cassandra.yaml, which is in 'apache-cassandra-1.1.0/conf', and change:
listen_address: localhost --> listen_address: <node IP address>
rpc_address: localhost --> rpc_address: <node IP address>
- seeds: "" --> - seeds: "<node IP address>"
The listen address defines where the other nodes in the cluster should connect. So in a multi-node cluster it should be changed to the address of the node's Ethernet interface.
The rpc address defines where the node listens for clients. It can be the same as the node IP address, or set to the wildcard if we want to listen for Thrift clients on all available interfaces.
The seeds act as the communication points. When a new node joins the cluster, it contacts the seeds and gets the information about the ring and the basics of the other nodes. So in multi-node, the seed needs to be changed to a routable address as above, which makes this node a seed.

Note: In a multi-node cluster it is better to have multiple seeds. Though using one node as a seed does not create a single point of failure, it will delay the spreading of status messages around the ring. A list of nodes to act as seeds can be defined as follows,

- seeds: "<ip1>,<ip2>,<ip3>"
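Putting the pieces together, the changed part of cassandra.yaml on a seed node would look roughly like the fragment below. The addresses are placeholders, and in the 1.1.x file the seeds list sits under the seed_provider section.

```yaml
listen_address:        # this node's Ethernet interface address
rpc_address:           # or to listen on all interfaces

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: ","
```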

For the moment let's go forward with the previous configuration with a single seed. Now we can simply start Cassandra on this node, which will run perfectly without the rest of the nodes. Let's imagine our cluster needs increased performance and more data is being fed into the system. So it's time to add another node to the cluster.

Adding other nodes

Simply copy the Apache Cassandra folder of the first node to each of the new nodes. Now replace listen_address: <node IP address> and rpc_address: <node IP address> as relevant for each node (no need to touch the seeds section). When we start each node now, it will join the ring, using the seeds as hubs of the gossip network. The logs will show information about the other nodes in the cluster as it discovers them.

Monitoring the cluster

Nodetool - This is shipped with Apache Cassandra. We can run it from inside the Cassandra folder with bin/nodetool. With the ring command of nodetool we can check information about the ring as follows.

bin/nodetool -host <node IP address> ring

It has a lot more useful functionality, which can be referred to at the site.

jConsole - We can use this to monitor memory usage, thread behavior etc. It is very helpful for analysing the cluster in detail and fine-tuning the performance. This guide also carries good information on using jConsole, if you are not familiar with it already.

Cassandra GUI - This satisfies the need to visualize the data inside the cluster. With this we can see the content distributed across the cluster in one place.

Tuesday, May 15, 2012

Configuration for Using Intellij IDEA to Develop Apache Wookie

Here I am sharing the steps I followed in configuring IntelliJ IDEA to develop Apache Wookie. As the site only gives guidelines for starting with the Eclipse IDE, I needed to configure IntelliJ IDEA to be aligned with the practices and conventions of Wookie before starting coding for GSoC2012.

For the project we have to use SVN as the version control system, which IntelliJ facilitates by default. We need to do the following setups in addition.
  1. Dependency management
  2. Code style and template
  3. Debugging
Let’s look at how to deal with these.

Dependency management with Apache Ivy

Wookie uses Ivy as its dependency manager. To easily compile the Wookie project inside IntelliJ, we need to have the IvyIDEA plugin installed. To do so, go to File>Settings>IDE Settings, select IvyIDEA from the available plugins as follows, and install it.


Now, once you open Wookie in IntelliJ, it will auto-detect the ivy.xml file and resolve dependencies. In the Tools>IvyIDEA menu you can resolve any dependencies that were not auto-detected. Further details on this plugin can be read at the IntelliJ plugin repository.

Code style and template

These steps are taken to conveniently maintain the consistency and quality of the code. The etc/Intellij folder of the Wookie distribution includes code_style.xml and code_template.xml, which need to be placed inside the IntelliJ configuration as follows.
Place code_style.xml at <username>/<Intellij folder>/config/codestyles.
Note: To have the code style correct, it is better to use the eclipse-code-formatter plugin inside IntelliJ, as it allows you to use the original code style, which is distributed to be used in Eclipse.
Once this is done and the code style is active for the project, Ctrl+Alt+L will reformat code accordingly.

Place code_template.xml at <username>/<Intellij folder>/config/templates. After restarting the IDE you can see the available templates with the Ctrl+J key combination and easily use them.


Debugging

Start the Wookie server in debug mode using the command,
ant -Djvmargs="-Xdebug -Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=n" run
To connect IntelliJ to this server running on your local machine, go to Run>Edit Configurations in the IDE. In the remote configuration, do the following.
You can now set debug points and contribute to improve Wookie.

Wednesday, April 25, 2012

GSoC2012 with Apache Wookie

Today I got the news that the proposal I submitted to GSoC2012 has been accepted. I consider it a great achievement and am so excited to make the project a success. With the nature of the program, it is no wonder anyone gets excited about it. Firstly, having the chance to work for a recognized organization (in my case Apache) and the global reputation a GSoCer gains are so motivational. Getting started with a new team, getting to know them, and working remotely also add good experience to life and, as I see it, high professional value. It's amazing to work with the community. It is also a great chance to broaden my technical horizons while having hands-on guidance from an expert in the area.

I'm also glad to share about my project, which is to implement the 'W3C XML Digital Signatures for Widgets' specification in Apache Wookie. Computer security has become my favorite field, followed by big data, after the experience of my internship at WSO2 Lanka (Pvt) Ltd. So, no wonder, when I saw this idea on the page I knew it was ideal for me. Wookie is an interesting project currently at the incubating stage at Apache. It is based on the W3C widget specification and also includes widgets that use extended APIs such as OpenSocial and Google Wave Gadgets. It will soon graduate with its passionate developer community, and I am glad I can contribute actively. My mentor Scott Wilson is a very friendly and helpful person who guided me in submitting a better proposal and who will be guiding me throughout GSoC2012. Here is a blog post by Scott mentioning Wookie's acceptance to the Apache Incubator.

Hoping for a fruitful time ahead and to become a GSoCer while contributing to the open source world! Thank you GSoC organizers, the Wookie community, the Department of Computer Science and Engineering, University of Moratuwa, and WSO2 for the knowledge I gathered, and everyone who helped me on my way!

Monday, February 06, 2012

Implementing SAML to XACML

Before Implementing SAML


This is how a XACML request looks when it arrives at the PDP (Policy Decision Point) to be evaluated.
<Request xmlns="urn:oasis:names:tc:xacml:2.0:context:schema:os">
    <Attribute AttributeId="urn:oasis:names:tc:xacml:1.0:subject:subject-id" .../>
    <Attribute AttributeId="urn:oasis:names:tc:xacml:1.0:resource:resource-id" .../>
    <Attribute AttributeId="urn:oasis:names:tc:xacml:1.0:action:action-id" .../>
</Request>
Basically it states who (the Subject) wants to access which resource and what action it wants to perform on the resource. The PDP trusts that the request has not been altered between being sent and received, evaluates the request against the existing enabled policies, and replies with a decision, which will be as follows.
<Result ResourceId="http://localhost:8280/services/echo/echoString">
    <StatusCode Value="urn:oasis:names:tc:xacml:1.0:status:ok"/>
</Result>
Again, there is no guarantee for the party using this response that the decision has not been altered between being sent from the PDP and being received.
In order to achieve the security of XACML requests and responses in server-to-server communication, the SAML profile for XACML is defined by OASIS. This takes the system security to a higher level by allowing the fine-grained authorization provided by XACML to be signed.

After Implementing SAML

Following is how the previous XACML request looks after being wrapped into a XACMLAuthzDecisionQueryType, generated using the OpenSAML 2.0.0 library, which supports the SAML profile of XACML as declared in 2004. The diagram shows the basic structure of a XACMLAuthzDecisionQueryType.

Following is a sample XACMLAuthzDecisionQuery.
<xacml-samlp:XACMLAuthzDecisionQueryType InputContextOnly="true" IssueInstant="2011-10-31T06:44:57.766Z" ReturnContext="false" Version="2.0" xmlns:xacml-samlp="urn:oasis:names:tc:xacml:2.0:profile:saml2.0:v2:schema:protocol">
  <saml:Issuer SPProvidedID="SPPProvierId" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"></saml:Issuer>
  <ds:Signature xmlns:ds="">
    <ds:CanonicalizationMethod Algorithm=""/>
    <ds:SignatureMethod Algorithm=""/>
    <ds:Reference URI="">
      <ds:Transform Algorithm=""/>
      <ds:Transform Algorithm="">
        <ec:InclusiveNamespaces PrefixList="ds saml xacml-context xacml-samlp" xmlns:ec=""/>
      </ds:Transform>
      <ds:DigestMethod Algorithm=""/>
    </ds:Reference>
  </ds:Signature>
  <xacml-context:Request xmlns:xacml-context="urn:oasis:names:tc:xacml:2.0:context:schema:os">
    <xacml-context:Subject SubjectCategory="urn:oasis:names:tc:xacml:1.0:subject-category:access-subject">
      <xacml-context:Attribute AttributeId="urn:oasis:names:tc:xacml:1.0:subject:subject-id" DataType="">
        <xacml-context:AttributeValue>admin</xacml-context:AttributeValue>
      </xacml-context:Attribute>
    </xacml-context:Subject>
    <xacml-context:Resource>
      <xacml-context:Attribute AttributeId="urn:oasis:names:tc:xacml:1.0:resource:resource-id" DataType="">
        <xacml-context:AttributeValue>http://localhost:8280/services/echo/echoString</xacml-context:AttributeValue>
      </xacml-context:Attribute>
    </xacml-context:Resource>
    <xacml-context:Action>
      <xacml-context:Attribute AttributeId="urn:oasis:names:tc:xacml:1.0:action:action-id" DataType="">
        <xacml-context:AttributeValue>read</xacml-context:AttributeValue>
      </xacml-context:Attribute>
    </xacml-context:Action>
    <xacml-context:Environment/>
  </xacml-context:Request>
</xacml-samlp:XACMLAuthzDecisionQueryType>

As you can see, it carries a lot of information related to the content of the request, such as who issued it and when, the signature with the X509Certificate, and the XACML request itself. Data integrity can be preserved in this way.

After executing the request and obtaining the response from the PDP, the response is also sent secured with a signature. The diagram shows the structure of a basic SAML response.

Following is a sample SAML response that carries a XACML response.

<samlp:Response IssueInstant="2011-10-31T06:49:51.013Z" Version="2.0" xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol">
  <saml:Issuer SPProvidedID="SPPProvierId" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"></saml:Issuer>
  <ds:Signature xmlns:ds="">
    <ds:CanonicalizationMethod Algorithm=""/>
    <ds:SignatureMethod Algorithm=""/>
    <ds:Reference URI="">
      <ds:Transform Algorithm=""/>
      <ds:Transform Algorithm="">
        <ec:InclusiveNamespaces PrefixList="ds saml samlp xacml-context xacml-saml" xmlns:ec=""/>
      </ds:Transform>
      <ds:DigestMethod Algorithm=""/>
    </ds:Reference>
  </ds:Signature>
  <saml:Assertion IssueInstant="2011-10-31T06:49:51.008Z" Version="2.0" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
    <saml:Issuer SPProvidedID="SPPProvierId"></saml:Issuer>
    <saml:Statement xmlns:xacml-saml="urn:oasis:names:tc:xacml:2.0:profile:saml2.0:v2:schema:assertion" xmlns:xsi="" xsi:type="xacml-saml:XACMLAuthzDecisionStatementType">
      <xacml-context:Response xmlns:xacml-context="urn:oasis:names:tc:xacml:2.0:context:schema:os">
        <xacml-context:Result ResourceId="http://localhost:8280/services/echo/echoString">
          <xacml-context:Status>
            <xacml-context:StatusCode Value="urn:oasis:names:tc:xacml:1.0:status:ok"/>
          </xacml-context:Status>
        </xacml-context:Result>
      </xacml-context:Response>
    </saml:Statement>
  </saml:Assertion>
</samlp:Response>
The XACML response is wrapped into a SAML statement which is included in a SAML assertion that is again wrapped by a SAML response.I have only signed the response according to the context and included only one assertion. We can separately sign both the assertion and response according to the spec and include more assertions in one response. Also it is possible to send the relevant XACML request inside the response and lot more options are there according to the spec. With OpenSAML we can get most of them into action.