Monday, 19 March 2018

Quick and clever data sparsity / density tool

Hi all,

Just a quick post today to share a clever python tool I’m using for data sparsity / density analysis.

  • Data sparsity : number or percentage of cells that are empty.
  • Data density : number or percentage of cells that contain information.
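To make these two definitions concrete, here is a tiny sketch in plain Python (the sample table is made up, with None standing for an empty cell):

```python
# Made-up table: rows of cells, None = empty cell
table = [
    [1.0,  "FR", None],
    [2.5,  None, None],
    [None, "DE", 3.1],
]

cells = [cell for row in table for cell in row]
empty = sum(1 for cell in cells if cell is None)

sparsity = empty / len(cells)   # share of empty cells
density = 1 - sparsity          # share of filled cells

print(f"sparsity = {sparsity:.0%}, density = {density:.0%}")
```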


It’s quite common to find tools or libraries that aim to analyse data and deliver indicators. What I wanted was a data visualization tool that displays a meaningful picture of data density / sparsity.

Here comes “missingno”, developed by Aleksey Bilogur, a really talented data analyst from NYC, and available on GitHub.

No more bla-bla, here is what you can get with simple python code within your Jupyter editor.


You can clearly see the amount of data available for each column. Note the nice sparkline on the right, showing “missing data bursts”.

Different plots are available; have a look at this “heatmap” showing nullity correlation : how the presence or absence of one variable correlates with the presence or absence of another.


Bars, GeoPlot and Dendrogram are also available.

Definitely a must-have tool for all python and data enthusiasts.

Tuesday, 2 May 2017

Reading online SDMX data from R

Hi all,

It has been a long time since my last post, but I recently had to query online SDMX data from R.

Nothing is easier than playing with the RSDMX package.

Just install this package : install.packages("rsdmx")
Then just play with it.

Here is a short example with a query to one ECB SDMX datasource.
 # Install and load the package  
 install.packages("rsdmx")  
 library(rsdmx)  

 # First, set a proxy if you are behind corporate walls  
 # Sys.setenv(http_proxy = "http://your.proxy:port")   # placeholder, only if needed  

 # Then store your custom url, the one having the dataset you need  
 myUrl <- ""  

 # Read and parse the data  
 dataset <- readSDMX(myUrl)  

 # Print the data. Tadam, your dataset is in a dataframe !   
 stats <- as.data.frame(dataset)  
 head(stats)  

 # Now, do intelligent stuff ...  

Now, the data.
Really easy and time saving.

Monday, 15 December 2014

Easy query to SDMX data

Hi all,

As usual, too many things to share and too little time to write.
Well, this time, I'm doing it.

I'm currently working on a Data Federation / Data Virtualization project, aiming at virtualizing data coming from different horizons : public data from web services, commercial data coming from data feeds, internal and relational data etc ...

One of my data sources is the Statistical DataWarehouse (SDW) from the European Central Bank (ECB). That's funny because 12 or 13 years ago, I was feeding that ECB Warehouse while working for the French central bank (one of the NCBs).
This warehouse is open to everyone and you will find data about employment, production ... well, a lot of economic topics well organized into "Concepts" :
  • Monetary operations,
  • Exchange rates,
  • Payment and securities trading
  • Monetary statistics and lots of funny things like this ...
This warehouse can be queried with the use of the ECB front end, located here.
You can also query it by using the REST services. That's my preferred choice for data processing automation, and this article will develop this point.

Before querying the data, let's have a quick explanation about the SDMX format that is used by the ECB.

SDMX, theory

SDMX stands for Statistical Data and MetaData eXchange. This project started in 2002 and aims at giving a standard for statistical data and metadata exchange. Several famous institutions are at the origin of SDMX : the BIS, the ECB, Eurostat, the IMF, the OECD, the United Nations and the World Bank.
SDMX is an implementation of ebXML.
You will find a nice SDMX tutorial here, for the moment here is a quick model description :
  • Descriptor concepts : give sense to a statistical observation
  • Packaging structure : hierarchy for statistical data : observation level, group level, dataset level ...
  • Dimensions and attributes : dimensions for identification and description. Attributes for description only.
  • Keys : dimensions are grouped into key sequence and identify an item
  • Code lists : list of values
  • Data Structure Definition : description for structures

SDMX, by example

Here is an example of what SDMX data is. This is an excerpt of a much longer file.

As you can see, we have :
  • Metadata :
    • SeriesKey : giving the value of each dimension.
    • Attributes : definitions for this dataset.
  • Data
    • Observation dimension : time, in this example.
    • Observation value : the value itself.
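Since the screenshot does not travel well, here is a hand-written excerpt in the same spirit (the values are illustrative and the namespace declarations are omitted; the element names match the SDMX 2.1 generic format used later in this article) :

```xml
<message:GenericData>
  <message:DataSet>
    <generic:Series>
      <!-- Metadata -->
      <generic:SeriesKey>
        <generic:Value id="FREQ" value="Q"/>
        <generic:Value id="REF_AREA" value="FR"/>
        <!-- ... one Value per dimension ... -->
      </generic:SeriesKey>
      <generic:Attributes>
        <generic:Value id="OBS_STATUS" value="A"/>
      </generic:Attributes>
      <!-- Data -->
      <generic:Obs>
        <generic:ObsDimension value="2013-Q4"/>
        <generic:ObsValue value="1234.5"/>
      </generic:Obs>
    </generic:Series>
  </message:DataSet>
</message:GenericData>
```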

How to build a query to ECB data

There is nothing easier for that : use the REST web services provided by the ECB.
These web services will allow you to :
  • Query metadata,
  • Query structure definitions,
  • Query data : this is the interesting part for this article !
The REST endpoint is here :
And you can learn more about these services here

But let me write a quick overview now. It is very simple once you understand it.
Let's write a REST query for some ECB data.

First you need the base service url.
  • Easy, it is :
Then, you need to target a resource. Easy, it is "data".
  • Now you have the url :
Ok, let's go further, now we need to specify an "Agency". By default it is ECB, but in our example let's go for EUROSTAT.
  • We have the url :
And we continue : now we need a series name. For this example, let's go with IEAQ.
  • The url path now ends with : EUROSTAT,IEAQ
Simple so far; now let's do the interesting part : the key path !
The combination of dimensions allows statistical data to be uniquely identified. This combination is known as series key in SDMX.
Look at the picture below; it shows you how to build a series key for targeting data in our ongoing example.

When looking at the metadata from the IEAQ series, we see we need 13 keys to identify an indicator. These keys range from easy ones like FREQ (frequency) or REF_AREA (country) to complex (business) keys like ESA95TP_ASSET or ESA95TP_SECTOR.
Now we need a value for each of these dimensions, and then we "stack" these values with dots (don't forget to follow the order given by the metadata shot).
We now have our key : Q.FR.N.V.LE.F2M.S1M.A1.S.1.N.N.Z
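Putting the pieces together in a few lines of Python (the base url is from memory, so double-check it against the ECB web service documentation; the rest comes straight from the steps above) :

```python
# Assumed SDW REST base url -- verify against the ECB web service docs
BASE = "https://sdw-wsrest.ecb.europa.eu/service"

resource = "data"        # we want data, not structure definitions
agency = "EUROSTAT"      # defaults to ECB if omitted
flow = "IEAQ"            # the series (dataflow) name
key = "Q.FR.N.V.LE.F2M.S1M.A1.S.1.N.N.Z"  # the 13 dotted dimension values

url = f"{BASE}/{resource}/{agency},{flow}/{key}"
print(url)
```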

Another way to understand this is to consider the keys as coordinates. By choosing a value for each key, you build coordinates, like lat/long, that identify and locate a dataset.
I chose the cube representation below to illustrate the concept of keys as coordinates (of course, a dataset can have more keys than a cube has sides ...). You can see how a flat metadata representation is translated into a multidimensional structure.

Now, query some data

To query the data, nothing difficult : simply paste the complete URL into a browser and, after a short delay, you'll see the data.

Here is the top of the xml answer, showing some metadata.

And here is the DATA ! (2 snapshots for simplicity, but metadata and data come within the same xml answer).
In red, the data. In green, the time dimension. In blue, the value !

Query and process the data !

Ok, calling for data from a web browser is nice but not really useful : the data stays in the browser, and we need to parse and transform it in order to set up a dataset ...

Here I will introduce some shell code I used in a larger project, where I had to run massive queries against the ECB SDW and build a full data streaming process.

The command below will allow you to run a query and parse the data for easy extraction. I'm using the powerful xmlstarlet tool here.

The command : 

curl -g ",IEAQ/Q.FR.N.V.LE.F2M.S1M.A1.S.1.N.N.Z" \
-s | xmlstarlet sel -t -m "/message:GenericData/message:DataSet/generic:Series/generic:Obs" \
-n -v "generic:ObsDimension/@value" -o "|" \
-v "generic:ObsValue/@value" -o "|" \
-v "generic:Attributes/generic:Value[@id='OBS_STATUS']/@value" -o "|" \
-v "generic:Attributes/generic:Value[@id='OBS_CONF']/@value" -o "|"

The output, in shell (easy to pipe into some txt files ...) :
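If you would rather stay in Python, the same extraction can be sketched with the standard library (the XML below is a toy excerpt with placeholder namespace URIs; a real answer declares the official SDMX 2.1 ones; I only keep the OBS_STATUS attribute here) :

```python
import xml.etree.ElementTree as ET

# Toy excerpt of a GenericData answer; namespace URIs are placeholders
xml_doc = """<message:GenericData xmlns:message="urn:msg" xmlns:generic="urn:gen">
  <message:DataSet>
    <generic:Series>
      <generic:Obs>
        <generic:ObsDimension value="2013-Q4"/>
        <generic:ObsValue value="1234.5"/>
        <generic:Attributes>
          <generic:Value id="OBS_STATUS" value="A"/>
        </generic:Attributes>
      </generic:Obs>
    </generic:Series>
  </message:DataSet>
</message:GenericData>"""

def local(tag):
    """Drop the namespace part of a tag name."""
    return tag.split("}", 1)[-1]

rows = []
root = ET.fromstring(xml_doc)
for obs in root.iter():
    if local(obs.tag) != "Obs":
        continue
    period = value = status = None
    for child in obs:
        name = local(child.tag)
        if name == "ObsDimension":
            period = child.get("value")
        elif name == "ObsValue":
            value = child.get("value")
        elif name == "Attributes":
            for v in child:
                if v.get("id") == "OBS_STATUS":
                    status = v.get("value")
    rows.append((period, value, status))

print(rows)  # [('2013-Q4', '1234.5', 'A')]
```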


The ECB SDW is massive. It contains loads and loads of series, datasets etc ...
Have a look at this partial inventory I did recently.
As you can see, the amount of data is substantial.
My best recommendation, at this point, would be to first :
  • read about the ECB SDW metadata,
  • read about the ECB SDW structures,
  • learn how to build complex queries (I only gave a very simple example here).

Here is, once again, the most important documentation about the SDW and how it is organized :

Thursday, 6 November 2014

Easily compare AWS instance prices

Hi all,

It's been a while ...
I know, lots of work. Currently working on data federation with JBoss Teiid as well as with Composite Software.
Will write some articles about that ...

For the moment, I want to share a very handy webpage that aims at comparing AWS instance prices. You can modify your variables (location, memory, cpu etc ...) and the prices will change.
Really handy.

Thursday, 29 May 2014

Eclipse Trader !

Hi all,

I’m currently looking at open financial / stock data feeds. There are a number of things available here ! Want to build a Reuters- or Bloomberg-like terminal ? Have a look at this Eclipse plugin !

Features are great :

  • Realtime Quotes
  • Intraday Charts
  • History Charts
  • Technical Analysis Indicators
  • Price Patterns Detection
  • Financial News
  • Level II (Book) Market Data
  • Trading Accounts Management
  • Integrated Trading




Hi all,

I love this site :

Great dataviz for financials. Realtime (or just about).


And some quick dataviz about data grabbed here : Sheryl Sandberg trading Facebook shares over first half of 2014…


Saturday, 10 May 2014

Load ElasticSearch with tweets from Twitter

Hi all,

Today, I’m gonna give you a quick overview about how to load an ElasticSearch index (and make it searchable !) with data coming from Twitter. That’s pretty easy with the ElasticSearch River ! You must be familiar with ElasticSearch in order to fully understand what is coming. To learn more about the river, please read this.

As usual, a quick schema to explain how it works.



Pretty simple, as it comes as a plugin. Just type and run (for release 2.0.0) :

  • bin/plugin -install elasticsearch/elasticsearch-river-twitter/2.0.0

Then, if all went fine, you should have your plugin installed.

I highly recommend installing another plugin called Head. This plugin will allow you to fully use and manage your ElasticSearch system within a neat and comfortable GUI. Here is a pic :


Register to Twitter

Now you need to be registered on Twitter in order to be able to use the river. Please log in to : and follow the process. The outcome of this process is to obtain four credentials needed to use the Twitter API :

  • Consumer key (aka API key),
  • Consumer secret (aka API secret),
  • Access token,
  • Access token secret.

These tokens will be used in the query when creating the river (see below).


A simple river

Ok, here we go. Just imagine I want to load, in near real time and continuously, a new index with all the tweets about …. Ukraine (since today, May 9 2014, the cold war is about to start again …).

That’s pretty simple : just connect to your shell, make sure ElasticSearch is running and that you have all the necessary credentials. Then send the PUT query below. Don’t forget to type your OAuth credentials (mine are masked as ****** below).

curl -XPUT http://localhost:9200/_river/my_twitter_river/_meta -d '
{
    "type": "twitter",
    "twitter": {
        "oauth": {
            "consumer_key": "******************",
            "consumer_secret": "**************************************",
            "access_token": "**************************************",
            "access_token_secret": "********************************"
        },
        "filter": {
            "tracks": "ukrain",
            "language": "en"
        }
    },
    "index": {
        "index": "my_twitter_river",
        "type": "status",
        "bulk_size": 100,
        "flush_interval": "5s"
    }
}'
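Bracket-counting inside a long curl payload is error-prone, so here is a small Python sketch that builds and pretty-prints the same body before you paste it into the PUT query (the credentials are placeholders) :

```python
import json

# Same river definition as the curl payload; ****** are placeholders
river = {
    "type": "twitter",
    "twitter": {
        "oauth": {
            "consumer_key": "******",
            "consumer_secret": "******",
            "access_token": "******",
            "access_token_secret": "******",
        },
        "filter": {"tracks": "ukrain", "language": "en"},
    },
    "index": {
        "index": "my_twitter_river",
        "type": "status",
        "bulk_size": 100,
        "flush_interval": "5s",
    },
}

body = json.dumps(river, indent=4)
print(body)  # paste this as the -d payload of the PUT query
```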

After submitting this query, you should see something like :

[2014-05-09 10:58:58,533][INFO ][cluster.metadata         ] [Rex Mundi] [_river] creating index, cause [auto(index api)], shards [1]/[1], mappings []
[2014-05-09 10:58:58,868][INFO ][cluster.metadata         ] [Rex Mundi] [_river] update_mapping [my_twitter_river] (dynamic)
[2014-05-09 10:58:59,000][INFO ][river.twitter            ] [Mogul of the Mystic Mountain] [twitter][my_twitter_river] creating twitter stream river
{"_index":"_river","_type":"my_twitter_river","_id":"_meta","_version":1,"created":true}tor@ubuntu:~/elasticsearch/elasticsearch-1.1.1$ [2014-05-09 10:58:59,111][INFO ][cluster.metadata         ] [Rex Mundi] [my_twitter_river] creating index, cause [api], shards [5]/[1], mappings [status]
[2014-05-09 10:58:59,381][INFO ][river.twitter            ] [Mogul of the Mystic Mountain] [twitter][my_twitter_river] starting filter twitter stream
[2014-05-09 10:58:59,395][INFO ][twitter4j.TwitterStreamImpl] Establishing connection.
[2014-05-09 10:58:59,796][INFO ][cluster.metadata         ] [Rex Mundi] [_river] update_mapping [my_twitter_river] (dynamic)
[2014-05-09 10:59:31,221][INFO ][twitter4j.TwitterStreamImpl] Connection established.
[2014-05-09 10:59:31,221][INFO ][twitter4j.TwitterStreamImpl] Receiving status stream.

A quick explanation about the river creation query :

  • A river called “my_twitter_river” is created (registered in the internal “_river” index)
  • 2 filters are used :
    • tracks : the keyword(s) to track
    • language : tweet languages to track
  • An index called “my_twitter_river” is created
  • A type (aka a table) called “status” is created
  • Tweets will be indexed :
    • once a bulk of them has been accumulated (100 by default)
      • OR
    • every flush interval period (5 seconds by default)

By now, you should see the data coming into your newly created index.


A first query

Now, time to run a simple query on this Ukraine-related data.

For now, I will only send a basic query based on a keyword, but stay tuned because I will soon demonstrate how to create analytics on these tweets …

Simple query to retrieve tweets having the word “Putin” in it :

curl -XPOST http://localhost:9200/_search -d '
{
  "query": {
    "query_string": {
      "query": "Putin",
      "fields": ["text"]
    }
  }
}'
… and you’ll have something like :