Get Noticed 2017 Neo4j

Introduction to graph databases

I’ve spent almost my entire career working with relational databases (RDBs from here on), and more precisely with SQL Server. Honestly, I’ve never complained about that. They work fine, integration with object-oriented languages is quick thanks to the many ORMs (like Entity Framework or Dapper), and they’re pretty easy to learn no matter how experienced a developer you are. Sounds like a perfect solution, right? I mean, think about all the companies and projects that are still running. What percentage of them use an RDBMS? 90%? I’d say even more. But that doesn’t mean we shouldn’t look for different solutions, especially if our domain calls for them.

 

The problem with relational databases

So, back to the question above: are RDBs perfect? I’d say they almost were at the time they were introduced. We need to remember that our industry, and therefore its requirements, keeps evolving. We started with desktop apps that did simple things and didn’t have to store data. But soon that need appeared, so we used the file system for that purpose, and so on. Finally, relational databases were introduced to the industry, which helped a lot. Storing data was no longer a problem and, just as importantly, neither was querying it. Just look at SQL. We can achieve a lot using only about 10 clauses (I once heard that SQL was designed for secretaries, which is why it’s so simple, no offense of course), and that’s awesome. But the requirements quickly became more and more demanding. We migrated from desktop apps to web servers and then to the cloud. And that was the point when someone said „scalability”. We live in times when „virality” (an app becoming very popular in a very short time, like Pokémon GO did) is possible for every app, so we need to be prepared to scale if necessary. Today the cloud is a kind of „weapon” here: given a proper architecture, we can scale without much trouble. But a few years ago that was a serious problem, and there was one specific bottleneck: relational databases. They were not designed for that kind of operation, simply because in the 80s and 90s nobody thought about such a problem. To be clear, I’m not saying that it’s impossible to scale an RDB, but from what I’ve read about it, it’s not the easiest thing to do, if you know what I mean.
But besides scalability, there’s one more thing that people hate about RDBs: schemas. No doubt, in many cases having all these constraints is a blessing, since it’s harder to mess something up. But look at the other side. How many times have you been pissed off because you struggled with migrations? It’s no big deal when we add a column or a table, but the „fun part” begins when it comes to changing a column from nullable to non-nullable. We need to write SQL code with procedures to map this… this is hell. And that’s only one example. There are problems with migrations created „concurrently” by two different developers on their own branches and applied one after another, problems with tracking changes in the schema, problems with security levels when applying migrations using 3rd-party tools like Octopus Deploy, and many others.
Last but not least: data structure. In most cases, having tables is fine and it’s easy to design a proper physical model for the domain. But not always. A simple example is hierarchies. To create a hierarchy in SQL, we need a foreign key that points to the same table and then, in the query, a JOIN of the table to itself. That’s neither the cleanest nor the easiest code, even with ORMs. Honestly, I once tried to query an entire hierarchy with Entity Framework and I totally failed (btw if you know how to do it, let me know in a comment please!). Once again, hierarchies are just one example here.
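To show what the self-referencing foreign key and the self-JOIN look like in practice, here is a minimal sketch that walks a whole hierarchy with a recursive common table expression. It uses Python with the built-in sqlite3 module rather than SQL Server or Entity Framework, and the `employee` table, its columns and the sample names are all made up for illustration.

```python
import sqlite3

# Hypothetical schema: every employee row points at its boss in the same table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        boss_id INTEGER REFERENCES employee(id)  -- self-referencing foreign key
    );
    INSERT INTO employee (id, name, boss_id) VALUES
        (1, 'Alice', NULL),
        (2, 'Bob',   1),
        (3, 'Carol', 1),
        (4, 'Dave',  2);
""")

# The recursive CTE starts at the root and keeps joining the table to itself.
HIERARCHY_SQL = """
    WITH RECURSIVE subordinates(id, name, depth) AS (
        SELECT id, name, 0 FROM employee WHERE boss_id IS NULL
        UNION ALL
        SELECT e.id, e.name, s.depth + 1
        FROM employee AS e
        JOIN subordinates AS s ON e.boss_id = s.id
    )
    SELECT name, depth FROM subordinates ORDER BY depth, name;
"""

hierarchy = conn.execute(HIERARCHY_SQL).fetchall()
for name, depth in hierarchy:
    print("  " * depth + name)  # indent each name by its level in the hierarchy
```

It works, but the query is noticeably harder to read than the one-sentence question it answers („who reports to whom?”).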
So, I hope it’s clearer now: RDBs are not perfect. In many cases they are the best choice, but not always. And that’s when NoSQL comes into play.

 

NoSQL

NoSQL came as an alternative to traditional databases and pretty quickly became quite popular. The two main differences between NoSQL databases and RDBs are:

  • ability to scale easily
  • having no schema

That helps a lot at a time when a web application may handle millions of operations per second and its domain may change a lot over time. But what do NoSQL databases look like? Well, we can point to four types. The first three are:

  • document databases – we store the data as documents in a text format, usually JSON or XML. Examples here are MongoDB, RavenDB and CouchDB.
  • key-value databases – this concept is also simple. We store data as key-value pairs. Once you have a key, you can get the value, which might be simple data (like an integer or a date) or a more complex object. Examples here are Redis and Riak.
  • column-family databases – the idea is that we have a key which can address many column families. Each column family is a group of columns that are related to each other (conceptually in the domain, not physically). An example is Cassandra.

 

Now, all the above types have something in common. They all treat the data as one unit. In other words, they aggregate the data. I hope you can spot that. One document can be an invoice with all its products. A value in a key-value store can also hold a complex object that should not be split into smaller pieces. Same story with a column family. That’s the reason why they are categorized as aggregate-oriented databases. But there’s one more type which represents the opposite approach.
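To make the idea of an aggregate more concrete, here is a small Python sketch. The invoice, its fields and the key format are made up for illustration; JSON text stands in for a document and a plain dict stands in for a key-value store, so this is not the API of any of the databases mentioned above.

```python
import json

# Hypothetical invoice: the header and all of its lines travel together as one unit.
invoice = {
    "number": "INV-001",
    "customer": "ACME",
    "lines": [
        {"product": "keyboard", "quantity": 2, "unit_price": 30},
        {"product": "mouse", "quantity": 1, "unit_price": 15},
    ],
}

# Document style: the whole aggregate is serialized as a single JSON document.
document = json.dumps(invoice)

# Key-value style: a dict stands in for the store; the value is the whole aggregate.
store = {"invoice:INV-001": document}

# One lookup by key returns everything, with no joins across tables.
loaded = json.loads(store["invoice:INV-001"])
total = sum(line["quantity"] * line["unit_price"] for line in loaded["lines"])
print(total)
```

The trade-off is that anything cutting across aggregates (for example „all invoices that contain a mouse”) needs extra work, because the store only knows about whole units.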

 

Graph databases

As you probably expect, graph databases represent data as… a graph (genius). Each graph consists of many objects, each of which is either a node or a relationship, just like in the mathematical model. Let’s start with the nodes.

Nodes are the objects that hold our data, and they consist of two things. The first is the label, which assigns each node to a specific group. Examples might be „user”, „product” or „character”. So, just like in relational databases, we can query only a chosen group, but instead of pointing to a table we filter by label name. The second part of a node is its set of properties, which describe the node’s features. Below is an example set of properties for a node labeled „user”:

 


{
login: "user",
password: "why_im_not_hashed?",
role: "admin"
}

 

The second kind of object is the relationship. Each one connects nodes and has a direction. Like nodes, relationships can be named to make querying much easier. The only difference is in the naming: they are usually described by a verb, not a noun. Examples are „likes”, „drives” and „is_the_boss_of”. So far this looks pretty cool, but here comes one great thing about graph databases that I just love. Relationships are first-class citizens, not just lines connecting the nodes. In other words, they are just as important as the nodes. What does that imply? First, like nodes, each relationship can have its own set of properties. The most common are „cost”, „weight” and „distance”, but you can add whatever you want. The second implication is that you can query relationships, just as you query nodes.
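The model described above (labeled nodes plus named, directed relationships that carry their own properties) can be sketched in a few lines of Python. This is only an in-memory illustration, not how Neo4j stores or queries data; the class names and the sample data are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                      # the group, e.g. "user"
    properties: dict = field(default_factory=dict)  # the node's features

@dataclass
class Relationship:
    name: str                                       # usually a verb, e.g. "drives"
    start: Node                                     # direction: start -> end
    end: Node
    properties: dict = field(default_factory=dict)  # e.g. cost, weight, distance

alice = Node("user", {"login": "alice", "role": "admin"})
bob = Node("user", {"login": "bob", "role": "guest"})
car = Node("product", {"name": "car"})
nodes = [alice, bob, car]

relationships = [
    Relationship("likes", alice, bob, {"weight": 0.9}),
    Relationship("drives", bob, car, {"distance": 120}),
]

# Filter nodes by label, much like picking a table in SQL.
users = [n for n in nodes if n.label == "user"]

# Relationships are first-class: query them by name and by their own properties.
long_drives = [r for r in relationships
               if r.name == "drives" and r.properties["distance"] > 100]

print([u.properties["login"] for u in users])
print([(r.start.properties["login"], r.end.properties["name"]) for r in long_drives])
```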

 

What does a graph database look like? (https://neo4j.com/developer/guide-importing-data-and-etl/)

 

When to use graph databases?

As I wrote in the first paragraph, it all depends on the domain. For some business cases, using an RDB might be very hard and also inefficient (I mean the poor performance that comes from using the wrong data structure). There are a lot of „classic” examples here. The most popular is a social application like LinkedIn, which is mostly based on personal networks where each node is a person and there are plenty of different relationships like „knows” or „works_with” (of course that’s just my guess, since I don’t work there). That was easy to notice in previous versions of their web application, which showed this personal network on the user’s profile view:

 

http://grow.gardenmediagroup.com/bid/259025/Linkedin-s-New-Profile-Own-the-Change

 

Can you imagine how much harder it would be to represent such complex relationships using tables? Well, I think you can 😉 But besides social networks, there are many other scenarios, like anti-fraud search engines based on a graph structure or real-time product recommendations. The last case is the one that personally convinced me to use a graph database in my „Get Noticed” project.

 

Why did graph databases become popular?

There are several reasons why graph databases are so cool. The first one is intuitiveness, which comes directly from the graph data structure. If you have ever created a relational database from scratch, you know it’s neither easy nor fast. We need to define the tables first, then think about the columns for each one and set some of them as foreign keys. We also need to create junction tables for any many-to-many connections, and after this longish process of creating the conceptual model we need to transform it into the physical model, which is slightly different. That doesn’t happen with graph databases: once you draw your model on paper, it’s already the physical model! There are no transformations in between. Besides that, an existing database with this kind of structure is much easier to read and analyze.
Another reason is performance. I’m not talking about a single write or read here, but about a more specific scenario where we need to query data like the friends of my friends, and so on, five levels deep. One organization that admitted that using a graph database radically improved their performance is eBay:

 

https://www.slideshare.net/maxdemarzi/neo4j-in-depth
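The „friends of my friends” scenario mentioned above is essentially a depth-limited breadth-first traversal. Here is a minimal Python sketch of that idea, assuming a made-up `knows` adjacency map; it says nothing about how Neo4j or eBay actually implement it.

```python
from collections import deque

# Hypothetical "knows" relationships, stored as an adjacency map.
knows = {
    "me": ["ann", "bob"],
    "ann": ["cid"],
    "bob": ["cid", "dan"],
    "cid": ["eve"],
    "dan": [],
    "eve": ["fay"],
    "fay": [],
}

def friends_within(graph, start, max_depth):
    """Return everyone reachable from start in at most max_depth hops, with the hop count."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_depth:
            continue  # do not expand past the requested depth
        for friend in graph.get(person, []):
            if friend not in distance:  # visit every person only once
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]  # the starting person is not their own friend
    return distance

print(friends_within(knows, "me", 2))
print(friends_within(knows, "me", 5))
```

Each hop here follows a reference from one person straight to the next, while in a relational model every extra level usually means one more self-JOIN on the friendship table.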

 

Another great thing, which not every NoSQL database has, is support for ACID transactions. The last one I’d like to mention here is the productivity that comes with writing queries. No more JOINs, UNIONs and all that ugly stuff means that it’s much faster to write code that returns the proper results, and it takes way fewer lines. The question is: what does a query look like?

 

Neo4j & Cypher

Before we move on to the query language itself, it’s worth saying a few words about the database it was originally created for, Neo4j. It’s an open-source NoSQL graph database developed by Neo Technology and implemented in Java and Scala on top of the JVM. Neo4j was published to the world in 2007 and has kept evolving since then. There are two editions (Community and Enterprise), which offer:

(Community)

  • optional schema
  • ACID transactional support
  • high-performance

(Enterprise)

  • scalable clustering
  • fail-over
  • high-availability
  • live backups
  • comprehensive monitoring

We know what Neo4j is, so let’s go back to the question from the previous section. Besides the graph database, Neo Technology created a language designed specifically for querying this kind of data structure. They called it Cypher. It’s a declarative query language that works by pattern matching. But don’t think that makes it hard to understand and learn! On the contrary, it’s super clear and very readable, especially if you know SQL. I once read a great description of Cypher which said: „It’s an ASCII art query language”. Without further ado, here’s an example:

 


MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
WHERE movie.title STARTS WITH "T"
RETURN movie.title AS title, collect(actor.name) AS cast
ORDER BY title ASC LIMIT 10;
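For comparison, here is roughly what that query computes, written in plain Python over a tiny made-up data set. It only mirrors the logic of the query clause by clause; it does not talk to Neo4j, and the actor and movie pairs are just sample data.

```python
# Hypothetical (actor)-[:ACTED_IN]->(movie) pairs.
acted_in = [
    ("Keanu Reeves", "The Matrix"),
    ("Carrie-Anne Moss", "The Matrix"),
    ("Tom Hanks", "The Green Mile"),
    ("Tom Hanks", "Cast Away"),
]

cast_by_title = {}
for actor, title in acted_in:                  # MATCH (actor)-[:ACTED_IN]->(movie)
    if title.startswith("T"):                  # WHERE movie.title STARTS WITH "T"
        # RETURN movie.title, collect(actor.name): group the actors by title
        cast_by_title.setdefault(title, []).append(actor)

# ORDER BY title ASC LIMIT 10
result = sorted(cast_by_title.items())[:10]
for title, cast in result:
    print(title, cast)
```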

 

I’m sure most of you can guess the result of the above query even if you’ve never heard of the language 😉 What else can we achieve with Cypher? You’ll find out soon in a dedicated series of posts 🙂