Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Friday, May 12, 2017

Do we still need to talk about Data Vault 2.0 Hash keys?

A few days ago, I ran into the article "Hash Keys In The Data Vault", published recently (2017-04-28) on the Scalefree company blog. Scalefree is a company founded by Dan Linstedt and Michael Olschminke. Linstedt is the inventor of Data Vault, a method to model and implement enterprise data warehouses.

The article focuses on the use of hash-functions in Data Vault data warehousing. To be precise, it explains how Data Vault 2.0 differs from Data Vault 1.0 by using hash-functions rather than sequences to generate surrogate key values for business keys. In addition, hash-functions are suggested as a tool to detect changes in non-business-key attributes, so that their values can be tracked over time.

Abstract

First I will analyze and comment on the Scalefree article and DV 2.0, and explain a number of tenets of DV thinking along the way. Critical comments will be made about using hash values as primary keys in DV 2.0, and the apparent lack of progress made in DV thinking regarding this matter. A discussion follows about the birthday problem and how it relates to DV 2.0 usage of hash functions. Using the Square approximation method it will be demonstrated how we can make informed and accurate decisions about the risk of a collision with regard to the data volume and choice of hash function. Different scenarios regarding the impact of a collision on data integrity will then be explored. Finally, a practical proposal is made to detect hash collisions and to prevent them from introducing data integrity violations into our data warehouse.

A Summary of the Scalefree article

I encourage you to first read the original article. My summary is here below.
  1. The business key is what the business users use to identify a business object.
  2. To identify business objects in the data warehouse, we should use a surrogate key that fits in one column instead of the (possibly composite) business key.
  3. The reason to use a single-column surrogate key is that business keys can be large (many bytes per key field, multiple key fields) which makes them slow - in particular for join operations.
  4. In DV 1.0, sequences are used to generate values for surrogate keys.
  5. Using sequences to generate surrogate key values implies a loading process consisting of at least 2 phases that have to be executed in order.

    The previous point requires some context regarding the architecture of the Data Vault model. Without pretending to completely represent all tenets of DV, I believe the explanation below is fair and complete enough to grasp the point:

    DV stores the data that makes up a business object in 3 types of tables:

    • The business key and its corresponding surrogate key are stored in hub-tables.
    • The change history of descriptive attributes is stored in satellite-tables.
    • Relationships between business objects are stored in link-tables.

    So, each distinct type of business object corresponds to one hub-table, and at least one, but possibly multiple satellite-tables. The satellite-tables refer to their respective hub-table via the surrogate key. Likewise, link-tables maintain relationships between multiple business objects by storing a combination of surrogate keys referring to the hub-tables that participate in the relationship.

    This means that for a single new business object, any of its satellite-tables can be loaded no sooner than when it is known which surrogate key value belongs to its business key. Since the new sequence value is drawn when loading the new business key into its designated hub-table, this means that the new business object must be loaded into its hub-table prior to loading its satellite-tables. Likewise, any relationships that the business object may have with other business objects can be loaded into link-tables only after *all* business objects that are related through the link have been loaded into their respective hub-tables.

    In practice this means that in DV 1.0, you'd first have to load all hub-tables, drawing new surrogate key values from the sequences as you encounter new business keys. Only in a subsequent phase would you be able to load the satellite- and link-tables (possibly in parallel), using the business key to look up the previously generated surrogate key stored in the hub-tables.

  6. In DV 2.0, hash-functions are used to generate values for surrogate keys.
  7. When using hash-functions to generate surrogate key values, hub-, satellite- and link-tables can all be loaded in parallel.

    The previous point needs some explanation.

    There is no strict, formal definition of a hash-function. Rather, there are a number of aspects that many hash-functions share and on which DV 2.0 relies:
    • Deterministic: a given set of input values will always yield the same output value. In a DV context, deterministic just means one business key maps to exactly one surrogate key.

      To some extent, a solution based on a sequence as generator of surrogate keys also appears to be deterministic, at least within the confines of one physical system.
    • Stateless: While a solution based on a sequence appears to be deterministic, its values are generated by incrementing the previous value, and the mapping from business key to surrogate key is "remembered" by storing them together in the hub. Remembering the mapping by storing it in the hub is what makes it deterministic, because only then do we have the ability to look up the surrogate key based on the business key (and vice versa). But if we had two separate but otherwise identical systems, and loaded the same set of business objects into both, each in a different order, then the mapping from business key to surrogate key would differ between the systems: whichever object was loaded first in a particular system draws the lower sequence number.
    • Fixed output size: DV expects a chosen hash-function to always return a fixed-length output. (And typically, the hash key should be smaller, ideally much smaller, than its corresponding business key.)
    • Uniformity: This means that the inputs of the hash-function yield results that are very evenly distributed across the output domain. If that is the case, the chance that different inputs (i.e., two different business keys) yield the same output value is very small. The phenomenon where two calls to a hash-function, using different arguments, yield the same result value is called a collision. We will have much to say about collisions later on in this article.

    Basically the idea is that a hash-function can be used to generate a surrogate key value that is typically much smaller (and thus, "faster", in particular for join operations) than its corresponding business key, and it can do so without an actual lookup against the set of existing surrogate key values. Ideally a hash-function calculates its output based solely on its input, and nothing else.

    It is this latter aspect that makes it different from a sequence, which generates values that have no relationship at all with the values that make up the business key. In the case of a sequence, the relationship between surrogate key and business key is maintained by storing it in the hub, and looking it up from there when it is needed.

  8. When using hash-functions there's a risk of collision. A collision occurs when two different inputs to the hash-function generate identical output.
  9. The risk of collision is very small. In a database with more than one trillion hash-values, the probability that you will get a collision is like the odds of a meteor landing on your data center.
  10. The MD5 hash-function is recommended for DV 2.0. As compared to other popular hash-functions (like MD6, SHA1 etc), MD5 storage requirements are relatively modest (128 bits), while the chance of a hash-collision is 'decently low'. It's also almost ubiquitously available across platforms and databases.
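To make this a bit more tangible, here is a minimal sketch of how such a hash key could be computed for a composite business key. I'm using MySQL-flavoured SQL, and the staging table and columns (stg_customer, customer_number, company_code) are hypothetical; the normalization rules (trimming, upper-casing, a delimiter between the key parts) are just one possible convention - what matters is that the exact same rules are applied everywhere the hash is computed:

-- derive a 128-bit hash key (32 hex characters) from a composite business key
SELECT MD5(
         CONCAT(
           UPPER(TRIM(customer_number)), ';'
         , UPPER(TRIM(company_code))
         )
       )                 AS hash_key
,      customer_number
,      company_code
FROM   stg_customer;

Because the hash depends only on the (normalized) business key, two otherwise independent systems that load the same business object will compute the same surrogate key - which is exactly the "stateless" property described above.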

Comments

Many tenets of Data Vault thinking make good sense:
  • The focus on business keys, rather than blindly using the primary key of the source systems feeding into the EDW, ensures that the result will serve the needs of the business.
  • While the use of surrogate keys in transactional systems is still often a source of debate, this practice is not controversial in data warehousing.

    That said, I think the technical reasons the article mentions for introducing a surrogate key (making keys smaller and therefore faster to join) are less important than the functional requirement of any data warehouse to integrate data from multiple data sources, and to ensure that identity there is resilient to changes in the source systems.

  • There's a lot to be said in favor of maintaining attribute history in satellite-tables. In particular, the ability to have multiple satellite-tables is appealing, since it allows you to create and maintain groups of attributes that somehow belong together, and maintain them as a unit. For example, grouping attributes based on their rate of change, or according to whether they tend to appear together in queries sounds like a very useful thing.
  • Link-tables also sound like a good idea. Decoupling direct links from the business objects and maintaining them in completely separate tables makes the EDW very resilient to changes in the data sources. In addition it allows you to add extra relationships that may not be directly present in the source systems but which make sense from the point of view of data integration and/or business intelligence.
  • Once we accept the benefits of satellite- and link-tables and surrogate keys, then we must also embrace hub-tables. The model wouldn't make any sense without them, since we need an integration point anyway to maintain the mapping between business key and surrogate key (which is of course exactly what the hub-table is).
  • Once we accept the modeling entities of DV, we also need to accept any constraints that surrogate key generation may have on the loading process. Of course, a dependency itself is something we would like to avoid, but the fact that a known limitation is recognized and anticipated is a good thing.

Flashback a few years ago

When I read the Scalefree article, I experienced a bit of a flashback that took me back some two, three years to this question ("Hash Key Collisions") by Charles Choi on the DataVault discussions group on LinkedIn. In case you cannot access LinkedIn, I'm reprinting the question below:
According to Dan's paper "DV2.0 and Hash Keys", the chances of having a hash key collision are nearly nonexistent (1/2^128). But I still worry.

What if a collision actually does occur that results in the business making a catastrophic decision? Can we really say we have 100% confidence in the "system of fact" when we choose to accept the risk of a collision? What is our course of action from a technology point of view if a collision did actually occur?
The paper "DV2.0 and Hash Keys" that Charles refers to is publicly available though not freely. (You can purchase it via Amazon.) Here's the relevant statement from the article:
The mathematical chances of a collision as a result of using MD5 are (1 / (2^128)) which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

In reality, you would have to produce 6 billion new business keys per second per hub for 100 years to reach a 50% chance of getting a collision. Not very likely to happen in our lifetime.
A similar question was asked by Ray OBrien in response to a post by Linstedt (see: #datavault 2.0 hashes versus natural keys):
collisions are real and MD5 not a good choice. but generally the smaller the input key domain and the larger the Hash output size, the less chance of collision, BUT it is always there.. so I would like to see some comments on the verification steps needed and cost to load of Collision Management. If Integrity of data is important, then this is important.
In Linstedt's answer, you can find a similar expression of this probability, which goes:
because we split business keys across multiple hubs, the chances of collision (even with MD5) are next to none.

Yes, they do exist – but you would have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place.

Objections

Regardless of the merits of DV (which I believe I stated fairly in the previous section), I have a few doubts about and objections to the writing and thinking by DV 2.0 advocates that I have observed so far, all around the topic of hash-key collisions. My objections are:
  • The wording around the probability of hash-collisions is not helpful to understand the risk. As such, it does not help to decide whether to use DV 2.0, let alone choose a suitable hash-function for a concrete use-case.
  • The actual numbers regarding probability of hash-collisions are stated flat-out wrong on more than just one occasion.
  • The probability doesn't matter if you can't afford to lose any data.
  • DV 2.0 does not discuss the consequences of a hash-collision, and no concrete advice is given on how to detect hash-collisions, let alone handle them.
I believe the questions by Charles Choi and Ray OBrien show that I am not alone. When they voiced their doubts, DV 2.0 was relatively new, and I can understand that these tenets of DV 2.0 may still have needed time to mature.

A couple of years have passed since, and after reading the article on the Scalefree company blog, I am sad to observe that, apparently, no progress has been made in DV 2.0 thinking. At least, if such progress has been made, the Scalefree blog article doesn't seem to offer any new views on the matter. Instead, it comes up with an - equally unhelpful - restatement of the probability of a hash-collision by comparing it to the chance of being hit by a meteor.

In the remainder of the article I will explain my objections and attempt to offer some thoughts that may help advance these matters.

The DV 2.0 wording around the probability of hash-collisions is not helpful

First, let's try and analyze the wording around probabilities, and what message it conveys.

Distracting Rhetoric

What does it really mean when someone says that "the probability [...] is like the odds of a meteor landing on your data center"? What does it mean, really, when someone says that the probability is "1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456"?

Well obviously, they are saying the chance is very small. Maybe it's just me, but I also sense a level of rhetoric in the wording that seems intended to dwarf the reader with Big Serious Numbers.

It's almost as if they're saying: this won't happen, so you shouldn't worry. You're not worrying about meteors hitting your data center all the time, so why worry about a hash-collision? Right?

You can observe that the rhetoric is working too, just look at how Charles Choi voiced his question: "According to Dan (Linstedt) [...] the chances [...] are nearly nonexistent (1/2^128). But I still worry.".

It's as if Charles is apologizing in advance for worrying.

Probabilities are not absolute

There is a more fundamental problem with the probability wording of the previous 2 examples, and that's that they project probability as an absolute.

To be fair, if you read them in context, then you'll notice that both statements are about MD5 collisions. Obviously, this matters, since not all hash-functions have equal probability of collisions. For hash-functions that have a fixed-length output, the chance surely has to have a relationship with the length of the output, since that puts a hard limitation on the number of unique values it could possibly encode.

However, apart from the output length of the hash-function and the algorithm it uses, there is at least one other factor which determines the probability, and that is your data volume.

Intuitively, this is easy to understand: if you have an empty set, the probability of the first hash-value causing a collision is exactly zero, since there is nothing to collide with. At the other extreme end of possibilities, if the set contains as many items as the total number of unique values the hash-function is capable of generating, then the probability of a collision is exactly one, since the entire keyspace has been "used up" already. In between these extremes, we have a growing number of existing entries that could collide with a new entry, so the probability increases from zero to one as the actual number of items (i.e, the number of rows in the hub - the data volume) increases.

While this observation seems trivial, it is important to mention it in this discussion because the aspect of volume is, for some reason, often not touched upon at all by DV 2.0 advocates. It's a mystery why that should be so, because if we knew the relationship between the probability of a collision, the maximum possible number of unique values, and the maximum volume of data, then we could reason about these variables sensibly. For example:
  • Given the maximum risk that I am willing to take to lose data due to a collision, what is the maximum volume I can store if I use a 128-bit hash-function?
  • Given the maximum risk of collision that I am willing to take, and given the maximum number of rows I need to store, what would be the minimum output length of the hash-function I should look for?
  • Given my current data volume, and my current choice of hash-function, what is the risk I am running now of losing data due to collision?

The Birthday Problem

Interestingly, Linstedt does provide one statement that at least takes the data volume into account: "you'd have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place". Let's see what that means exactly.

Apart from the probability, this statement includes the other two variables: the 128-bit key length of MD5, and a data volume of 6 billion rows per second for 100 years. But if you look at the probability (50%) you'll notice how completely useless this wording is, apart from its rhetorical power. Who in their right mind is interested in a system that can only accept half of the data you're trying to store in it? How can you possibly apply this piece of knowledge in any practical sense to a system you have to actually build?

I spent a while thinking about how this statement could have come about, and I have come to believe that it is actually a restatement of the classical birthday paradox:
The birthday paradox, also known as the birthday problem, states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.
(As compared to Linstedt's statement, the 50% probability stays the same; the 6 billion rows per second for 100 years is equivalent to the number of people gathered, and the number of possible unique values in the key corresponds to the number of days in a year.)

Whether or not the original birthday problem is actually what made Linstedt word his statement the way he did, I think it's clear that a 50% probability of a collision has no practical bearing on building any kind of database. To me, it just sounds like more rhetoric to convince us that hash-collisions are really rare.

Probabilities are stated flat out wrong

The discussion of the birthday problem brings us to an actual method to calculate the probability of a collision as a function of data volume and key length. The Wikipedia article I linked to provides an exact method, as well as many useful approximations, which are much easier to calculate. The birthday problem Wikipedia article explains it far better than I ever could, and also provides this really useful rule of thumb called the square approximation:
A good rule of thumb which can be used for mental calculation is the relation

p(n) ≈ n² / 2m

which can also be written as

n ≈ √(2m × p(n))

which works well for probabilities less than or equal to 0.5
with:
• n: the actual size of the keyset - i.e., the number of rows you need to fit into the hub
• p(n): the probability of a collision given n
• m: the theoretical maximum size of the keyset, i.e. the maximum number of unique values that your hash-function can encode
Given a fixed-length hash-function, such as MD5, m can be calculated by raising 2 to the power of the key length (expressed as a number of bits):


m = 2 ^ bitlength

(In case this needs explaining: if your key were just one bit long, it could hold only 2^1 = 2 values - 0 or 1. If the key were 2 bits long, it could hold 2 * 2, or 2^2 = 4 unique values. With 3 bits, 2 * 2 * 2 or 2^3 = 8, and so on.)

From the discussion above as well as the previous section, we can now conclude that at least one statement of the probability:
The mathematical chances of a collision as a result of using MD5 are (1 / (2^128))
is simply flat-out wrong, since it does not take the data volume into account. Rather, since 2^128 is the number of possible unique values that MD5 can cover, 1 / 2 ^ 128 is the chance that the second row you put into your hub will collide with the first one.

So how big or small are the odds really of running into an MD5 collision in case we're handling a volume of 6 billion rows per second for 100 years? Using the square approximation, we get:

p(n) ≈ (6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days * 100 years)² / (2 * (2 ^ 128 bits in an MD5 hash-value)) ≈ 0.526


which is in fact closer to 53%. To figure out how many years we would need to insert 6 billion rows per second to achieve the 50% chance of running into a collision, we can use the second form of the formula:

n ≈ √(2 * (2 ^ 128 bits in an MD5 hash-value) * 50% probability) ≈ 1.84467e+19 rows in the hub

Dividing by 6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days gives you a little under 97 and a half years.
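For convenience, here is a minimal sketch of the same back-of-the-envelope calculations in SQL, so you can plug in your own data volume and hash length. I'm using MySQL-flavoured SQL here (POWER and SQRT are widely available; on HANA you would add FROM DUMMY):

-- square approximation, first form: p(n) ≈ n² / 2m, with m = 2^bits
-- probability of at least one collision for 6 billion rows/second during 100 years, 128-bit hash
SELECT POWER(6e9 * 60 * 60 * 24 * 365.25 * 100, 2) / (2 * POWER(2, 128)) AS p_collision;

-- square approximation, second form: n ≈ √(2m × p(n))
-- number of rows at which a 128-bit hash reaches a 50% collision probability
SELECT SQRT(2 * POWER(2, 128) * 0.5) AS max_rows;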

The point of doing these calculations here is obviously not to prove Linstedt wrong by showing that you'd already arrive at a 50% chance after only 97-odd years instead of after 100. Nor is it to determine that after 100 years, the chance is actually closer to 53%. Besides the fact that I'm using an approximation, neither makes any sense anyway, because a probability of 50% is already way, way beyond any definition of a working system.

The point I am trying to make is that it is perfectly possible to reason about large numbers and to clearly and transparently demonstrate how they are calculated. Using square approximation, you have a tool to calculate the value of the third variable once you have the value of the other two, allowing you to reason about it from three different angles.

I think we can all agree that's a much better position than getting stumped by Really Seriously Big Numbers.

Probabilities add up for each keyset

So far, we've just looked at the probability for encountering a collision while loading a single hub. But the probability increases as you have more hubs.

Intuitively this is clear because each hub could encounter a collision independently. So, the chance of suffering a collision in at least one of them grows as you have more hubs to maintain, and is quite a bit more than the chance of suffering a collision in just one particular hub.

If we'd like to compute the chance of getting a collision in at least one hub we can apply the following calculation:
1 - ((1 - Ph1) * (1 - Ph2) * ... * (1 - PhN))
with:
• Ph1: the probability of a collision in the first hub
• Ph2: the probability of a collision in the second hub
• ...
• PhN: the probability of a collision in the N-th (last) hub
The rationale behind this is that if the probability of a collision is P(n), then the chance of not having a collision is 1 - P(n). To calculate the chance of not having a collision in any of the hubs, we have to multiply the individual chances of not having a collision in each particular hub with each other. If that number is the chance of not having a collision in any of the hubs, then all remaining probability must mean there is a collision in one or more hubs. So the chance of having at least one collision is obtained by subtracting the probability of having no collision at all from 1.
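To get a feel for the effect, suppose - purely as an illustration - that we have 100 hubs, and that each of them individually runs a collision probability of 10^-12. The chance of at least one collision somewhere is then 1 - (1 - 10^-12)^100, which is approximately 100 * 10^-12 = 10^-10: a hundred times the risk of any single hub. For the tiny per-hub probabilities we are dealing with, the combined risk is, for all practical purposes, simply the sum of the individual risks.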

So the chance that someone ever will encounter a collision could be quite a bit larger than you'd expect if you're focussing on just one hub.

No matter how slight a Probability, I can't afford to lose data

Now we arrive at a more fundamental objection regarding the matter of using hash-values as keys in your database.

If you re-read Charles' question, you'll notice that he is politely explaining that, although he understands and appreciates that an MD5 hash-collision may be really rare, he simply doesn't ever want to lose any data because of it. Ray OBrien raises the exact same point, even mentioning data integrity as the reason why he cares.

When this issue is put forward in DV 2.0 discussions, it usually means the end of any meaningful discussion. The answers that DV 2.0 advocates usually give in response typically match one of the following:

"Look, do you realize how rare a hash-collision is?"
Yes thank you. You just stumped me with some Really Seriously Big Numbers, and I get it. Super rare. I just don't want to lose data though.
"You don't have to use 128-bits MD5, you can use a hash-function that returns larger values, like 160-bits SHA1. Collisions will then be, you know, even more rare."
Perfect thanks. Did I mention I can't afford to lose data?
"We use 2 hash-functions as key and it works for us."
Ah, I get it now. You made collisions Super-duper-rare, how clever. So will you never lose data now?
"I have built hundreds of Terabyte-sized data warehouses, and I never encountered a hash-collision."
Well let me guess. Might that be because they are very rare?
"Teradata is using hashing to solve MPP data distribution, and Hadoop uses hashing in HDFS. If it works for them, then why wouldn't it work in DV 2.0"
So you're saying Teradata and Hadoop use hashing for some purpose, and DV2.0 is using hashing for a completely different purpose, and now you want me to explain why it works in one use-case but not for a completely different use-case? That's...interesting.

How about: Teradata and Hadoop are not using hash values as primary keys, and DV 2.0 is?
"Look, why are you so worried about hash-collisions? You're not worrying all the time about a meteor hitting your data center, are you?"
Actually, I do. That's why our database is geographically distributed across data centers.
"Ha! Gotcha now. What about two meteors? Wouldn't that be comparable to using two hashes?"
I suppose it would. Difference is, I can't help meteors falling on my data centers. But I can choose to stick to sequences instead of hash-functions.
"But I just explained, sequences are a bottleneck and prevent you from parallel loading your EDW!"
Yes I heard. And I asked my customer: they are pretty sure about how they feel regarding the possibility of losing data, and they clearly told me they'd rather wait around for the data to be loaded as compared to being able to report super-quickly on wrong data.
"We use hash-keys to store tweets for sentiment analysis, and our results are pretty accurate, even if we lose data sometimes."
I'm sure they are - good for you! But we manage a monetary transaction log and we feel that the risk of losing one $1,000,000,000 transaction just doesn't justify loading 1,000,000,000 transactions worth $1 super-fast in a parallel fashion. Silly us, eh?


And that's really all there is to it: Probabilities don't mean a thing if you're really sure you don't want to lose any data. When it happens, it is no comfort that you were the one to have had such extraordinarily bad luck experiencing it.

Another thing people may overlook is that the probability also doesn't tell you when it will happen. The only real guarantee you have is that inserting the first key in an empty hub will always succeed. But already the 2nd row might collide. It probably won't, and the odds are really slim. But it might. If you're sure you don't want that, then don't use hash values as keys. It's really that simple.

Sure, there might be other risks that could make us lose data. For example, the probability of disk corruption might be larger than that of a hash-collision. But it doesn't follow that we should set ourselves a trap when we can avoid it - especially when we know how to avoid it.

We cannot control disaster like disk corruption or meteor impact. If we could though, we would! Whether to use hash keys or to stick to sequences is a conscious choice, so let's be sure we make it based on information and requirements, and rather not based on some analogy that is chosen with the express purpose of making you feel a little bit ridiculous for being so averse to taking a risk.

If you're building a database for someone else, and you're considering using hashes as keys for your data, then be prepared to ask your customer: "How much data can you afford to lose?" or "In the event that we cannot load some data due to hash-collisions, how much time and effort can we spend to take the system offline so we can fix it?"

Another way to think about it is this: suppose you would, in fact, lose data because of a collision. How comfortable are you admitting to your customer that you constructed a solution that, by design, could end up giving wrong results, while there was an alternative that guarantees correctness, at least to the extent of things you can control? And if you did get wrong results, did you anticipate just how wrong those results could be?

I truly feel that considerations like this are not the database or data warehousing professional's to make - they are the customer's. It's their data. Please, respect that.

What if we do have a collision?

I get that there are use cases where you might want to accept the small risk of a collision. But you cannot really, truly make that assessment if you haven't considered and anticipated it as if it is a real event actually hitting you. I think DV 2.0 falls short in nourishing healthy discussion regarding the anticipation of such events.

So, what will happen if you have your hash-keys in place, and you encounter a collision? We can try and anticipate a few concrete scenarios.

First of all, will you even detect a collision when it happens? The Scalefree article has a few flow diagrams showing how raw data is staged, and then loaded into a hub. In that flow, the row is dropped when the hash already exists. So the question now is, why did the hash already exist?

Obviously, it's possible that we already loaded this business object, and we're merely seeing it again. In that case, we're fine and we'll simply be loading newer data into the satellites and links for that business object. But it's also possible that this is an entirely different business object that happened to yield the exact same hash-value as a different business object loaded earlier. In other words, you now have a collision. For the hub, it will pass by unnoticed, but the satellites and links that point to that business object will now store data pertaining to more than one distinct business object.

So in this case, we're not losing data, but compromising the integrity of the business object that arrived earlier. Our database integrity is now violated and our queries will return wrong results. You won't know though, because you didn't attempt to detect a collision. To your data vault, the distinction between multiple different business objects has ceased to exist.

I don't know about you, but this does not feel like a happy place to me - especially since this is an avoidable consequence of the design. (And a whole bunch of bad luck, of course.)

Alternatively, we build our solution in such a way that we can at least detect collisions. Once we detect it, we can maybe prevent loading associated data for the satellite-tables and link-tables for the colliding business object. This means we will be completely ignoring the later arriving business object, as if it isn't there.

We have now lost data. That is not a good thing, but at least this allows the earlier arriving business object to maintain its integrity. Our query results won't be wrong, they will just be incomplete. To me this is a slightly happier place, but the fact that it's a matter of fate which one of the objects made it into the database, and which one was rejected still makes me feel that the solution has failed.

But suppose we do want to go for that treatment (after of course getting confirmation from the business that this is really what they want) - how can we implement it? Well, at the very least, the load process for either the hub or the staging area would need to compare the hash value as well as the business key. Only if both are equal can we consider the objects equal.

Making the comparison is not hard but it will of course be slower than only calculating the hash, because in this case you need a lookup on the hub, just like you did with the DV 1.0 solution based on sequences.

But what if we detect the collision in this way? How can we then use this information to prevent loading the associated satellite- and link-table data?

Introducing a collision table

We could store collisions in a special collision table. We'd get one such table for each hub. The collision table would have the same layout as the hub table to which it corresponds, and you'd use it to store the colliding hash, as well as the business key for which the collision was met. The key of the collision table would have to be made up of both the hash key as well as the business key, so that we can handle multiple collisions for the same hash key.
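To make this concrete, below is a minimal sketch of such a collision table and of the collision-detecting step in the hub load. The names are hypothetical (H_CUSTOMER for the hub, STG_CUSTOMER for a staging table that already carries the computed HASH_KEY), the business key is simplified to a single CUSTOMER_NUMBER column, and I'm again using MySQL-flavoured SQL. I also include a LOAD_ID column, which will come in handy for the clean-up scenario described further down:

-- collision table: same layout as the hub, keyed on the hash *and* the business key
CREATE TABLE H_CUSTOMER_COLLISION (
  HASH_KEY        CHAR(32)    NOT NULL  -- the colliding (MD5) hash, as hex
, CUSTOMER_NUMBER VARCHAR(32) NOT NULL  -- the business key for which the collision was met
, LOAD_ID         BIGINT      NOT NULL  -- identifies the load that detected the collision
, PRIMARY KEY (HASH_KEY, CUSTOMER_NUMBER)
);

-- collision detection during the hub load: a staged business key whose hash already
-- exists in the hub for a *different* business key is a collision
-- (a real implementation would also guard against recording the same collision twice)
INSERT INTO H_CUSTOMER_COLLISION (HASH_KEY, CUSTOMER_NUMBER, LOAD_ID)
SELECT s.HASH_KEY
,      s.CUSTOMER_NUMBER
,      s.LOAD_ID
FROM   STG_CUSTOMER s
JOIN   H_CUSTOMER   h
ON     h.HASH_KEY        =  s.HASH_KEY
AND    h.CUSTOMER_NUMBER <> s.CUSTOMER_NUMBER;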

Once the collision table is in place, and the process for loading the hub-tables is modified to detect and store collisions, we can think about the process for loading the satellite- and link-tables. I can see two options:
  • Have the load process for satellite- and link-tables do a lookup to the collision table to see if you need to discard data for business objects with collisions
  • Check collision tables after the load and run a clean-up process to restore integrity after detecting new collisions.

Collision table lookup

This solution relies on the process that loads the satellite- and link-tables to make a lookup to the collision table, using both the calculated hash and the business key. If collisions really are as rare as they should be, then that lookup should be really fast, because the collision tables will be pretty much empty.

If all goes well, then the lookup will fail. This means we have no collision and we can proceed loading the satellite- and link-tables. In the rare event that the lookup succeeds there is apparently a hash-collision, and we must not load the satellite- and link-tables to prevent violating the integrity of our data. You are now in a position to either discard the data, or to store it someplace else in case you have a clever idea of reconciling the data later on.
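A sketch of what this could look like for a satellite load, continuing the hypothetical example from above (S_CUSTOMER_DETAILS is a made-up satellite with a couple of made-up attribute columns; the same pattern applies to link-tables):

-- load the satellite only for staged rows that do not appear in the collision table
INSERT INTO S_CUSTOMER_DETAILS (HASH_KEY, LOAD_ID, NAME, CITY)
SELECT s.HASH_KEY
,      s.LOAD_ID
,      s.NAME
,      s.CITY
FROM   STG_CUSTOMER s
WHERE  NOT EXISTS (
         SELECT 1
         FROM   H_CUSTOMER_COLLISION c
         WHERE  c.HASH_KEY        = s.HASH_KEY
         AND    c.CUSTOMER_NUMBER = s.CUSTOMER_NUMBER
       );

The staged rows that do match the collision table can then be discarded, or routed to a separate quarantine table for later reconciliation.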

However, we now have reintroduced the constraint on the loading process, because we now rely on the process that loads the hubs to also detect and store collisions in the collision table.

So, with this solution, we lose the ability to load hub-tables in parallel with satellite- and link-tables. What may be of some comfort in comparison with the DV 1.0 sequence based solution is that the collision table lookup will be much faster than a hub-lookup to find the value generated by a sequence, because the collision table will be pretty much empty. So, the burden of the constraint and loading dependency should be much lighter than in the case of a DV 1.0 sequence based solution.

Another important drawback is that if your solution spans multiple systems, you need to maintain one set of collision tables somewhere, and all loading processes will be dependent upon them. In other words, the solution is not stateless anymore.

Clean-up after load in case of new collisions

Alternatively, we keep loading hub-, satellite- and link-tables in parallel, and we check the collision tables after each load to see if the last load introduced any new collisions. If we find that it did, we need to perform clean-up after the load.

The way clean-up would work is as follows: our load process should have been logged and our satellite- and link-tables should have metadata identifying the load process that put its contents in the data warehouse. The load identifier would also be stored in the collision table. Using that information, we can identify which new collision our last load introduced. We now have the load identifier as well as the hash key of the new collision, and we can then use that to delete all satellite- and link-table rows that have the load identifier of our latest load, as well as the hash key of the colliding business object.
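Continuing the hypothetical example, the clean-up for one satellite could look something like the statement below (@last_load_id stands for the identifier of the load we just finished; the same statement would be repeated for every satellite- and link-table attached to the hub):

-- remove everything the last load attached to a hash key that turned out to collide
DELETE FROM S_CUSTOMER_DETAILS
WHERE  LOAD_ID = @last_load_id
AND    HASH_KEY IN (
         SELECT HASH_KEY
         FROM   H_CUSTOMER_COLLISION
         WHERE  LOAD_ID = @last_load_id
       );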

After clean-up, we have restored data integrity for those business objects that encountered a collision, up to the point prior to the last load.

Now, we know that our last load brought us a business object that had a hash collision with some business object that was already in our data warehouse, and we made the conscious decision to reject that data for now, or maybe store it someplace else until we know how to reconcile it. But the last load might also have brought us data that actually belonged to the business object that already existed in our data warehouse. We would really like to reload that part of the data for the existing business object. Our clean-up process had no way of distinguishing between satellite- and link-data for the one or the other business object, because it only knows about their colliding hash keys. In other words, our clean-up process might have removed data that actually did belong to the already existing business object, and now we need to put that back.

The solution would be to have an alternative load process especially for this -hopefully exceptional- case.

The alternative load process would be similar to the collision table lookup solution described above. It would only load satellite- and link-tables, and it would include a lookup to the collision table. If the lookup fails, we're dealing with data belonging to the existing business object and we can load it. If the lookup succeeds, then this is data that belongs to a new business object that caused a collision, and we should discard it or store it someplace else for later reconciliation.

If things go the way they should, then the clean-up-and-reload process should seldom be needed. And if we do need to run it, it would probably be quite fast, since it deals with only a few business objects - typically only one.

The only drawback now is that our data warehouse lived through a short period where integrity was compromised during the load process. But at least we can repair integrity for all existing business objects, and selectively discard only data for those business objects that suffer from a hash collision with an already existing business object.

While this approach still relies on keeping collision tables around, we regain our ability to load hub-, satellite- and link-tables in parallel. We can even do loads spanning multiple systems; we just need to take care to clean those up as well in the case we do encounter a collision.

Other solutions?

I don't want to pretend this is an exhaustive list - I'm hoping there are more options and I just can't think of them right now.

How to load the colliding business objects?

What these scenarios do not solve is loading the later arriving business object. We only managed to prevent these objects from entering our data warehouse, but have not found a solution to load that data as well.

And we can't - not unless we change the key.

On the other hand, changing the key might just be doable: you could decide to try another hash-function for at least that hub. You would need to update all satellites and links pointing to that hub (and of course, the hub itself) and rehash every row that points to it.
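A rough sketch of such a rehash, again on the hypothetical H_CUSTOMER example and switching from MD5 to SHA-1 (MySQL-flavoured SQL; a real migration would of course also have to deal with foreign keys, indexes, link-tables, the collision table and downtime):

-- 1. widen the hash columns so they can hold a 160-bit (40 hex characters) SHA-1 value
ALTER TABLE H_CUSTOMER         MODIFY HASH_KEY CHAR(40) NOT NULL;
ALTER TABLE S_CUSTOMER_DETAILS MODIFY HASH_KEY CHAR(40) NOT NULL;

-- 2. rehash the satellites (and links) first, while the hub still carries the old hash,
--    so that the business key can still be looked up via the old hash value
UPDATE S_CUSTOMER_DETAILS s
JOIN   H_CUSTOMER         h
ON     h.HASH_KEY = s.HASH_KEY
SET    s.HASH_KEY = SHA1(UPPER(TRIM(h.CUSTOMER_NUMBER)));

-- 3. finally, rehash the hub itself
UPDATE H_CUSTOMER
SET    HASH_KEY = SHA1(UPPER(TRIM(CUSTOMER_NUMBER)));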

Of course, since you're still relying on hash-keys, just - hopefully - larger ones, you haven't solved the problem; you've only shifted the odds further in your favour. And you might even run into a new collision while you're doing the rehashing operation. But that's just the life you've chosen. At least we now have something that resembles dealing with the problem rather than praying it won't happen.

I guess my main point here is: make sure every stakeholder actually accepts the risk, and make sure the procedures for dealing with a collision are specified and tested. The fact that the chance you'll ever need them may be next to negligible is not a license to pretend you do not need to be prepared to perform these tasks. If it is decided that you'll be taking the risk, then actually do take the risk, and take it seriously.

Can't we have our cake and eat it too?

Isn't there some way we can benefit from hashes and still, magically immunize ourselves against hash-collisions? It turns out we can.

To recap:
  • DV, like other data warehousing methods - suggests using surrogate keys, because business keys are largish and unwieldy, and slow down join operations.
  • DV 2.0 suggests using hash-keys to avoid sequences, which make it impossible to load hubs, satellites and links all in parallel, and which slow down the load due to a lookup in a large hub.
This may be a long shot, but it has the advantage that it is guaranteed to work. As should be amply clear from the discussions in this article, hash-keys always carry the chance of a collision, and it doesn't matter how small the risk is if you already know you do not want to accept losing data or giving up integrity. So, what is clear is that if that is the requirement, you cannot use hashes as keys. Period.

But that does not mean we cannot benefit from hashes.

Especially if the hash-key is small in comparison with the business key, we could build up our keys as a composite of the hash value, followed by the field or fields that make up the business key. If our database uses B-tree indexing, it will be able to resolve joins in the vast majority of cases based on the hash-key column alone, which should be the first column in the key definition. You would still keep the fields of the business key in your join conditions, to ensure that, in case of a hash-collision, the query will still return the correct result. Since collisions are so rare, the database will end up with very few rows after it has resolved the join over the first field in the key, so the overhead of such a large key should be minimal.
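A sketch of what this could look like, once more with hypothetical names and MySQL-flavoured SQL. The hash is purely a join accelerator here: the business key columns remain part of the key and of every join condition, so even if two business keys should ever hash to the same value, the join still tells them apart:

-- hub keyed on the hash first, followed by the business key columns
CREATE TABLE H_CUSTOMER (
  HASH_KEY        CHAR(32)    NOT NULL
, CUSTOMER_NUMBER VARCHAR(32) NOT NULL
, COMPANY_CODE    VARCHAR(4)  NOT NULL
, LOAD_ID         BIGINT      NOT NULL
, PRIMARY KEY (HASH_KEY, CUSTOMER_NUMBER, COMPANY_CODE)
);

-- satellites and links repeat the full composite key and join on all of it;
-- a B-tree index on the primary key resolves almost every join on HASH_KEY alone
SELECT h.CUSTOMER_NUMBER
,      h.COMPANY_CODE
,      s.NAME
,      s.CITY
FROM   H_CUSTOMER         h
JOIN   S_CUSTOMER_DETAILS s
ON     s.HASH_KEY        = h.HASH_KEY
AND    s.CUSTOMER_NUMBER = h.CUSTOMER_NUMBER
AND    s.COMPANY_CODE    = h.COMPANY_CODE;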

Of course, this solution does not help in cutting down the amount of data you need to store - that will be much more in this scenario, since you keep dragging the business keys literally everywhere: every satellite- and link-table that points to this hub will inherit the column for the hash, as well as all columns that make up the business key. But if that is acceptable, the cost is mostly extra storage, and not so much extra join processing.

The benefit of the relatively fast join breaks down, though, when the database uses hash-joins. In that case, it will not be able to use the first column of the key as a prefix.

Finally

I hope you enjoyed this article. I am super curious to hear from DV practitioners if they have scenarios we can learn from. Drop a line in the comments! I'm looking forward to it.

Thursday, March 30, 2017

Announcing Shivers - Visualizing SAP/HANA Information View Dependencies

Dear SAP HANA DBAs, developers, architects etc.: we at Just BI have released Shivers.

What is Shivers?

Shivers stands for SAP/HANA Information Viewers. It is a tool to analyze and visualize dependencies between SAP/HANA information views (that is, Analytic Views, Attribute Views, and Calculation Views) and catalog objects (such as tables and views). Below is a screenshot of Shivers so you can get an impression of what it looks like:

Why Shivers?

In our work as SAP- and Business Intelligence consultants, SAP/HANA information views are a key tool in delivering Business Intelligence and Custom App Development Solutions to our customers.

In some cases, information views can get quite complex. Visualization and documentation features offered by development tools, like HANA Studio, do not always provide the kind of insight we need to create or maintain our solutions.

One very particular and concrete case is creating technical documentation, or preparing presentation materials for knowledge transfer sessions to handover to the customer support organization. In the past, some of our consultants would manually piece together overview slides of the data objects (information views, tables) that make up our solutions. I decided to try to whip up a solution that would make tasks like this a little simpler.

How do I install and run Shivers?

The simplest option is to download a shivers.zip archive from github. You can then unzip the archive, and open the index.html file inside it in a modern web browser (tested with Chrome and IE11, but it should work in all modern browsers).

Note that you do not need to install shivers on a web server or HANA XS server. It runs directly from your local disk in your browser. Of course, you can install Shivers on a webserver or HANA XS server if you like, and then you can open Shivers by navigating to the url corresponding to the place where you installed Shivers. But this won't affect how Shivers works.

How do I load information views into Shivers so I can visualize them?

On the Shivers toolbar at the top of the application, you'll find a file chooser button. When you click it, a file browser will open. Use it to browse to whatever location on your local disk you keep your information view source files.

I, for example, have my HANA Studio workspace stored in C:\Users\rbouman\hana_work, so I browse to that location and then find the appropriate path in the subdirectories to the HANA system and package I'm interested in. Once I find a directory containing information view source files, I select them and confirm the file chooser.

Shivers will then prompt for a package name. It needs to do this to know how other views could refer to this set of views. Since the package name is itself not stored in the information view source files, the user has to supply that information. Later, when analyzing dependencies, Shivers will use the entered package name and match it with package names it encounters inside the information view source files, whenever an information view refers to other information views.

After these steps, Shivers will populate the treeview on the left-hand side of the screen with the package structure and entries for the information views you loaded.

Note that Shivers has a log tab. The log tab reports on the loading process, and it will also report whenever it loads an information view that depends on other information views that are currently not loaded into Shivers. If you want to make complete and exhaustive graphs, you should then also load whichever file is reported in the log tab as an omitted dependency.

When the loaded files are visible in the treeview, you can click on any of the treenodes representing an information view. When you select one, a tab will open inside the Shivers window, showing a dependency graph of the selected information view.

Shivers currently does not do any clever layout of your dependency graphs. But you can manually drag and place items inside the graph to make it look good.

When you are done editing your graph, you can right click it, and choose "Export Image" to export the visualization to a png file, which you can then use in your technical documentation or knowledge transfer presentation slide deck.

What about HADES?

In my previous blogpost (http://rpbouman.blogspot.nl/2016/10/sap-hana-on-which-base-columns-do-my.html) I wrote about HADES, which is a bunch of (open source) utilities to report on information view metadata.

In a way, HADES and SHIVERS complement each other.

HADES is a set of server-side tools (so far, only stored routines) that help with the analysis of information views. HADES is data-oriented: it requires access to a HANA SQL client and allows you to extract just about any information you'll ever want to know about your information views. But HADES is in a way also very low-level, and using it effectively requires SQL skills.

Shivers is a graphically oriented client-side tool. It is implemented as a web page that you start in your browser, directly from your file system. Shivers does not require any connection to a HANA server. Rather, you use checked-out information view source files (.analyticview, .attributeview and .calculationview files), load them into Shivers, and then the tool does static code and dependency analysis. So far, the possibilities to extend or influence the reporting of Shivers are very limited. Shivers draws dependency graphs of your information views, and for now - that is it.

Why is Shivers an offline client-side tool?

Implementing Shivers as a non-server, client-side tool was a very deliberate decision. On our customers' systems, we cannot assume that we are allowed to install our tools on the HANA server for just any purpose. So we really must have a solution that works regardless of that. What we can assume is that we have access to the information view source files, since we mostly work with HANA Studio and check out information view source files from the HANA repository all the time.

What are the terms and conditions for using Shivers?

Shivers is open source software under the Apache 2.0 License. This should give you all the freedom to copy, use, modify and distribute Shivers, free of charge, provided you respect the Apache License.

How can I get support for Shivers?

We at Just BI are always happy to help if you run into any issue, but Shivers is not currently a for-profit solution. If necessary we can always negotiate professional support should you require it.

We Want your Feedback!

Please check out Shivers and let us know what you think. You can either drop me a line at this blog, or post issues in the github issue tracker.

Don't be shy - bugs, new feature proposals, critique - everything is welcome. Just let us know :)

You can also fork the shivers project, and contribute your work back via a pull request.

Monday, October 24, 2016

SAP HANA: On which base columns do my information views depend?

For one of Just-BI's customers, we're currently working to productize a custom-built proof-of-concept SAP/HANA application.

This particular application is quite typical with regard to how we use SAP/HANA features and techniques: it is a web-application for desktop and tablet devices, that we serve through SAP/HANA's XS engine. For database communication, we mostly use OData (using XS OData Services), as well as the odd xsjs request (in this case, to offer MS Excel export using ODXL)

The OData services that our application uses are mostly backed by SAP/HANA Calculation views. These, in turn, are built on top of a mixed bag of objects:

  • Some of these are custom base tables that belong to just our application;
  • Some are base tables that collect output from an advanced analytics recommendations algorithm that runs in an external R server
  • Some are information views (analytic views, attribute views and calculations views) that form a virtual datamart (of sorts) on top of base tables replicated from various SAP ERP source systems to our SAP/HANA database.

One of the prerequisites to productize the current solution is a re-design of the backend. Redesign is required because the new target system will be fed from even more ERP source systems than our proof-of-concept environment, and the new backend will need to align the data from all these different ERP implementations. In addition, the R algorithm will be optimized as well: in the proof-of-concept environment, the advanced analytics algorithm passes through a number of fields for convenience that will need to be acquired from elsewhere in the production environment.

To facilitate the redesign we need to have accurate insight into which base columns are ultimately used to implement our application's data services. As it turns out, this is not so easily obtainable using standard tools. So, we developed something ourselves. We think this may be useful for others as well, which is why we'd like to share it with you through this blog.

Information View Dependencies

The standard toolset offers some support to obtain dependencies for information views (analytic views, attribute views and calculation views): HANA Studio offers some dependency visualization features while modeling, and there is the OBJECT_DEPENDENCIES system view, which can be queried.

As it turns out, these standard tools do not give us the detailed information that we need. The HANA studio features are mainly useful when designing and modifying information views, but do not let us obtain an overview of all dependencies, and not in a way that we can easily use outside of HANA Studio. The usefulness of querying the OBJECT_DEPENDENCIES system view is limited by the fact that it only reports objects - that is, base tables or information views - but not the columns contained therein.

It looks like we're not the only ones struggling with this issue.

Getting the information view's definition as XML from _SYS_REPO.ACTIVE_OBJECT

To get the kind of information we need, we're just going to have to crack open the definition of the information view and look what's inside. As it turns out, HANA stores this as XML in the CDATA column of the _SYS_REPO.ACTIVE_OBJECT system table, and we can query it by package name, object name and object suffix (which is basically the extension of the file containing the definition that is stored in the repository):

SELECT  CDATA
FROM    _SYS_REPO.ACTIVE_OBJECT
WHERE   PACKAGE_ID    = 'my.package.name'
AND     OBJECT_NAME   = 'CA_MY_CALCULATION_VIEW'
AND     OBJECT_SUFFIX = 'calculationview'

With some effort, _SYS_REPO.ACTIVE_OBJECT can be joined to OBJECT_DEPENDENCIES to discover the objects on which the information view depends:

SELECT      od.BASE_SCHEMA_NAME
,           od.BASE_OBJECT_NAME
,           od.BASE_OBJECT_TYPE
FROM        _SYS_REPO.ACTIVE_OBJECT ao
INNER JOIN  OBJECT_DEPENDENCIES     od
ON          '_SYS_BIC'                          = od.DEPENDENT_SCHEMA_NAME
AND         ao.PACKAGE_ID||'/'||ao.OBJECT_NAME  = od.DEPENDENT_OBJECT_NAME
AND         'VIEW'                              = od.DEPENDENT_OBJECT_TYPE
WHERE       ao.PACKAGE_ID                       = 'my.package.name'
AND         ao.OBJECT_NAME                      = 'CA_MY_CALCULATION_VIEW'
AND         ao.OBJECT_SUFFIX                    = 'calculationview'

(Note: OBJECT_DEPENDENCIES reports all dependencies, not just direct dependencies)

Or we can query the other way around, and find the corresponding model for a dependency we found in OBJECT_DEPENDENCIES:

SELECT      ao.PACKAGE_ID
,           ao.OBJECT_NAME
,           ao.OBJECT_SUFFIX
,           ao.CDATA
FROM        object_dependencies     od
INNER JOIN  _SYS_REPO.ACTIVE_OBJECT ao
ON          SUBSTR_BEFORE(od.base_object_name, '/') = ao.package_id
AND         SUBSTR_AFTER(od.base_object_name, '/')  = ao.object_name
AND         ao.object_suffix in (
              'analyticview'
            , 'attributeview'
            , 'calculationview'
            )
WHERE       od.DEPENDENT_SCHEMA_NAME = '_SYS_BIC'
AND         od.DEPENDENT_OBJECT_NAME = 'my.package.name/CA_MY_CALCULATION_VIEW'
AND         od.DEPENDENT_OBJECT_TYPE = 'VIEW'

NOTE: It turns out that querying OBJECT_DEPENDENCIES fails at reporting dependencies between analytic views and the attribute views they use. To capture those dependencies, you need to query _SYS_REPO.ACTIVE_OBJECTCROSSREF.

Parsing the information view's XML definition with stored procedure p_parse_xml

Once we obtained the XML that defines the information view, we still need to pull it apart so we can figure out how it is tied to our base table columns. To do that, we first apply a general XML parser that turns the XML text into a (temporary) table or table variable, such that each row represents a distinct, atomic element inside the XML document. For this purpose I developed a HANA stored procedure called p_parse_xml. Here is its signature:

create PROCEDURE p_parse_xml (
  -- XML string to parse
  p_xml nclob
  -- Parse tree is returned as a table variable
, out p_dom table (
    -- unique id of the node
    node_id           int
    -- id of the parent node
  , parent_node_id    int
    -- dom node type constant: 1=element, 2=attribute, 3=text, 4=cdata,
    --                         5=entityref, 6=entity, 7=processing instruction, 
    --                         8=comment, 9=document, 10=document type, 
    --                         11=document fragment, 12=notation
  , node_type         tinyint
    -- dom node name: tagname for element, attribute name for attribute, 
    --                target for processing instruction,
    --                document type name for document type,
    --                "#text" for text and cdata, "#comment" for comment, 
    --                "#document" for document, "#document-fragment" for document fragment.
  , node_name         nvarchar(64)  
    -- dom node value: text for text, comment, and cdata nodes, data for processing instruction node, null otherwise.
  , node_value        nclob
    -- raw token from the parser
  , token_text        nclob         
    -- character position of token
  , pos               int           
    -- length of token.
  , len               int           
  )
  -- flag whether to strip text nodes that only contain whitespace from the parse tree
, p_strip_empty_text  tinyint default 1
) 

Note that you can download the source code for the entire procedure from GitHub. The p_parse_xml procedure depends on p_decode_xml_entities, so if you want to run it yourself, be sure to install that first.

To see how you can use this, consider the following simple example:

call p_parse_xml(
'<parent-element attribute1="value1">
  <child-element attribute2="value2" attribute3="value3">
    text-content1
  </child-element>
  <child-element att="bla">
    text-content2
  </child-element>
</parent-element>', ?);

This gives us the following result:

+---------+----------------+-----------+----------------+-----------------------------+-------------------------------------------------------------+-----+-----+
| NODE_ID | PARENT_NODE_ID | NODE_TYPE | NODE_NAME      | NODE_VALUE                  | TOKEN_TEXT                                                  | POS | LEN |
+---------+----------------+-----------+----------------+-----------------------------+-------------------------------------------------------------+-----+-----+
| 0       | ?              | 9         | #document      | ?                           | ?                                                           | 1   | 221 |
| 1       | 0              | 1         | parent-element | ?                           | <parent-element attribute1=\"value1\">                      | 1   | 36  |
| 2       | 1              | 2         | attribute1     | value1                      |  attribute1=\"value1\"                                      | 2   | 20  |
| 3       | 1              | 1         | child-element  | ?                           | <child-element attribute2=\"value2\" attribute3=\"value3\"> | 41  | 55  |
| 4       | 3              | 2         | attribute2     | value2                      |  attribute2=\"value2\"                                      | 42  | 20  |
| 5       | 3              | 2         | attribute3     | value3                      |  attribute3=\"value3\"                                      | 62  | 20  |
| 6       | 3              | 3         | #text          | text-content1               | text-content1                                               | 96  | 23  | 
| 7       | 1              | 1         | child-element  | ?                           | <child-element att=\"bla\">                                 | 139 | 25  |
| 8       | 7              | 2         | att            | bla                         |  att=\"bla\"                                                | 140 | 10  |
| 9       | 7              | 3         | #text          | text-content2               | text-content2                                               | 164 | 23  |
+---------+----------------+-----------+----------------+-----------------------------+-------------------------------------------------------------+-----+-----+

The result is a tabular representation of the XML parse tree. Each row essentially represents a DOM Node, and the column values represent the node's properties:

  • The NODE_TYPE column tells us what kind of node we're dealing with. Values in this column conform to the w3c standard document object model (DOM) enumeration of node type values. The most important ones are 1 for element nodes ("tags"); 2 for attributes, and 3 for text. The entire parse tree is contained in a document node, which has node type 9.
  • The NODE_ID is the unique identifier of the node while PARENT_NODE_ID points to whatever node is considered the parent node of the current node. The parent node is basically the container of the node. As you can see, the element with NODE_ID=3 has the element node with NODE_ID=1 as parent. These correspond to the first <child-element> and <parent-element> elements in the document. Attribute nodes are also marked as children of the element to which they belong. The DOM standard does not consider attributes children of their respective element node, but p_parse_xml does, mainly to keep the result table as simple as possible.
  • The NODE_NAME column is a further characterization of what kind of node we're dealing with. For most node types, the node name is a constant value which is essentially a friendly name for the node type. For example, document nodes (NODE_TYPE=9) always have #document as NODE_NAME, and text nodes (NODE_TYPE=3) always have #text as NODE_NAME. For element nodes and attribute nodes (NODE_TYPE is 1 and 2 respectively), the NODE_NAME is not constant. Rather, their node name conveys information about the meaning of the node and its contents. In other words, element and attribute names are metadata.
  • The NODE_VALUE column contains actual data. For element and document nodes, it is always NULL. For attributes, the NODE_VALUE column contains the attribute value, and for text nodes, it is the text content.
  • The POS column lists the position where the current item was found; the LEN column keeps track of the length of the current item as it appears in the document. Typically you won't need these columns, except maybe for debugging purposes. The TOKEN_TEXT column is also here mostly for debugging purposes.
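
The sections that follow assume that the XML definition of an information view has already been parsed, and that the resulting parse tree is available in a table variable called tab_dom. As a minimal sketch, and assuming your HANA revision supports SQLScript anonymous blocks (the package and view name are again placeholders), you could wire that up like this:

DO BEGIN
  DECLARE v_xml NCLOB;

  -- fetch the XML definition of the information view from the repository
  SELECT  CDATA INTO v_xml
  FROM    _SYS_REPO.ACTIVE_OBJECT
  WHERE   PACKAGE_ID    = 'my.package.name'
  AND     OBJECT_NAME   = 'CA_MY_CALCULATION_VIEW'
  AND     OBJECT_SUFFIX = 'calculationview';

  -- parse the XML; the parse tree is captured in the table variable tab_dom
  CALL p_parse_xml(:v_xml, tab_dom);

  -- tab_dom can now be queried as shown in the next sections
  SELECT * FROM :tab_dom;
END;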

Extracting Base Columns from Analytic and Attribute views

If you examine the XML definition of Analytic and/or Attribute views, you'll notice that table base columns are referenced by <keyMapping> and <measureMapping>-elements like this:

<keyMapping schemaName="...database schema..." columnObjectName="...table name..." columnName="...column name..."/>

So, assuming we already parsed the model of an analytic or attribute view using p_parse_xml and captured its result in a table variable called tab_dom, we can run a query like this to obtain all <keyMapping> and <measureMapping>-elements:

select     mapping.*
from       :tab_dom mapping
where      mapping.node_type = 1            -- get us all elements
and        mapping.node_name in ('keyMapping' -- with tagnames 'keyMapping' or 'measureMapping'
                                   ,'measureMapping') 

While this gives us the actual elements, the data we're really interested in is buried in the attributes of the <keyMapping> and <measureMapping>-elements. You might recall that in the p_parse_xml result, attribute nodes have NODE_TYPE=2 and appear as child nodes of their respective element. So, we can extract all attributes of all <keyMapping> and <measureMapping>-elements with a self-join like this:

select      mapping_attributes.*
from        :tab_dom            mapping
inner join  :tab_dom            mapping_attributes
on          mapping.node_id   = mapping_attributes.parent_node_id -- find all nodes that have the keymapping element node as parent 
and         2                 = mapping_attributes.node_type      -- but only if their node type indicates they are attribute nodes
where       mapping.node_type = 1                                    
and         mapping.node_name in ('keyMapping', 'measureMapping')                         

Since we are interested in not just any attribute node, but attribute nodes having specific names like schemaName, columnObjectName and columnName, we should put a further restriction on the NODE_NAME of these attribute nodes. Also note that this query will potentially give us multiple rows per <keyMapping> or <measureMapping>-element (in fact, just as many as there are attributes). Since we'd like to have just one row for each <keyMapping> or <measureMapping>-element having the values of its schemaName, columnObjectName and columnName attributes in separate columns, we should rewrite this query so that each attribute gets its own self-join.

Thus, the final query becomes:

select     mapping_schemaName.node_value       as schema_name
,          mapping_columnObjectName.node_value as table_name
,          mapping_columnName.node_value       as column_name
from       :tab_dom               mapping
inner join :tab_dom               mapping_schemaName                      -- get the attribute called 'schemaName'
on         mapping.node_id      = mapping_schemaName.parent_node_id
and        2                    = mapping_schemaName.node_type 
and        'schemaName'         = mapping_schemaName.node_name 
inner join :tab_dom               mapping_columnObjectName                -- get the attribute called 'columnObjectName'
on         mapping.node_id      = mapping_columnObjectName.parent_node_id 
and        2                    = mapping_columnObjectName.node_type 
and        'columnObjectName'   = mapping_columnObjectName.node_name 
inner join :tab_dom               mapping_columnName                      -- get the attribute called 'columnName'
on         mapping.node_id      = mapping_columnName.parent_node_id       
and        2                    = mapping_columnName.node_type 
and        'columnName'         = mapping_columnName.node_name 
where      mapping.node_type = 1
and        mapping.node_name in ('keyMapping', 'measureMapping')                         

Extracting base columns from Calculation views

Getting the base columns used in calculation views is a bit more work. However, the good news is that in terms of the queries we need to write, it does not get much more complicated than what we witnessed for analytic and attribute views in the previous section. Querying the XML parse tree almost always boils down to finding elements and their attributes, and then doing something with their values.

The reason why it is more work to write queries against the model underlying calculation views is that the XML documents that define calculation views use an extra level of mapping between the objects that represent the source of the columns and the way these columns are used inside the view. The following snippet illustrates this:

<Calculation:scenario ...>
  ...
  <dataSources>
    ...
    <DataSource id="...some id used to refer to this datasource..." type="DATA_BASE_TABLE">
      ...
      <columnObject schemaName="...db schema name..." columnObjectName="...table name..."/>
      ...
    </DataSource>
    ...
  </dataSources>
  ...
  <calculationViews>
    ...
    <calculationView>
      ...
      <input node="#...id of a DataSource element...">
        ...
        <mapping source="...name of a column used as input..." ... >
        ...
      </input>
      ...
    </calculationView>
    ...
  </calculationViews>
  ...
</Calculation:scenario>

The method for finding the base columns can be summarized as follows:

  1. Get all <DataSource>-elements having a type-attribute with the value "DATA_BASE_TABLE". These elements represent all base tables used by this view. Other types of objects used by this view will have another value for the type-attribute.

    To obtain the schema and table name of the base table, find the <columnObject> child element of the <DataSource>-element. Its schemaName and columnObjectName-attributes respectively contain the database schema and table name of the base table.

    The <DataSource>-elements have an id-attribute, and its value is used as a unique identifier to refer to this data source.

  2. Find all instances where the base table datasources are used.

    A calculation view is essentially a graph of data transformation steps, each of which takes one or more streams of data as input and turns them into a stream of output data. In the XML document that defines the calculation view, these transformation steps are represented by <calculationView>-elements. These <calculationView>-elements contain one or more <input> child elements, each of which represents a data stream that is used as input for the transformation step.

    The <input>-elements have a node-attribute. The value of the node-attribute is the value of the id-attribute of whatever element it refers to, prefixed by a hash-sign (#).

    Note that this is a general technique to reference elements within the same XML document. So, in order to find where a <DataSource>-element is used, it is enough to find all elements in the same XML document whose node-attribute references the value of that <DataSource>-element's id-attribute.

  3. Once we have the elements that refer to our <DataSource>-element, we can find out which columns from the data source are used by looking for <mapping>-child elements.

    The <mapping>-elements have a source-attribute, which holds the column-name.

With these steps in mind, the SQL query we need to do on the calculation view parse tree becomes:

select     distinct
           ds_co_schemaName.node_value         schema_name
,          ds_co_columnObjectName.node_value   table_name
,          ds_usage_mapping_source.node_value  column_name
--
-- ds: DataSource elements (Note the WHERE clause)
-- 
from       :tab_dom                            ds
--
-- ds_type: demand that the value of the type-attribute of the DataSource elements equal 'DATA_BASE_TABLE'
--          this ensures we're only looking at base tables.
-- 
inner join :tab_dom                            ds_type
on         ds.node_id                        = ds_type.parent_node_id
and        2                                 = ds_type.node_type
and        'type'                            = ds_type.node_name
and        'DATA_BASE_TABLE'                 = cast(ds_type.node_value as varchar(128))
--
-- ds_co: get the columnObject child element of the DataSource element.
--        Also, get the schemaName and columnObjectName attributes of that columnObject-element. 
--
inner join :tab_dom                            ds_co
on         ds.node_id                        = ds_co.parent_node_id
and        1                                 = ds_co.node_type
and        'columnObject'                    = ds_co.node_name
inner join :tab_dom                            ds_co_schemaName
on         ds_co.node_id                     = ds_co_schemaName.parent_node_id
and        2                                 = ds_co_schemaName.node_type
and        'schemaName'                      = ds_co_schemaName.node_name
inner join :tab_dom                            ds_co_columnObjectName
on         ds_co.node_id                     = ds_co_columnObjectName.parent_node_id
and        2                                 = ds_co_columnObjectName.node_type
and        'columnObjectName'                = ds_co_columnObjectName.node_name
--
-- ds_id: get the id-attribute of the DataSource element.
--
inner join :tab_dom                            ds_id
on         ds.node_id                        = ds_id.parent_node_id
and        2                                 = ds_id.node_type
and        'id'                              = ds_id.node_name
--
-- ds_usage: find any attributes that refer to the id of the DataSource 
--
inner join :tab_dom                            ds_usage
on         'node'                            = ds_usage.node_name
and        2                                 = ds_usage.node_type
and        '#'||ds_id.node_value             = cast(ds_usage.node_value as nvarchar(128))
--
-- ds_usage_mapping: find any mapping child elements of the node that references the DataSource
--
inner join :tab_dom                            ds_usage_mapping
on         'mapping'                         = ds_usage_mapping.node_name
and        1                                 = ds_usage_mapping.node_type
and        ds_usage.node_id                  = ds_usage_mapping.parent_node_id
--
-- ds_usage_mapping_source: get the source-attribute of the mapping elements. These are our base column names.
--
inner join :tab_dom                            ds_usage_mapping_source
on         'source'                          = ds_usage_mapping_source.node_name
and        2                                 = ds_usage_mapping_source.node_type
and        ds_usage_mapping.node_id          = ds_usage_mapping_source.parent_node_id
where      ds.node_type                      = 1
and        ds.node_name                      = 'DataSource'

Putting it all together

To recapitulate, we discussed:
  1. How to do general queries for dependencies using OBJECT_DEPENDENCIES, but that you need to query _SYS_REPO.ACTIVE_OBJECTCROSSREF to find out which Attribute views are used by Analytic views.
  2. How to find the XML model underlying our information views in the _SYS_REPO.ACTIVE_OBJECT table.
  3. How to parse that XML, and how to query the parse tree for elements and attributes.
  4. How the XML documents for information views are structured, and how to find the base columns used in their models.

With all these bits and pieces of information, we can finally create a procedure that fulfills the original requirement to obtain the base columns used by our information views. This is available as the p_get_view_basecols stored procedure. Here is its signature:

create PROCEDURE p_get_view_basecols (
  -- package name pattern. Used to match packages containing analytic, attribute or calculation views. Can contain LIKE wildcards.
  p_package_id    nvarchar(255)
  -- object name pattern. Used to match name of analytic, attribute or calculation views. Can contain LIKE wildcards.
, p_object_name   nvarchar(255) default '%'
  -- object suffix pattern. Can be used to specify the type of view. Can contain LIKE wildcards.
, p_object_suffix nvarchar(255) default '%'
  -- flag to indicate whether to recursively analyze analytic, attribute or calculation views on which the view to be analyzed depends. 
  -- 0 means only look at the given view, 1 means also look at underlying views.
, p_recursive     tinyint default 1
  -- result table: base columns on which the specified view(s) depends.
, out p_cols table (
    -- schema name of the referenced base column
    schema_name nvarchar(128)
    -- table name of the referenced base column
  , table_name  nvarchar(128)
    -- column name of the referenced base column
  , column_name nvarchar(128)
    -- list of view names that depend on the base column
  , views       nclob
  )
)

Obtaining the list of base columns on which our application depends is now as simple as calling the procedure, like so:

call p_get_view_basecols(
  -- look in our application package (and its subpackages)
  'our.application.package.%' 
  -- consider all information views
, '%'
  -- consider all types of information views
, '%'
  -- consider also information views upon which our information views depend
, 1
  -- put the results into our output table
, ?
);
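
If you want to post-process the result - for example, to check whether one particular base column is still used by any information view - you can capture the output parameter in a table variable. The sketch below again assumes SQLScript anonymous block support; the table and column names are made up:

DO BEGIN
  -- capture the output parameter of p_get_view_basecols in the table variable cols
  CALL p_get_view_basecols('our.application.package.%', '%', '%', 1, cols);

  -- example: is this (made-up) base column still used anywhere?
  SELECT  *
  FROM    :cols
  WHERE   table_name  = 'MY_BASE_TABLE'
  AND     column_name = 'MY_COLUMN';
END;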

Finally

I hope you enjoyed this post! Feel free to leave a comment to share your insights or to give feedback.

Please note that all source code for this topic is freely available as open source software in our just-bi/hades GitHub repository. You are free to use, modify and distribute it, as long as you respect the copyright notice.

We welcome contributions! You can contribute in many ways:

  • Simply use the procedures. Give us feedback. You can do so by leaving a comment on this blog.
  • Spread the word: tell your colleagues, and maybe tweet or write a blog post about it. Please use the hashtag #justbihades.
  • Share your requirements. Create an issue to ask for more features so we can improve our software.
  • Fork it! Send us pull requests. We welcome your contribution and we will fully attribute you!

Thursday, September 22, 2016

SAP UI5: Internationalization for each view - Addendum for Nested Views

After writing my previous post on SAP UI5: Per-view Internationalization, I found out that the solution does not work completely as intended when using nested views.

If you're using nested views, each view would still have its own set of unique texts that are entirely specific to just that view, and for those cases, the solution as described still works. But there might also be a number of texts that are shared by the outer view and one or more of the inner views. It would make sense to be able to define those texts in the i18n model at the level of the outer view, and have the i18n models of the nested views pick up and enhance the i18n model of the outer view.

Problem: onInit() is not the right place to initialize the i18n model

The problem with the original solution is that the onInit() method of the nested views gets called before that of the outer view. It makes sense - the whole can be initialized only after its parts have been initialized. But this does mean that the onInit() method is not the right place to initialize the i18n model.

Please consider these lines from the _initI18n() method that I proposed to initialize the i18n model:

      //Use the bundledata to create or enhance the i18n model
      var i18nModel = this.getModel(i18n);
      if (i18nModel) {
        i18nModel.enhance(bundleData);
      }
      else {
        i18nModel = new ResourceModel(bundleData);
      }

      //set this i18n model.
      this.setModel(i18nModel, i18n);

Suppose this code runs as part of a nested view's onInit(). The call to getModel() will try to acquire the i18n model that is already set, or else the i18n model of the owner component. That's how the getModel() method in the base controller works (please see my previous blog post to review that code).

Now, at this point, no i18n model has been set for the view, and so the owner component's i18n model will be picked up. The i18n model of the outer view will however never be found, since the onInit() of the controller of the outer view has not been called yet (and therefore, its _initI18n() has not been called either).

Solution: Use onBeforeRendering() rather than onInit()

It turns out that this can be solved by calling the _initI18n() method in the onBeforeRendering() method rather than in the onInit() method. While nested views are initialized before the outer view, it's the other way around for the rendering process. This makes sense: rendering the outer view requires rendering the views nested inside it. So the onBeforeRendering() method of the outer view is called before the onBeforeRendering() method of its nested views. (It's the other way around for onAfterRendering(): an outer view is done rendering only after its nested views are rendered.)

Ensure i18n initialization occurs only once

There is one extra consideration when moving the i18n initialization from onInit() to onBeforeRendering(): views may go through multiple rendering cycles, whereas onInit() only runs once. If there are repeated rendering cycles, we do not want to reinitialize the i18n model, so we add a flag that ensures the i18n model is initialized only once:

    ...
    onInit: function(){
      this._initI18n();
    },
    onBeforeRendering: function(){
      this._initI18n();
    },
    _i18nInitialized: false,
    _initI18n: function(){
      if (this._i18nInitialized === true) {
        return;
      }
      var i18n = "i18n";

      //create bundle descriptor for this controller's i18n resource data
      var metadata = this.getMetadata();
      var nameParts = metadata.getName().split(".");
      nameParts.pop();
      nameParts.push(i18n);
      nameParts.push(i18n);
      var bundleData = {bundleName: nameParts.join(".")};

      //Use the bundledata to create or enhance the i18n model
      var i18nModel = this.getModel(i18n);
      if (i18nModel) {
        i18nModel.enhance(bundleData);
      }
      else {
        i18nModel = new ResourceModel(bundleData);
      }

      //set this i18n model.
      this.setModel(i18nModel, i18n);
      this._i18nInitialized = true;
    },
    ...

Overriding onBeforeRendering() in extensions of the base controller

And of course, when extending the base controller, you'll need to remember to call the onBeforeRendering() method of the ascendant when overriding the onBeforeRendering() method:

sap.ui.define([
  "just/bi/apps/components/basecontroller/BaseController"
], function(Controller){
  "use strict";
  var controller = Controller.extend("just.bi.apps.components.mainpanel.MainPanel", {
    onBeforeRendering: function(){
      Controller.prototype.onBeforeRendering.call(this);
      ...
    }
  });
  return controller;
});

Finally

I hope you enjoyed this addendum. Feel free to share your insights if you think there is a better way to handle i18n.

Sunday, September 18, 2016

SAP UI5: Per-view Internationalization

NOTE: There is an addendum to this blog post that suggests a number of improvements. You can check out the addendum here: SAP UI5: Internationalization for each view - Addendum for Nested Views.

Quite recently, I dove into SAP UI5 development. To educate myself, I followed a workshop and I used the Walkthrough.

During my explorations, I ran into a particular issue which I didn't see very readily addressed. I also found a solution for this particular issue, and even though I still have a ton to learn, I think it is worth sharing. So, here goes:

The Feature: Translatable texts and the i18n model

One of the SAP UI5 features highlighted in the Walkthrough is treatment of translatable texts. In the walkthrough this is realized by setting up a resource model, the i18n model.

The i18n model is sourced from i18n .properties files, which are essentially lists of key/value pairs, one pair per line, with key and value separated by an equals sign:

# Each line is a key=value pair.
greetingAction=Say Hello
greeting=Hello {0}!

To actually set up an i18n model using these texts, you can explicitly instantiate a sap.ui.model.resource.ResourceModel:

sap.ui.define([
    "sap/ui/core/mvc/Controller",
    "sap/ui/model/resource/ResourceModel"
], function(Controller, ResourceModel){
    "use strict";
    return Controller.extend("my.app.App", {
        onInit: function(){
            var i18nModel = new ResourceModel({
                bundleName: "just.bi.apps.JustBiApp.i18n.i18n"
            });
            this.getView().setModel(i18nModel, "i18n");
        }
    });
});

Or, you can have your application instantiate the model by listing it in the models property of the sap.ui5 entry in the manifest.json application descriptor file:

"models": {
 "i18n": {
   "type": "sap.ui.model.resource.ResourceModel",
    "settings": {
      "bundleName": "just.bi.apps.JustBiApp.i18n.i18n"
    }
  }
}

In many cases, texts are required for the static labels of UI elements like input fields, menus and so on. Inside a view, static texts may be retrieved from the i18n model through a special data binding syntax, like so:

<!-- Button text will read "Say Hello" -->
<Button text="{i18n>greetingAction}"/>

Texts may also be retrieved programmatically inside controller code by calling the .getText() method on the resource bundle object. The resource bundle object may be obtained from the i18n resource model with the getResourceBundle() getter method:

var bundle = this.getModel("i18n").getResourceBundle();
var text = bundle.getText("greeting", ["World"]);      // text has value "Hello World!"

Now, the cool thing is that you can write a separate i18n .properties file for each locale that you want to support. The framework discovers which locale is required by the client and uses that to find the best matching i18n files for the client's locale.

The file name is used to identify to which language and/or locale the texts inside the file apply. For example, you'd put the German texts in a i18n_de.properties file, and the English texts in a i18n_en.properties file, and if you want to distinguish between British and American English, you'd create both a i18n_en_GB.properties and i18n_en_US.properties file.

(I haven't found out exactly which standard the SAP UI5 i18n .properties files follow, but from what I've seen so far I think it's safe to assume that you can use the two-letter lowercase ISO 639-1 code for the language and the two-letter uppercase ISO 3166-1 code for the country.)


The Problem: One i18n Model for entire application

Now, the walkthrough demonstrates the feature by adding one i18n model for the entire application so that it becomes available in any of the views that make up the application. I appreciate that the walkthrough is not the right place to cover all kinds of more advanced scenarios, so I can understand why it settles for just one application-wide i18n model.

However, I can't help feeling that this is not an ideal approach. The main reason is that it seems at odds with the fact that many texts are specific to just one particular view. This challenges both the development workflow and the reusability of our application components:


  • Each time you create or modify a particular view, you also have to edit the global i18n .properties files. To keep things manageable, you will probably invent some kind of view-specific prefix for the keys pertaining to that view, and you'll probably end up creating a view-specific block in that i18n file. At some point, you'll end up with a lot of lines per i18n file, which is not very maintainable.
  • Suppose you want to reuse a particular view in another application. Contrary to the practice used in the Walkthrough, I like to keep a view and its associated controller together, in a folder separate from any other view and controller. This way I can easily copy, move or remove the things that belong together. Except that the texts, which also belong to that view/controller, live in the global i18n .properties file and need to be managed separately.

The Solution: Keep per-view i18n files near view and controller code

The solution I found is to create an i18n subfolder beneath the folder that contains my Controller and View. Since I already keep each associated view and controller together, and separate from the other views and controllers, this approach makes sense: It's just one step further in keeping code and resources that depend directly on each other physically together.


So, this is what my file and folder structure looks like:

[Figure: FolderStructure.png - the project's file and folder structure]

So, the webapp folder is the root of the sap ui5 project. The components folder is where I keep subfolders for each functional unit (i.e. View+Controller+resources) of the application. In the picture, you see two such subfolders, basecontroller (more about that below) and mainpanel.

The mainpanel folder is the one that contains an actual component of my application - a MainPanel View, and its controller (in MainPanel.view.xml and MainPanel.controller.js respectively). Here we also find the i18n folder specific to this view, and inside are the i18n .properties files (one for each locale we need to support).

In order to load and apply the view-specific i18n .properties files, I'm using a generic extension of sap.ui.core.mvc.Controller which loads the "local", view-specific i18n resource bundle. This extension is called BaseController and lives in the basecontroller folder. Here's the code:

sap.ui.define([
  "sap/ui/core/mvc/Controller",
  "sap/ui/model/resource/ResourceModel"
], function(Controller, ResourceModel){
  "use strict";
  var controller = Controller.extend("just.bi.apps.components.basecontroller.BaseController", {
    onInit: function(){
      this._initI18n();
    },
    _initI18n: function(){
      var i18n = "i18n";

      //create bundle descriptor for this controller's i18n resource data
      var metadata = this.getMetadata();
      var nameParts = metadata.getName().split(".");
      nameParts.pop();
      nameParts.push(i18n);
      nameParts.push(i18n);
      var bundleData = {bundleName: nameParts.join(".")};

      //Use the bundledata to create or enhance the i18n model
      var i18nModel = this.getModel(i18n);
      if (i18nModel) {
        i18nModel.enhance(bundleData);
      }
      else {
        i18nModel = new ResourceModel(bundleData);
      }

      //set this i18n model.
      this.setModel(i18nModel, i18n);
    },
    getModel: function(modelname){
      var view = this.getView();
      var model = view.getModel.apply(view, arguments);
      if (!model) {
        var ownerComponent = this.getOwnerComponent();
        if (ownerComponent) {
          model = ownerComponent.getModel(modelname);
        }
      }
      return model;
    },
    setModel: function(model, modelName){
      var view = this.getView();
      view.setModel.apply(view, arguments);
    },
  });
  return controller;
});

Note how the BaseController initializes the i18n model by calling the _initI18n() method. In this method we extract the class name of the this object from its metadata (using the .getName() getter on the metadata obtained using the .getMetadata() getter), and we pop off the unqualified class name to obtain its namespace. We then add the string "i18n" twice - once for the folder, and once for the files inside it. We use this to create the bundle name, which we use to instantiate the actual ResourceModel.

Before setting that model to the controller's view, we check if there is already an i18n model set using getModel(). This is a utility method that gets the model from this controller's associated view, or of the component that "owns" the view and this controller.

If an i18n model is already available, we enhance it by calling the .enhance() method on it, rather than replacing it. This way, any texts defined at a higher level are still available in this controller and view. This gives us a functional i18n model, which we then set using setModel(), which simply calls setModel() on the view associated with this controller.

To actually use this BaseController, we extend it when creating a "real" controller:

sap.ui.define([
  "just/bi/apps/components/basecontroller/BaseController"
], function(Controller){
  "use strict";
  var controller = Controller.extend("just.bi.apps.components.mainpanel.MainPanel", {
    onInit: function(){
      Controller.prototype.onInit.call(this);
      ...
    }
  });
  return controller;
});

Note that if that real controller has its own onInit() method, we need to first call the onInit() method of BaseController, or rather, of whatever class we're extending. Fortunately, since we are extending it we already have a reference to it (in the snippet above, it's the Controller parameter injected into our definition), so we call its onInit() method via the prototype while using the controller that is currently being defined (this) as scope.

Finally

I hope you enjoyed this article! I hope it will be of use to you. If you have an alternative solution - quite possibly, a better one - then please feel free to leave a comment and point out your solution. I'm eager to hear and learn, so don't hesitate to share your opinion and point of view.

ADDENDUM: Nested views

It turns out the approach described in this post does not work well when using nested views. Fortunately, a simple improvement of the ideas in this post solves that problem. The approach is described in my next blog post, SAP UI5: Internationalization for each view - Addendum for Nested Views.