Basics of Data Structure

Arif Zainurrohman
Published in Nerd For Tech · 8 min read · Apr 4, 2021


Introduction

The principal tool for achieving sound data structures is normalization, a set of rules for allocating data to tables in such a way as to eliminate certain types of redundancy and incompleteness.

Normalization is usually one of the later activities in a data modeling project, as we cannot start normalizing until we have established what columns (data items) are required.

Normalization is used in the logical database design stage, following requirements analysis and conceptual modeling.

An Informal Example of Normalization

Normalization is essentially a two-step process:

1. Put the data into a tabular form (by removing repeating groups).

2. Remove duplicated data to separate tables.

Suppose we want to store data about employees and their qualifications in a database. The first task is to put it into tabular form, but we immediately strike a problem: because an employee can have more than one qualification, it is awkward to fit the qualification data into one row of a table.
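
As a sketch of the two steps (the table and column names here are invented for illustration, not taken from the original example):

```sql
-- Awkward: the repeating group is flattened into a fixed number of
-- columns. What if an employee earns a fourth qualification?
CREATE TABLE employee_flat (
    emp_no         INTEGER PRIMARY KEY,
    emp_name       VARCHAR(60),
    qualification1 VARCHAR(30),
    qualification2 VARCHAR(30),
    qualification3 VARCHAR(30)
);

-- Better: remove the repeating group to its own table, carrying the
-- employee key with it to preserve the link.
CREATE TABLE employee (
    emp_no   INTEGER PRIMARY KEY,
    emp_name VARCHAR(60)
);

CREATE TABLE employee_qualification (
    emp_no        INTEGER REFERENCES employee (emp_no),
    qualification VARCHAR(30),
    PRIMARY KEY (emp_no, qualification)
);
```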

Relational Notation


Before going further, we need a compact notation. A normalized model can be written in relational notation: the table name followed by its column names in parentheses, with the primary-key columns conventionally underlined or otherwise marked.
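
For example, the employee and qualification tables sketched above would be written as follows (primary-key columns marked here with an asterisk, since underlining is not available in plain text):

```
EMPLOYEE (EmpNo*, Name)
QUALIFICATION (EmpNo*, Qualification*)
```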

Determining Columns

We need to do a little preparation and tidying up. Normalization relies on certain assumptions about the way data is represented, and we need to make sure that these are valid. There are also some problems that normalization does not solve, and it is better to address these at the outset, rather than carrying excess baggage through the whole normalization process.

1. One Fact per Column

First we make sure that each column in the table represents one fact only. A column that bundles two distinct facts, for example a qualification and the year in which it was obtained, should be split so that each fact is recorded in its own column. We will see that this decision makes an important difference to the structure of our final model.
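
A minimal sketch of such a split (the combined column and its replacements are hypothetical):

```sql
-- Bad: one column bundles two facts, e.g. qual_and_year = 'BSc 1998'.
-- Good: one fact per column, so each can be queried, sorted, and
-- validated independently.
CREATE TABLE employee_qualification_split (
    emp_no        INTEGER,
    qualification VARCHAR(30),
    year_obtained SMALLINT,
    PRIMARY KEY (emp_no, qualification)
);
```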

2. Hidden Data

The second piece of tidying up involves making sure that we have not lost any data in the translation to tabular form. The most common problem here is that we cannot rely on the rows of a table being stored in any particular order. Suppose the original survey forms had been filed in order of return: that ordering would be lost when the rows were loaded into a table, unless we record it explicitly in a column.
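
A small sketch of making the hidden ordering explicit (the table and columns are assumptions for illustration):

```sql
-- Row order carries no meaning in a relational table, so the order in
-- which forms were returned must be stored as data.
CREATE TABLE survey_response (
    emp_no        INTEGER PRIMARY KEY,
    date_returned DATE,     -- preserves the filing order as data
    return_seq    INTEGER   -- or, alternatively, an explicit sequence number
);
```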

3. Derivable Data

Remember our basic objective of nonredundancy. We should remove any data that can be derived from other data in the table and amend the columns accordingly.
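
For instance (a hypothetical person table, PostgreSQL syntax for the date arithmetic): age is derivable from a date of birth, so it should be computed on demand rather than stored.

```sql
CREATE TABLE person (
    person_no     INTEGER PRIMARY KEY,
    date_of_birth DATE NOT NULL
);

-- The derivable value (age) is not stored; a view computes it on
-- demand, so it can never become stale or inconsistent.
CREATE VIEW person_with_age AS
SELECT person_no,
       date_of_birth,
       EXTRACT(YEAR FROM age(current_date, date_of_birth)) AS age_years
FROM   person;
```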

4. Determining the Primary Key

Finally, we determine a primary key for the table. The choice of primary keys is a critical (and sometimes complex) task.

Repeating Groups and First Normal Form

The first normal form (1NF) involves the removal of repeating groups. Examples of such repeating groups are contacts and categories. Thus for a given customer, one or more contacts and one or more categories can exist. For each repeating group, the repeating group is moved to a separate table.
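
A hedged sketch of the customer example (column names invented): each repeating group moves to its own table, keyed by the parent's key plus whatever is needed to identify one occurrence.

```sql
CREATE TABLE customer (
    customer_no INTEGER PRIMARY KEY,
    name        VARCHAR(60)
);

-- One row per contact, however many a customer has.
CREATE TABLE customer_contact (
    customer_no  INTEGER REFERENCES customer (customer_no),
    contact_name VARCHAR(60),
    PRIMARY KEY (customer_no, contact_name)
);

-- One row per category, likewise unbounded.
CREATE TABLE customer_category (
    customer_no INTEGER REFERENCES customer (customer_no),
    category    VARCHAR(30),
    PRIMARY KEY (customer_no, category)
);
```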

1. Limit on Maximum Number of Occurrences

One problem with our “repeating group” of data is that we have to set an arbitrary maximum number of repetitions, large enough to accommodate the greatest number that might ever occur in practice.

2. Data Reusability and Program Complexity

You might argue that some inquiries are always going to be intrinsically more complicated than others. But consider what would have happened if we had designed the table on the basis of “one row per entity,” with a fixed set of columns for the repeating group: every query and program touching that data would have to examine each of the repeated columns explicitly, making even simple questions awkward to express.

3. Recognizing Repeating Groups

At this point, we should also check whether there are any repeating groups that have not been marked as such. To do this, we need to ask whether there are any data items that could have multiple values for a given value of the key.

4. Removing Repeating Groups

A general and flexible solution should not set any limits on the maximum number of occurrences of repeating groups. It should also neatly handle the situation of few or no occurrences.

5. Determining the Primary Key of the New Table

Finding the key of the new table was not easy (in fact, this is usually the trickiest step in the whole normalization process). We had to ask, “What is the minimum combination of columns needed to uniquely identify one row?”

6. First Normal Form

First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain.

Second Normal Form

To appreciate second normal form, you must understand the idea of functional dependency. A functional dependency is a relationship between or among attributes: if the value of one attribute determines the value of another, the second is said to be functionally dependent on the first.
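
As an illustration (the table is hypothetical), a dependency such as emp_no → emp_name holds only if no employee number ever appears with two different names; a query can check this directly:

```sql
-- Hypothetical denormalized table for illustration.
CREATE TABLE employee_data (
    emp_no   INTEGER,
    emp_name VARCHAR(60)
);

-- Any rows returned are violations of the functional dependency
-- emp_no -> emp_name.
SELECT emp_no
FROM   employee_data
GROUP  BY emp_no
HAVING COUNT(DISTINCT emp_name) > 1;
```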

1. Problems with Tables in First Normal Form

Tables in first normal form can still hold the same fact many times over and can force awkward special-case processing; for example, “last operations” would need to be handled explicitly and differently from the rest, a fairly clear violation of our elegance criterion.

2. Eliminating Redundancy

We can solve all of these problems by moving the duplicated data to a separate table in which each entity number appears only once.
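
A common sketch of the idea (an order/product example of my own, not the book's): in a table keyed by (order_no, product_no), a column such as product_name depends on product_no alone, part of the key, so it repeats on every order line for that product. Second normal form moves it out:

```sql
-- The partially dependent column lives in a table keyed by its
-- determinant, so each product name is stored exactly once.
CREATE TABLE product (
    product_no   INTEGER PRIMARY KEY,
    product_name VARCHAR(60)
);

CREATE TABLE order_line (
    order_no   INTEGER,
    product_no INTEGER REFERENCES product (product_no),
    qty        INTEGER,
    PRIMARY KEY (order_no, product_no)
);
```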

3. Determinants

It is important to understand that this whole procedure of separating out data relied on the fact that, for a given key value, there could be only one value of the dependent data. In fact, we can look at that dependency as the cause of the problem: a column whose value is determined by something less than the whole key is a candidate for removal to its own table.

Third Normal Form

To appreciate third normal form, you must understand transitive dependencies. Tables in second normal form are especially vulnerable to some types of modification anomalies, in particular those that come from transitive dependencies. A transitive dependency occurs when one attribute depends on a second attribute, which in turn depends on a third attribute.
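
A minimal sketch (invented employee/department columns): emp_no determines dept_no, and dept_no determines dept_name, so dept_name transitively depends on emp_no. Third normal form moves dept_name to a table keyed by its determinant:

```sql
CREATE TABLE department (
    dept_no   INTEGER PRIMARY KEY,
    dept_name VARCHAR(60)
);

-- dept_name no longer repeats for every employee in the department.
CREATE TABLE employee_3nf (
    emp_no   INTEGER PRIMARY KEY,
    emp_name VARCHAR(60),
    dept_no  INTEGER REFERENCES department (dept_no)
);
```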

  1. What Happened to Second Normal Form?

Our approach took us directly from first normal form (data in tabular form) to third normal form. Most texts treat this as a two-stage process, dealing first with determinants that are part of the table's key (yielding second normal form) and later with nonkey determinants (yielding third normal form).

2. Is “Third Normal Form” the Same as “Fully Normalized”?

Unfortunately, no. There are three further well-established normal forms: Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF). In particular, 4NF and 5NF problems usually arise only when dealing with tables in which every column is part of the key.

By the way, “all key” tables are legitimate and occur quite frequently in fully normalized structures. A Sixth Normal Form (6NF) has been proposed, primarily to deal with issues arising in representing time-dependent data.
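
An example of an “all key” table (names assumed): a many-to-many relationship, such as employees assigned to projects, is naturally represented by a table in which every column is part of the key.

```sql
CREATE TABLE employee_project (
    emp_no     INTEGER,
    project_no INTEGER,
    PRIMARY KEY (emp_no, project_no)   -- every column is part of the key
);
```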

3. What about Performance? Surely all Those Tables Will Slow Things Down?

There are certainly a lot of tables for what might seem to be relatively little data. Whether that translates into slower processing depends on the DBMS, the data volumes, and the access patterns; if performance does become a problem, the trade-offs involved in compromising normalization are discussed under “Denormalization and Unnormalization” below.

Definitions and a Few Refinements

1. Determinants and Functional Dependency

A determinant is a column, or combination of columns, whose value determines the value of one or more other columns. Equivalently, we can say that those other nominated columns are functionally dependent on the determinant. The determinant concept is what 3NF is all about; we are simply grouping data items around their determinants.

2. Primary Keys

A primary key is a nominated column or combination of columns that has a different value for every row in the table. Each table has one (and only one) primary key.

3. Candidate Keys

A table may contain more than one column, or combination of columns, that could serve as the primary key; each of these is called a candidate key. If we naively treated every determinant as the basis for a new table, all we would achieve in such a case is a second table holding exactly the same data as the first, albeit with a different primary key.

To cover this situation formally, we need to be more specific in our rule for which determinants to use as the basis for new tables. We previously excluded the primary key; we need to extend this exclusion to all candidate keys.
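
A short sketch (hypothetical columns): both an employee number and a tax file number uniquely identify a row, so both are candidate keys; one is nominated as the primary key and the other enforced as an alternate key.

```sql
CREATE TABLE employee_ck (
    emp_no      INTEGER PRIMARY KEY,        -- the nominated primary key
    tax_file_no CHAR(9) NOT NULL UNIQUE,    -- the other candidate key
    emp_name    VARCHAR(60)
);
```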

4. A More Formal Definition of Third Normal Form

A table is in third normal form if every determinant of a nonkey column is a candidate key. The definition of Boyce-Codd Normal Form (BCNF) is even simpler: a table is in BCNF if the only determinants of any columns are candidate keys. The reason we deferred discussing BCNF until now is that identifying a BCNF problem requires the concept of candidate keys, which we have only just introduced.

5. Foreign Keys

Recall that when we removed repeating groups to a new table, we carried the primary key of the original table with us, to cross-reference or “point back” to the source. In moving from first to third normal form, we left determinants behind as cross-references to the relevant rows in the new tables. These cross-referencing columns are called foreign keys, and they are our principal means of linking data from different tables.

6. Referential Integrity

Modern DBMSs provide referential integrity features that ensure automatically that each foreign key value has a matching primary key value.
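
Reusing the department table from the 3NF sketch above, a declared foreign key makes the DBMS enforce referential integrity automatically:

```sql
CREATE TABLE employee_fk (
    emp_no  INTEGER PRIMARY KEY,
    dept_no INTEGER NOT NULL,
    -- The DBMS rejects any row whose dept_no has no matching department
    -- row, and refuses to delete a department that still has employees.
    FOREIGN KEY (dept_no) REFERENCES department (dept_no)
        ON DELETE RESTRICT
);
```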

7. Update Anomalies

Discussions of normalization often refer to update anomalies. The term nicely captures most of the problems that normalization addresses, particularly if the word “update” is used in its broadest sense to include the insertion and deletion of data, and if we are talking about structures that are at least in tabular form.
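
A quick sketch of the three classic anomalies against a hypothetical table that stores the department name redundantly on every employee row:

```sql
CREATE TABLE employee_denorm (
    emp_no    INTEGER PRIMARY KEY,
    emp_name  VARCHAR(60),
    dept_no   INTEGER,
    dept_name VARCHAR(60)
);

-- Update anomaly: renaming a department must touch every employee row;
-- missing any row leaves the database inconsistent.
UPDATE employee_denorm SET dept_name = 'Analytics' WHERE dept_no = 42;

-- Insertion anomaly: a department with no employees yet cannot be
-- recorded without inventing a dummy employee row.
-- Deletion anomaly: deleting a department's last employee also deletes
-- the only record that the department exists.
```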

8. Denormalization and Unnormalization

As we know, from time to time it is necessary to compromise one data modeling objective to achieve another. Occasionally, we will be obliged to implement database designs that are not fully normalized in order to achieve some other objective (most often performance). When doing this, it is important to look beyond “normalization” as a goal in itself, to the underlying benefits it provides: completeness, nonredundancy, the flexibility of extending repeating groups, ease of data reuse, and programming simplicity. These are what we sacrifice when we implement unnormalized, or only partly normalized, structures.

9. Column and Table Names

We took our column names from the original report form, and we made up table names as we needed them.

10. Choice, Creativity, and Normalization

Choice and creativity have not featured much in our discussion of normalization so far. Indeed, normalization by itself is a deterministic process, which makes it particularly attractive to teachers; it is always nice to be able to set a problem with a single right answer. The rigor of normalization, and the emphasis placed on it in teaching and research, has sometimes encouraged a view that data modeling as a whole is deterministic.

11. Terminology

Most theoretical work on relational structures uses a different set of terms: relations, attributes, and tuples for what we have been calling tables, columns, and rows, respectively. This is because much of the theory of tabular data organization, including normalization, comes from the mathematical areas of relational calculus and relational algebra.

Conclusion

Normalization is a set of techniques for organizing data into tables in such a way as to eliminate certain types of redundancy and incompleteness, and associated complexity and/or anomalies when updating it.

Normalization relies on correct identification of determinants and keys. In this chapter, we covered normalization to third normal form (3NF). A table is in 3NF if every determinant of a nonkey item is a candidate key.

References

Graeme Simsion, Data Modeling Theory and Practice.

“First normal form,” Wikipedia.

“SQL First, Second and Third Normal Forms,” dummies.com.
