Thursday, May 17, 2007

Why Raid?

Why do people think RAID means performance?

George Ou, Technical Director at ZDNet and a fellow ZDNet blogger, has a great post about real-life RAID performance - hardware vs. software - plus some helpful comments about data layout, especially for MS SQL Server. As George notes, data layout can have a major impact on storage performance. But what about RAID itself? What is the theory behind RAID performance?

RAID = Redundant Array of Inexpensive Disks

RAID wasn’t originally about performance, it was about cost. Some CompSci jocks at Berkeley wanted to use 5.25″ drives for a bunch of cheap storage because they couldn’t afford the then-common 9″ drives. The small drives had lower cost but also poor reliability, so the group added redundancy as they cobbled the drives together, creating the first RAID array.

But as they worked out the details, they saw that RAID could have performance advantages for certain workloads.

The three pillars of RAID performance

* Cache
* Striping
* Chunk size

Let’s look at all three.

Cache

Cache is simply RAM, or memory, placed in the data path in front of a disk or disk array. You can read or write RAM about 100,000 times faster than a fast disk and about 300,000 times faster than a slow disk. Either way, cache is a lot faster than disk.

Most external array controllers are redundant as well, with some kind of load-balancing driver that keeps both controllers productive. In this case, the cache has to be dual-ported: if a controller fails, the other controller can see all the pending writes of the failed controller and complete them.

Dual-ported cache, controller failover, dual server interfaces and failover drivers are all tricky to engineer and test, which is one reason why mid-range and high-end controllers are so expensive. But they sure speed writes up.
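
To make the speed ratios concrete, here is a minimal back-of-the-envelope sketch of how a cache changes average access time. The latency figures are illustrative assumptions, not measurements of any particular hardware:

```python
# Back-of-the-envelope effect of a RAM cache in front of a disk.
# The latency figures below are illustrative assumptions.
RAM_LATENCY_S = 100e-9   # ~100 ns to touch cache RAM
DISK_LATENCY_S = 10e-3   # ~10 ms average seek + rotation on a slow disk

def avg_access_time(hit_rate: float) -> float:
    """Average access time given the fraction of I/Os served from cache."""
    return hit_rate * RAM_LATENCY_S + (1.0 - hit_rate) * DISK_LATENCY_S

for hit_rate in (0.0, 0.5, 0.9, 0.99):
    print(f"hit rate {hit_rate:>4.0%}: {avg_access_time(hit_rate) * 1e3:.3f} ms")
```

Even a modest hit rate slashes the average: the disk term dominates so completely that every I/O absorbed by cache is, in effect, free.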

Striping for speed

Striping takes the virtual disk that the operating system sees and spreads it across several real, physical disks.

A RAID controller presents something that looks like a disk, like a C: drive, to the operating system, be it Windows, OS X or Linux. The RAID controller isn’t presenting a real disk. It is presenting a group of disks and making them look like a single disk to your computer.

The advantage is that instead of a single disk’s performance, you now have the performance of several disks. Instead of 50 I/Os per second (IOPS) on a 5400 RPM drive, you might have 150, 200 IOPS or more, depending on the number of drives. If you use fast 15k drives you might reach 900, 1,000 or more IOPS.

And instead of 1.5 gigabits per second bandwidth, you might have 4.5 or 6 Gb/sec. If you have a cache that you are in a hurry to empty, a nice fast stripe set is very helpful.
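
Here is the aggregate arithmetic as a quick sketch, using the per-drive figures from the text; the four-drive count and the no-overhead, perfectly-even spread are my simplifying assumptions:

```python
# Rough aggregate throughput of an N-disk stripe set, assuming I/Os
# spread evenly across members and the controller adds no overhead
# (both idealizations).
def stripe_performance(n_disks: int, iops_per_disk: float, gbps_per_disk: float):
    """Return (total IOPS, total Gb/s) for an ideal stripe set."""
    return n_disks * iops_per_disk, n_disks * gbps_per_disk

# Per-drive figures from the text: a 5400 RPM drive, ~50 IOPS, ~1.5 Gb/s.
iops, gbps = stripe_performance(n_disks=4, iops_per_disk=50, gbps_per_disk=1.5)
print(f"4-disk stripe: ~{iops:.0f} IOPS, ~{gbps:.1f} Gb/s")
```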

The hard part: spreading I/Os smoothly across all the disks. If they all jam up on one disk your costly storage system will be no faster than a single disk. That’s where chunk size comes in.

Chunk size: the hidden key to RAID performance

Stripes go across disk drives. But how big are the pieces of the stripe on each disk? The pieces a stripe is broken into are called chunks. Is the stripe broken into 1 KB pieces? 1 MB pieces? Or even larger? To get good performance you must have a reasonable chunk size.

So what is a reasonable chunk size? It depends on your average I/O request size. Here’s the rule of thumb: big I/Os = small chunks; small I/Os = big chunks.

Do you do video editing or a lot of Photoshop work? Then your average request size will be large and your performance will be dominated by how long it takes to get the data to or from the disks. So you want a lot of bandwidth to move data quickly. To get a lot of bandwidth you want each disk to shoulder part of the load, so you want a small chunk size. What is small? Anywhere from 512 bytes (one block) to 8 KB.

If you are running a database and doing lots of small I/Os - 512 bytes to 4 KB, say - then you want to maximize your IOPS, which ideally means sending each I/O to only one disk and spreading the I/Os evenly across the disks. What you don’t want is a single I/O getting sent to two disks, since waiting on two sets of heads will slow things down. So you want a large chunk size - at least 64 KB or more. A large chunk means that most I/Os get serviced by a single disk, leaving the remaining disks free to service other I/Os in parallel.
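
To see why, here is a toy sketch of how a striped array might map a logical request onto its member disks. The chunk size, disk count, and simple round-robin layout are simplifying assumptions, not how any particular controller works:

```python
# How a logical byte range maps onto the disks of a stripe set.
# Small chunks split one request across many disks (good for bandwidth);
# large chunks keep a small request on one disk (good for IOPS).
def disks_touched(offset: int, size: int, chunk_size: int, n_disks: int) -> set[int]:
    first_chunk = offset // chunk_size
    last_chunk = (offset + size - 1) // chunk_size
    return {chunk % n_disks for chunk in range(first_chunk, last_chunk + 1)}

# A 4 KB database read lands on one disk with 64 KB chunks...
print(disks_touched(offset=0, size=4096, chunk_size=65536, n_disks=4))     # {0}
# ...while a 1 MB video read spreads across all four disks.
print(disks_touched(offset=0, size=1 << 20, chunk_size=65536, n_disks=4))  # {0, 1, 2, 3}
```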

However, many databases use their own strategies to gather I/Os to minimize I/O overhead. In that case you need to know what the database is actually doing to choose the right chunk size.

The Storage Bits take

RAID systems are complex and their operation is sometimes counter-intuitive. You should take the time to understand what your workload is in order to configure a RAID system for good performance. Otherwise you could end up with an expensive and slow problem instead of a fast I/O solution.

Internals Of Database

Storage

Database tables and indexes are typically stored in memory or on hard disk in one of many forms: ordered or unordered flat files, ISAM, heaps, hash buckets, or B+ trees. Each has its own advantages and disadvantages; the most commonly used are B+ trees and ISAM.

Indexing

All of these storage structures can take advantage of indexing to increase their speed, and indexing technology has advanced tremendously since its early uses in the 1960s and 1970s. The most common kind of index is a sorted list of the contents of some particular table column, with pointers to the rows associated with each value. An index allows a set of table rows matching some criterion to be located quickly. Indexes are typically built using the same kinds of structures mentioned above (such as B-trees, hashes, and linked lists), and the database designer usually chooses a specific technique to increase efficiency for the particular type of index required.
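
As a toy illustration of that idea, here is a sorted-list index over one column of an in-memory table; the table, column, and values are made up:

```python
# A minimal picture of an index: a sorted list of (value, row_number)
# pairs, so matching rows can be found without scanning the whole table.
from bisect import bisect_left

rows = [("alice", 30), ("bob", 25), ("carol", 30)]   # toy table: (name, age)

# Build an index on the "age" column.
age_index = sorted((age, i) for i, (_, age) in enumerate(rows))

def lookup(value):
    """Return the rows whose indexed column equals value."""
    pos = bisect_left(age_index, (value, -1))
    matches = []
    while pos < len(age_index) and age_index[pos][0] == value:
        matches.append(rows[age_index[pos][1]])
        pos += 1
    return matches

print(lookup(30))   # [('alice', 30), ('carol', 30)]
```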

Relational DBMSs have the advantage that indexes can be created or dropped without changing the existing applications that make use of them: the database chooses among many different strategies based on which one it estimates will run the fastest. In other words, indexes are transparent to the application or end user querying the database; while they affect performance, any SQL command will run with or without indexes existing in the database.

Relational DBMSs utilize many different algorithms to compute the result of an SQL statement. The RDBMS will produce a plan of how to execute the query, which is generated by analyzing the run times of the different algorithms and selecting the quickest. Some of the key algorithms that deal with joins are Nested Loops Join, Sort-Merge Join and Hash Join. Which of these is chosen depends on whether an index exists, what type it is, and its cardinality.
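
As a sketch of one of these strategies, here is a hash join in miniature. The two relations are made up, and real engines add cost estimation, spilling to disk, and much more:

```python
# A minimal hash join: build a hash table on the smaller input, then
# probe it with each row of the larger input.
from collections import defaultdict

employees = [("alice", "NYC"), ("bob", "SFO"), ("carol", "NYC")]
locations = [("NYC", "New York"), ("SFO", "San Francisco")]

# Build phase: hash the smaller relation on the join key.
by_code = defaultdict(list)
for code, city in locations:
    by_code[code].append(city)

# Probe phase: stream the larger relation and emit matching pairs.
joined = [(name, city) for name, code in employees for city in by_code.get(code, [])]
print(joined)   # [('alice', 'New York'), ('bob', 'San Francisco'), ('carol', 'New York')]
```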

Transactions and concurrency

In addition to their data model, most practical databases ("transactional databases") attempt to enforce database transactions. Ideally, the database software should enforce the ACID rules, summarized here:

* Atomicity: Either all the tasks in a transaction must be done, or none of them. The transaction must be completed, or else it must be undone (rolled back).
* Consistency: Every transaction must preserve the integrity constraints — the declared consistency rules — of the database. It cannot place the data in a contradictory state.
* Isolation: Two simultaneous transactions cannot interfere with one another. Intermediate results within a transaction are not visible to other transactions.
* Durability: Completed transactions cannot be aborted later or their results discarded. They must persist through (for instance) restarts of the DBMS after crashes.

A cascading rollback occurs when one transaction (T1) fails and must be rolled back: other transactions that depend on T1's actions must then be rolled back as well, causing a cascading effect.

In practice, many DBMSs allow most of these rules to be selectively relaxed for better performance.
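
As a small demonstration of atomicity and rollback, here is a sketch using Python's built-in sqlite3 module; the accounts table and the simulated failure are contrived for illustration:

```python
# Atomicity in practice: either both halves of the transfer happen,
# or neither does.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        raise RuntimeError("simulated crash between the two updates")
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
except RuntimeError:
    pass

# The first update was rolled back: 'a' still has its original balance.
print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('a', 100), ('b', 0)]
```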

Concurrency control is a method used to ensure that transactions are executed in a safe manner and follow the ACID rules. The DBMS must be able to ensure that only serializable, recoverable schedules are allowed, and that no actions of committed transactions are lost while undoing aborted transactions.

Replication

Replication of databases is closely related to transactions. If a database can log its individual actions, it is possible to create a duplicate of the data in real time. The duplicate can be used to improve performance or availability of the whole database system. Common replication concepts include:

* Master/Slave Replication: All write requests are performed on the master and then replicated to the slaves.
* Quorum: The results of read and write requests are calculated by querying a "majority" of replicas (sketched below).
* Multimaster: Two or more replicas synchronize with each other via a transaction identifier.
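
Here is a toy sketch of the quorum idea. The replica set is just a list in memory, and the choice of which W replicas take the write and which R answer the read is hard-coded to make the overlap visible; real systems pick whichever replicas respond:

```python
# A toy quorum: with N replicas, writes go to W of them and reads ask R.
# If R + W > N, every read quorum overlaps every write quorum, so a read
# always sees at least one copy of the newest value.
N, W, R = 3, 2, 2
replicas = [{"version": 0, "value": None} for _ in range(N)]

def quorum_write(version: int, value) -> None:
    for rep in replicas[:W]:              # reach any W replicas
        rep["version"], rep["value"] = version, value

def quorum_read():
    polled = replicas[-R:]                # reach any R replicas
    return max(polled, key=lambda rep: rep["version"])["value"]

quorum_write(1, "hello")
print(quorum_read())   # 'hello' - the overlap replica holds the newest version
```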

Security

Database security comprises the systems, processes, and procedures that protect a database from unintended activity.

Applications of databases

Databases are used in many applications, spanning virtually the entire range of computer software. Databases are the preferred method of storage for large multiuser applications, where coordination between many users is needed. Even individual users find them convenient, and many electronic mail programs and personal organizers are based on standard database technology. Software database drivers are available for most database platforms so that application software can use a common application programming interface (API) to retrieve the information stored in a database. Two commonly used database APIs are JDBC and ODBC.

Monday, May 14, 2007

Database Models

Various techniques are used to model data structure. Most database systems are built around one particular data model, although it is increasingly common for products to offer support for more than one model. For any one logical model various physical implementations may be possible, and most products will offer the user some level of control in tuning the physical implementation, since the choices that are made have a significant effect on performance. An example of this is the relational model: all serious implementations of the relational model allow the creation of indexes which provide fast access to rows in a table if the values of certain columns are known.

Flat model

The flat (or table) model may not strictly qualify as a data model. It consists of a single, two-dimensional array of data elements, where all members of a given column are assumed to hold similar values, and all members of a row are assumed to be related to one another.

Hierarchical model

In a hierarchical model, data is organized into a tree-like structure, implying a single upward link in each record to describe the nesting, and a sort field to keep the records in a particular order in each same-level list.
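
A minimal sketch of such a record, with its single upward link and sort field; the record layout is invented for illustration:

```python
# A hierarchical record in miniature: one upward (parent) link per
# record, plus a sort field ordering it among its siblings.
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    sort_key: int
    parent: "Record | None" = None

root = Record("departments", 0)
eng = Record("engineering", 1, parent=root)
ops = Record("operations", 2, parent=root)

# Children of a node are found by following parent links backward,
# then ordering siblings by the sort field.
records = [ops, eng]
children = sorted((r for r in records if r.parent is root), key=lambda r: r.sort_key)
print([r.name for r in children])   # ['engineering', 'operations']
```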

Relational model

Three key terms are used extensively in relational database models: relations, attributes, and domains. A relation is a table with columns and rows. The named columns of the relation are called attributes, and the domain is the set of values the attributes are allowed to take.
The basic data structure of the relational model is the table, where information about a particular entity (say, an employee) is represented in columns and rows (also called tuples). Thus, the "relation" in "relational database" refers to the various tables in the database; a relation is a set of tuples. The columns enumerate the various attributes of the entity (the employee's name, address or phone number, for example), and a row is an actual instance of the entity (a specific employee) that is represented by the relation. As a result, each tuple of the employee table represents various attributes of a single employee.
All relations (and, thus, tables) in a relational database have to adhere to some basic rules to qualify as relations. First, the ordering of columns is immaterial in a table. Second, there can't be identical tuples or rows in a table. And third, each tuple will contain a single value for each of its attributes.
A relational database contains multiple tables, each similar to the one in the "flat" database model. One of the strengths of the relational model is that, in principle, any value occurring in two different records (belonging to the same table or to different tables) implies a relationship between those two records. Yet, in order to enforce explicit integrity constraints, relationships between records in tables can also be defined explicitly, by identifying or non-identifying parent-child relationships characterized by assigning cardinality (1:1, (0)1:M, M:M). Tables can also have a designated single attribute or a set of attributes that can act as a "key", which can be used to uniquely identify each tuple in the table.
A key that can be used to uniquely identify a row in a table is called a primary key. Keys are commonly used to join or combine data from two or more tables. For example, an Employee table may contain a column named Location which contains a value that matches the key of a Location table. Keys are also critical in the creation of indexes, which facilitate fast retrieval of data from large tables. Any column can be a key, or multiple columns can be grouped together into a compound key. It is not necessary to define all the keys in advance; a column can be used as a key even if it was not originally intended to be one.

Relational operations

Users (or programs) request data from a relational database by sending it a query that is written in a special language, usually a dialect of SQL. Although SQL was originally intended for end-users, it is much more common for SQL queries to be embedded into software that provides an easier user interface. Many web sites, such as Wikipedia, perform SQL queries when generating pages.
In response to a query, the database returns a result set, which is just a list of rows containing the answers. The simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer wanted. Often, data from multiple tables are combined into one, by doing a join. There are a number of relational operations in addition to join.
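
As a small end-to-end illustration, here is the Employee/Location join described earlier, run through Python's built-in sqlite3 module; the table contents are made up:

```python
# A query and its result set: the Employee table's Location column
# matches the key of the Location table, so a join combines them.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Location (Code TEXT PRIMARY KEY, City TEXT);
    INSERT INTO Location VALUES ('NYC', 'New York'), ('SFO', 'San Francisco');
    CREATE TABLE Employee (Name TEXT, Location TEXT REFERENCES Location(Code));
    INSERT INTO Employee VALUES ('alice', 'NYC'), ('bob', 'SFO');
""")

# The result set is just a list of rows answering the query.
result = con.execute("""
    SELECT Employee.Name, Location.City
    FROM Employee JOIN Location ON Employee.Location = Location.Code
""").fetchall()
print(result)   # [('alice', 'New York'), ('bob', 'San Francisco')]
```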

Normal Forms

Relations are classified based upon the types of anomalies to which they're vulnerable. A database that's in the first normal form is vulnerable to all types of anomalies, while a database that's in the domain/key normal form has no modification anomalies. Normal forms are hierarchical in nature. That is, the lowest level is the first normal form, and the database cannot meet the requirements for higher level normal forms without first having met all the requirements of the lesser normal forms.[1]
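
A toy before-and-after showing the kind of update anomaly that normalization removes; the data is invented for illustration:

```python
# Unnormalized: the city name is repeated for every employee, so one
# fact is stored many times and can fall out of sync.
flat = [
    {"name": "alice", "loc": "NYC", "city": "New York"},
    {"name": "bob",   "loc": "NYC", "city": "New York"},   # duplicated fact
]

# Normalized: the loc -> city fact lives in exactly one place.
employees = [{"name": "alice", "loc": "NYC"}, {"name": "bob", "loc": "NYC"}]
locations = {"NYC": "New York"}

# Renaming the city is now one update instead of one per employee.
locations["NYC"] = "New York City"
print([(e["name"], locations[e["loc"]]) for e in employees])
```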

Object database models

In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program. This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases.
A variety of ways have been tried for storing objects in a database. Some products have approached the problem from the application-programming end, by making the objects manipulated by the program persistent. This typically also requires the addition of some kind of query language, since conventional programming languages cannot find objects based on their information content. Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.

Post-relational database models

Several products have been described as post-relational because their data model incorporates relations but is not constrained by the Information Principle, which requires that all information be represented by data values in relations. Products using a post-relational data model typically employ a model that actually pre-dates the relational model; such a model might be described as a directed graph with trees on the nodes.
Examples of models that could be classified as post-relational are PICK (aka MultiValue) and MUMPS (aka M).

Saturday, May 12, 2007

History Of Database

The earliest known use of the term 'data base' was in July 1973, when the System Development Corporation sponsored a symposium under the title Development and Management of a Computer-centered Data Base. Database as a single word became common in Europe in the early 1970s and by the end of the decade it was being used in major American newspapers. (Databank, a comparable term, had been used in the Washington Post newspaper as early as 1966.)

The first database management systems were developed in the 1960s. A pioneer in the field was Charles Bachman. Bachman's early papers show that his aim was to make more effective use of the new direct-access storage devices becoming available: until then, data processing had been based on punched cards and magnetic tape, so serial processing was the dominant activity. Two key data models arose at this time: CODASYL developed the network model based on Bachman's ideas, and (apparently independently) the hierarchical model was used in a system developed by North American Rockwell, later adopted by IBM as the cornerstone of their IMS product. While IMS and the CODASYL IDMS were the big, high-visibility databases developed in the 1960s, several others were also born in that decade, some of which have a significant installed base today. Two worthy of mention are the PICK and MUMPS databases, the former developed originally as an operating system with an embedded database and the latter as a programming language and database for the development of data-based software.

The relational model was proposed by E. F. Codd in 1970. He criticized existing models for confusing the abstract description of information structure with descriptions of physical access mechanisms. For a long while, however, the relational model remained of academic interest only. While the CODASYL network-model products (such as IDMS) and the hierarchical IMS were conceived as practical engineering solutions taking account of the technology as it existed at the time, the relational model took a much more theoretical perspective, arguing (correctly) that hardware and software technology would catch up in time. Among the first implementations were Michael Stonebraker's Ingres at Berkeley and the System R project at IBM. Both of these were research prototypes, announced during 1976. The first commercial products, Oracle and DB2, did not appear until around 1980. The first successful database product for microcomputers was dBASE, for the CP/M and PC-DOS/MS-DOS operating systems.

During the 1980s, research activity focused on distributed database systems and database machines, but these developments had little effect on the market. Another important theoretical idea was the Functional Data Model, but apart from some specialized applications in genetics, molecular biology, and fraud investigation, the world took little notice.

In the 1990s, attention shifted to object-oriented databases. These had some success in fields where it was necessary to handle more complex data than relational systems could easily cope with, such as spatial databases, engineering data (including software engineering repositories), and multimedia data. Some of these ideas were adopted by the relational vendors, who integrated new features into their products as a result. The 1990s also saw the spread of Open Source databases, such as PostgreSQL and MySQL.

In the 2000s, the fashionable area for innovation is the XML database. As with object databases, this has spawned a new collection of startup companies, but at the same time the key ideas are being integrated into the established relational products. XML databases aim to remove the traditional divide between documents and data, allowing all of an organization's information resources to be held in one place, whether they are highly structured or not.

Define Database

In computing, a database can be defined as a structured collection of records or data that is stored in a computer so that a program can consult it to answer queries. The records retrieved in answer to queries become information that can be used to make decisions. The computer program used to manage and query a database is known as a database management system (DBMS). The properties and design of database systems are included in the study of information science.

The term "database" originated within the computing discipline. Although its meaning has been broadened by popular use, even to include non-electronic databases, this article is about computer databases. Database-like records have been in existence since well before the industrial revolution in the form of ledgers, sales receipts and other business related collections of data.

The central concept of a database is that of a collection of records, or pieces of knowledge. Typically, for a given database, there is a structural description of the type of facts held in that database: this description is known as a schema. The schema describes the objects that are represented in the database, and the relationships among them. There are a number of different ways of organizing a schema, that is, of modeling the database structure: these are known as database models (or data models). The model in most common use today is the relational model, which in layman's terms represents all information in the form of multiple related tables each consisting of rows and columns (the true definition uses mathematical terminology). This model represents relationships by the use of values common to more than one table. Other models such as the hierarchical model and the network model use a more explicit representation of relationships.

The term database refers to the collection of related records, and the software should be referred to as the database management system or DBMS. When the context is unambiguous, however, many database administrators and programmers use the term database to cover both meanings.

Many professionals consider a collection of data to constitute a database only if it has certain properties: for example, if the data is managed to ensure its integrity and quality, if it allows shared access by a community of users, if it has a schema, or if it supports a query language. However, there is no agreed definition of these properties.

Database management systems are usually categorized according to the data model that they support: relational, object-relational, network, and so on. The data model will tend to determine the query languages that are available to access the database. A great deal of the internal engineering of a DBMS, however, is independent of the data model, and is concerned with managing factors such as performance, concurrency, integrity, and recovery from hardware failures. In these areas there are large differences between products.