
Data Operability: An aspect of data quality: the degree to which a data record can be used directly, without additional processing (restructuring, conversion, etc.).

Data Precision: An aspect of numerical data quality: the maximum error between a real parameter and the value given for it by the data, caused by the discretization of data values. Data precision is inversely proportional to this error.

Data Quality: A set of data properties (features, parameters, etc.) describing their ability to satisfy the user’s expectations or requirements concerning the use of the data for acquiring information in a given area of interest, learning, decision making, etc.

Data Relevance: An aspect of data quality: the level of consistency between the data content and the user’s area of interest.

Data Structure: A formal description of a composite data item indicating the order, contents, lengths, and lists of attribute values of its fields.

Simple Data: Data consisting of its identifier and a field describing a single attribute (parameter, property, etc.) of an object.


Data Warehouses

Antonio Badia

University of Louisville, USA

INTRODUCTION

Data warehouses (DW) first appeared in industry in the mid 1980s. When their impact on businesses and database practices became clear, a flurry of research took place in academia in the late 1980s and 1990s. However, the concept of a DW still remains rooted in its practical origins. This entry describes the basic concepts behind a DW while keeping the discussion at an intuitive level. The entry is meant as an overview to complement more focused and detailed entries, and it assumes only familiarity with the relational data model and relational databases.

BACKGROUND

Databases in the 1970s and 1980s were used mainly to maintain data for everyday operations. A typical example is a bank database holding information about accounts that is used by a network of ATMs. This is called operational data, and the role of the database is mainly to support transactions. Transactions are operations that access and change the data in the database; in the example of the bank, an ATM may access the database to check whether a client has enough money in her/his account, and to change the balance if a withdrawal or deposit takes place. While transactions usually affect only small parts of the database, databases have to handle a large number of transactions efficiently, often concurrently. Thus, the database architecture is geared towards efficient support of small, localized accesses. This is called on-line transaction processing, or OLTP. However, in the mid 1980s, emphasis shifted towards comprehensive analysis of current and historical data in order to understand business patterns. This analysis is high level, involves many related low-level items, and has as its goal the support of decision making. Hence, this kind of analysis is called decision support (DS), but the term on-line analytical processing, or OLAP, is also used to emphasize the differences with OLTP (Kimball & Strehlo, 1995). The typical DS task involves summarizing large amounts of low-level data and relating different aspects of the business to find interesting correlations, and therefore database access is based on complex queries. Because of the business intelligence it provides, OLAP grew rapidly and nowadays is a necessary tool for most medium and large size enterprises.

However, the analysis that DS relies upon is complicated by a historical factor. In many organizations, especially those of a certain size and complexity, data was stored in several separate data sources, because each unit within the organization developed its own database to support its informational needs. Decision support requires that all the data in the different databases be put together to yield a global picture of the organization. As an example, imagine a hospital where the surgery unit has developed a database of patients undergoing surgery (with information on the type of surgery), the pharmacy has developed a database of patients that are administered some medication (with information on the dosage, time, etc.), and accounting has developed a database of patients in order to bill them. If the hospital’s management asks for an analysis that correlates patients’ socioeconomic status with type of surgery and amount of medication prescribed, the information needed is dispersed throughout all three databases. If patients are identified by social security number in one database, by patient ID in another, and by a combination of name and date of birth in the third, putting the information together may be complicated. Thus, many companies decided to centralize and consolidate all their data in one central repository: a DW. In the following, we describe the main characteristics of a DW, its design, and its implementation.

DATA WAREHOUSING

Data Warehouse Design: The ETL Process

A DW contains a copy of the data from other databases, which are called the data sources of the warehouse. In order to get the data from those data sources (usually OLTP databases) and allow them to continue working normally, the data from the sources is copied into the DW at regular, preestablished intervals, to refresh the DW. Because there are multiple data sources, it is necessary to watch out for redundant (duplicate) data, missing data, or heterogeneous data. This is done during the extraction, transformation, and loading (ETL) process (Inmon, 2002; Kimball & Ross, 2002).

The ETL process involves extraction, transformation, integration, cleansing, loading and computation of additional data (different authors give somewhat different phases, or name them differently). Extraction refers to the act of capturing the data from the sources and physically sending it to the DW. Usually, standard external interfaces supported by the databases (called gateways) are used. To make the process efficient, it is desirable to keep track of which data is copied in the DW on a given extraction, so that only new data is copied on the next extraction.
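For illustration, here is a minimal SQL sketch of such incremental bookkeeping; the table and column names (Sales_OLTP, Staging_Sales, Extract_Log, last_modified) are hypothetical assumptions, not from this entry:

    -- Copy only the rows changed since the previous extraction run.
    INSERT INTO Staging_Sales
    SELECT s.*
    FROM   Sales_OLTP s
    WHERE  s.last_modified > (SELECT last_extract_time FROM Extract_Log);

    -- Record the new high-water mark for the next extraction.
    UPDATE Extract_Log
    SET    last_extract_time = CURRENT_TIMESTAMP;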

Different databases (especially historical ones, or legacy systems) may contain data represented in different formats. Hence the transformation step, which involves converting data to a uniform format and removing, adding, or reordering attributes where necessary (e.g., adding a key or a timestamp).

Because different databases were created by different people for different purposes, it is likely that they contain related (or overlapping) information stored in different ways. When the data is integrated, there may be problems of homonyms (i.e., the same name for different concepts), synonyms (i.e., two names for the same concept), and many other semantic mismatches. Thus, another step in extraction is to integrate the data, that is, to merge and match data to ensure that data about the same entity is integrated and data about different entities is separated (our previous hospital example provides an instance of such semantic mismatches). A separate step is data cleansing, which involves making sure that the data is as devoid of noise as possible; typos, data entry errors, missing data are dealt with at this stage.
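To make the integration step concrete, the following sketch merges patient records from the earlier hospital example. All table and column names (Surgery_Patients, Pharmacy_Patients, Patient_Id_Map, DW_Patient) are hypothetical assumptions; the mapping table that reconciles the different identifier schemes would itself have to be built as part of integration:

    -- Merge data about the same patient, identified differently in each
    -- source, under one warehouse key via a reconciliation mapping table.
    INSERT INTO DW_Patient (patient_key, ssn, name, birth_date)
    SELECT m.patient_key, s.ssn, s.name, p.birth_date
    FROM   Patient_Id_Map m
    JOIN   Surgery_Patients  s ON s.ssn        = m.ssn
    JOIN   Pharmacy_Patients p ON p.patient_id = m.pharmacy_id;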

Finally, the data must be loaded into the DW. This is a complicated task because, in normal operating mode, the DW is queried and the data is not changed to maximize performance answering queries. Thus, the data in the DW is considered static and left to be out of sync for the time between refreshes. Then, a large amount of data must be loaded in the DW at once to refresh it (batch load); this is called DW refreshing. The window of opportunity depends on DW usage. At loading time, other activities that will help improve query processing must be carried out (e.g., sorting, summarizing, indexing, partitioning, checking integrity constraints). As a result, loading the DW is a complex process in itself. Because of large volumes and shrinking time windows, most commercial utilities use incremental approaches, in which old results are reused and only truly new data is added (Jarke, Lenzerini, Vassiliou, & Vassiliadis, 2000).
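The entry does not prescribe a particular loading mechanism; one common idiom for incremental refreshing, sketched here under the same hypothetical table names, is the standard SQL MERGE (upsert) statement:

    -- Upsert: update facts already in the DW, insert only the truly new ones.
    MERGE INTO DW_Sales d
    USING Staging_Sales s
    ON (d.sale_id = s.sale_id)
    WHEN MATCHED THEN
      UPDATE SET d.quantity = s.quantity, d.amount = s.amount
    WHEN NOT MATCHED THEN
      INSERT (sale_id, quantity, amount)
      VALUES (s.sale_id, s.quantity, s.amount);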

To manage the DW, a metadata repository is created. This is a system catalog that contains metadata (i.e., information about the data in the DW, including its origin, format, intended meaning, range, and source). The catalog is usually large and complex, because it is used to support all steps in the ETL process. Thus, it is not surprising that the metadata repository is sometimes kept in a database itself, and that building it is one of the most difficult and important steps in the process of building a DW.

Currently, ETL is often done off-line, by hand, and by replicating almost everything. There are tools to help with this task: data migration tools allow simple transformation rules to be applied to the data (“replace string gender with string sex”); data scrubbing tools use domain-specific knowledge to do the cleansing (“ages should be between 18 and 65”). They often use parsing and fuzzy matching techniques to add some intelligence to an otherwise complex and laborious process. The tools offered to handle ETL vary enormously from system to system (IBM, 2004; Oracle, 2004).
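The two quoted rules could be expressed directly in SQL. The following is only a sketch over a hypothetical staging table Staging_Employee; column-rename syntax varies across systems:

    -- Migration rule: "replace string gender with string sex."
    ALTER TABLE Staging_Employee RENAME COLUMN gender TO sex;

    -- Scrubbing rule: "ages should be between 18 and 65";
    -- flag violating rows for review instead of silently dropping them.
    SELECT *
    FROM   Staging_Employee
    WHERE  age NOT BETWEEN 18 AND 65;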

Data Warehouse Design: Logical and Physical Models

The basic DW design is influenced not by theories of normalization, as in regular databases, but by the fact that the DW goal is to provide information about a business model.

The data in a DW is analyzed in terms of facts and dimensions. A fact is a basic business datum, contains raw data, and is irreducible (e.g., point-of-sale information: who bought what, when, where, and at what price is the basic fact in a retail enterprise). A dimension is an attribute of the fact (e.g., Client, Product, Time, and Store are dimensions of the point-of-sale fact). Usually, there are several dimensions to every fact. Each dimension can in turn have a set of associated attributes. For instance, Store may have attributes Name, City, and State associated with it. It is also typical that the values of a dimension can be organized in a hierarchy. The typical example is the dimension Time, which can be organized into Date, Week, Month, Quarter, and Year. The dimension Store from the previous example can also be organized in a geographical hierarchy through City and State. Finally, besides dimensions, a fact contains measures, usually numerical values indicating some properties of the fact. For instance, in our point-of-sale example, Price, Quantity, and Amount may be three measures. Obviously, these definitions are informal; what is a fact and what is a dimension depends on the DW subject and on the business analysis that the DW is to support (hence the usual statement that DW design is subject oriented).

The basic data model choices are relational OLAP (ROLAP) and multidimensional OLAP (MOLAP).


ROLAP uses relational technology to support the DW design. In ROLAP, information about the dimension values is maintained in the dimension tables. Usually, there is one dimension table per dimension. In our example, Store, Client, and Product may have one table each. Information about facts is organized in a table, called the fact table. Each row contains one fact, which is represented by references to the dimensions (technically, one foreign key for each dimension table) and by measures. Thus, rows in a fact table tend to be “narrow” and contain a mix of foreign keys and raw numeric data. This table is typically very large since there are many particular facts to be stored. In the example, a large chain of stores will have millions of point-of-sale transactions over a period of several months; the longer the period, the more sales facts. Also, the fact table tends to grow quickly (there will be many daily sales that must be added to this table). On the other hand, dimension tables tend to be much smaller (orders of magnitude) than the fact table, and tend to change at a much slower rate. Clearly, there is going to be one row per store in the table Store, and that is not going to add up to millions of rows, as in the fact table. Also, a new row will not be added unless a new store is opened.

When all the information about a dimension is stored in one table, the resulting schema is called a star schema (picture the large fact table in the middle and the smaller dimension tables around it). In some cases, this design leads to a high level of redundancy because dimension tables may be denormalized (i.e., not in a normal form as they would be in the usual relational database design; Date, 2000). The dimension tables can be normalized by splitting the information in them in several tables; the resulting design is called a snowflake schema (as before, picture a large fact table in the middle surrounded by dimension tables. But now, each dimension table in turn may be surrounded by a number of smaller tables). Because the fact table is large to start with, it is usually normalized, to avoid redundancy that would increase its size even more. Technically, each dimension table has a primary key, which is also included in the fact table as a foreign key. The combination of all foreign keys (one per dimension) becomes the primary key of the fact table.
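As a concrete sketch, a star schema for the running point-of-sale example might be declared as follows (all table and column names are illustrative assumptions; the Client dimension is omitted for brevity):

    CREATE TABLE Store (
      store_id INTEGER PRIMARY KEY,
      name     VARCHAR(40),
      city     VARCHAR(40),   -- City and State support the
      state    VARCHAR(40)    -- geographical hierarchy
    );

    CREATE TABLE Product (
      product_id INTEGER PRIMARY KEY,
      name       VARCHAR(40),
      category   VARCHAR(40)
    );

    CREATE TABLE Time_Dim (
      time_id    INTEGER PRIMARY KEY,
      full_date  DATE,
      week_no    INTEGER,
      month_no   INTEGER,
      quarter_no INTEGER,
      year_no    INTEGER
    );

    -- Narrow fact rows: one foreign key per dimension plus numeric measures;
    -- the combination of all foreign keys forms the primary key.
    CREATE TABLE Sales (
      store_id   INTEGER REFERENCES Store,
      product_id INTEGER REFERENCES Product,
      time_id    INTEGER REFERENCES Time_Dim,
      price      DECIMAL(10,2),
      quantity   INTEGER,
      amount     DECIMAL(10,2),
      PRIMARY KEY (store_id, product_id, time_id)
    );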

The dimension table stores hierarchies by using the elements of the hierarchy as attributes; for instance, the table Time may have attributes Date, Week, Month, Quarter, and Year in every row. A combination of such values tells us where in the hierarchy we are.

Besides these tables, the DW may contain summarized or aggregate tables. These are tables that contain summarized information (i.e., facts that have been grouped by some of the dimensions). For instance, we may have a table that summarizes sales by store, by city, or by state.
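Continuing the sketch above, such a summary table can be precomputed from the fact table at load time (CREATE TABLE AS is widely, though not universally, supported):

    -- Aggregate table: total sales by store, precomputed from the fact table.
    CREATE TABLE Sales_By_Store AS
    SELECT store_id,
           SUM(quantity) AS total_quantity,
           SUM(amount)   AS total_amount
    FROM   Sales
    GROUP  BY store_id;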

In MOLAP, facts with n dimensions are organized in an n-dimensional cube. For instance, the dimensions of a sale measure may be store, product, and time; thus, the information can be organized in a three-dimensional cube. All the dimensions together are assumed to uniquely determine the measure (a particular store, product, and time give us a unique sale). Thus, store i, product j, and time k give a cell [i][j][k] that has as its content the measure(s) for that sale. The measures themselves are usually numerical (e.g., amount of sale). Clearly, each cell in the cube corresponds to a row in a fact table. As before, dimensions are described by a set of attributes (store may be described by store ID, manager, city, state), and some attributes may be organized in hierarchies (city and state form a geographical hierarchy). The cube is stored as an n-dimensional array in the computer. This representation has a great problem with sparseness, the fact that most combinations of dimension values do not have an associated measure (i.e., not all products are sold at all stores at all times); several compression schemes are used in this model to avoid having to store large, mostly empty arrays. This model is influenced by the success of spreadsheet programs in business analysis. However, most DWs nowadays use the ROLAP approach, since it allows leveraging all the know-how and software already existing in relational database systems.

The DW design process is a complex one. Different authors use different labels and recommend different steps to accomplish DW design (Inmon, 2002; Kimball & Ross, 2002); we show here a generic framework that encompasses all necessary steps:

1. Define a common business model: Agree on subjects (clients, products, etc.), their attributes and properties (time granularity, needed information, etc.), relationships among them, and definitions of basic terms (Can a client also be a provider? What is a month: 4 weeks or a calendar month?). Every stakeholder has to agree on this model, which makes this step a vast and complex effort.

2. Logical design: Choose central facts and dimensions. For each fact, identify measures. For each dimension, identify attributes and granularity.

3. ETL design: Find source(s) for each datum (e.g., attributes, measures). Design and implement the ETL process. Design and populate the metadata repository.

4. Physical design: Choose auxiliary data to store (e.g., aggregates to precompute, indices needed, data placement, partitioning).


5. Define front-ends for different users, giving them direct (SQL) access or other interfaces.

6. Optional: Design one or more data marts. A data mart is a database containing a subset of the DW. Usually created with a strong business focus, it yields a database smaller (and more manageable) than the DW for a group of users that have concrete, narrow business needs. Note that the data mart depends on the DW for its contents, just as the DW depends on several data sources. Data marts are usually refreshed whenever the DW is refreshed; therefore, the ETL process has to take into account any existing data marts.

Querying the Data Warehouse

The key idea of a DW is to collect data in advance of queries. Thus, DWs typically contain consolidated data from many sources. Also, the data in the DW usually spans long periods of time (historic data is needed to analyze trends over time). Moreover, the data is time sensitive, since the DW keeps the history of many data items (sets of snapshots), which are updated periodically. Hence, most data in a DW is time-stamped. Also, DWs tend to keep raw (low-level) data because different users require different levels of analysis and information granularity. Clearly, low-level data can be summarized to produce high-level analysis, but a high-level summary cannot produce low-level data.

All these factors together explain why DWs are one or several orders of magnitude larger than traditional OLTP databases. Although OLTP databases exist in the megabyte to (a few) gigabyte range, DWs usually live in the terabyte range. Because queries to the DW are posed by upper management, which needs the answers to make decisions in a timely manner, and because the queries involve complex analysis of large amounts of data, answering those queries very quickly is both a necessity and a challenge. Hence, a great deal of the implementation of a DW is geared towards this task. For lack of space, we will not discuss implementation techniques here; we will simply point out that special performance techniques, beyond what is usually available in a database management system, are necessary in a DW environment (Shasha & Bonnet, 2002).

Normally, one queries the DW by asking for values of measures for some specific values of dimensions, for instance, sales by store by month, or sales of clothing articles by state. Some of the more common operations on a DW follow:

Aggregating: Summarizing a measure over one or more dimensions, that is, finding an aggregate over a value of a dimension (e.g., total sales per week or per state). When we aggregate over the hierarchy of a dimension, the operation is called roll-up. For instance, given the Store dimension and the total sales per store, asking for the total sales per state is a roll-up. The inverse is drill-down: Given summarized information, ask for more detail. Thus, given total sales by state, asking for total sales by store is a drill-down. (See the SQL sketch after this list.)

Pivoting: Taking a tabular representation of the data and turning it into an n-dimensional chart (spreadsheet) where the labels on the axes are the values of the original table. For instance, the fact table Sales could be pivoted by Time and Store to give a two-dimensional table, with Time on one axis and Store on the other, and the information for sales in the cells (i.e., each cell contains the sales of a certain store in a certain period). As in a spreadsheet, we would have partial and total sums. The result of pivoting is called cross-tabulation. Pivoting can be combined with other operations.

Slicing and Dicing: Slicing is doing a selection (with an equality condition) on one or more dimensions, and dicing is doing a selection (with a range condition) on one or more dimensions. The terminology comes from the MOLAP model, since selecting a single value of a dimension is like taking a slice of the cube, and selecting a range generates a smaller cube that is part of the original one.

Time Analysis: The time dimension is especially important in DW. Because trends develop over time, time series, progressions, histories, and so forth are very common and important types of analysis.
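Using the illustrative star schema sketched earlier, the roll-up, drill-down, and slicing operations above correspond to SQL queries along these lines:

    -- Drill-down level: total sales per store.
    SELECT st.store_id, SUM(sa.amount) AS total
    FROM   Sales sa JOIN Store st ON sa.store_id = st.store_id
    GROUP  BY st.store_id;

    -- Roll-up: the same measure one level up the Store hierarchy (per state).
    SELECT st.state, SUM(sa.amount) AS total
    FROM   Sales sa JOIN Store st ON sa.store_id = st.store_id
    GROUP  BY st.state;

    -- Slice: an equality condition fixing one dimension value.
    SELECT SUM(sa.amount) AS total
    FROM   Sales sa JOIN Store st ON sa.store_id = st.store_id
    WHERE  st.state = 'Kentucky';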

The types of queries used in DS environments have highlighted weaknesses of SQL. Therefore, several SQL extensions have been proposed. We highlight some important extensions that have become part of the latest (SQL3) standard (Melton & Simon, 2002).

Extended aggregate functions: Rank, percentile, cumulative total, and median are among the new aggregate functions.

Cubes: Each roll-up corresponds to a single SQL query with grouping, and rolling up n dimensions can be done with any subset of them, so there are 2^n possibilities and hence 2^n possible SQL queries. Computing these queries separately is wasteful, since they have a lot in common. A proposed extension to SQL is the Cube (or Data Cube) operator, which is equivalent to a collection of GROUP BYs. A cube is a summary of data by dimensions; for instance, a cube over attributes (Product, Year, City) would generate aggregates over (Product, Year, City), (Product, Year), (Product, City), (Year, City), (Product), (Year), and (City), plus the grand total. The ROLL-UP operator is a specialization of this: the same cube as before would generate aggregates over (Product, Year, City), (Product, Year), and (Product) only. (Both operators appear in the SQL sketch after the next item.)

Window operator: Many interesting business questions arise from the comparison of several groups of items. For instance, a moving average computes averages for different but overlapping sets of numbers. In traditional SQL, the groups of numbers would be formed by using the GROUP BY operator, and could only be non-overlapping and static. The window operator allows the application of aggregates over related, dynamic groups of numbers.
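The following sketch illustrates both extensions over the same hypothetical schema; GROUP BY CUBE/ROLLUP and the window clause are standardized, but syntax and support vary across systems:

    -- One CUBE query replaces the 2^n separate GROUP BY queries.
    SELECT st.state, t.year_no, SUM(sa.amount) AS total
    FROM   Sales sa
    JOIN   Store    st ON sa.store_id = st.store_id
    JOIN   Time_Dim t  ON sa.time_id  = t.time_id
    GROUP  BY CUBE (st.state, t.year_no);
    -- GROUP BY ROLLUP (st.state, t.year_no) would keep only the prefixes:
    -- (state, year), (state), and the grand total.

    -- Window operator: a three-month moving average over overlapping,
    -- dynamic groups, which plain GROUP BY cannot express.
    SELECT month_no,
           AVG(monthly_total) OVER (ORDER BY month_no
                                    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
             AS moving_avg
    FROM  (SELECT t.month_no, SUM(sa.amount) AS monthly_total
           FROM   Sales sa JOIN Time_Dim t ON sa.time_id = t.time_id
           GROUP  BY t.month_no) AS m;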

FUTURE TRENDS

There are currently some indications of where DWs will be headed in the next few years. The main ones can be summarized as follows:

Even larger sizes: As the ability to collect and store information in digital form grows (think of RFID tags, for instance), the size of warehouses will continue to grow. This means we will see more research on query optimization issues (Theodoratos, 2003), on parallel architectures, and on logical and conceptual models that connect the DW design more tightly to its business goals (Weir, Peng, & Kerridge, 2003), in order to keep performance up with size.

Incorporation of multimedia data: Just as regular databases have recently started to incorporate data beyond alphanumeric formats (e.g., audio, video, free text), there is a push for DWs to deal with these same sources of information.

Connection with the Web: As Web mining is seen by business organizations as an opportunity to gather information for competitive advantage, warehouses will be used to contain and analyze the enormous amount of information created by e-commerce and other Web-based forms of commerce. New issues already being investigated include extending the scope of DWs to deal with XML data (Facca & Lanzi, 2003; Vrdoljak, Banek, & Rizzi, 2003).

Connection with data mining: The DW is already data mining’s favourite application (because it gathers enough information to allow for discovery of patterns and trends), and tighter integration is a goal of current research (Tsur et al., 1998; Wang, Yang, & Yu, 2002).

CONCLUSION

Data warehouses are repositories of information that maintain a centralized copy of the contents of several enterprise databases. They make a global, in-depth analysis of the enterprise possible and are therefore of great value to business intelligence. However, DWs are difficult to design and implement. In a DW environment, it is necessary to decide in advance what to store; the integration problems may be very difficult to overcome, and the architecture does not respond well to changes. Due to their strategic value, DWs will continue to be an asset to medium and large size companies in the foreseeable future. Research in DWs certainly continues to attract strong interest, with dedicated conferences like DAWAK (Kambayashi, Mohania, & Wöß, 2003), DOLAP (DOLAP, 2003), and DMDW (Lenz, Vassiliadis, Jeusfeld, & Staudt, 2003). As stated in the previous section, new areas of research will expand the DW scope and will make them, if anything, even more important.

REFERENCES

Bouguettaya, A., Benatallah, B., & Elmagarmid, A. (1998). Interconnecting heterogeneous information systems. Reading, MA: Kluwer Academic Publishers.

Date, C. (2000). An introduction to database systems (7th Ed.). Reading, MA: Addison-Wesley.

DOLAP. (2003). ACM 6th International Workshop on Data Warehousing and OLAP (DOLAP). New Orleans, LA: ACM Press.

Facca, F. M., & Lanzi, P. L. (2003). Recent developments in Web usage mining research. DAWAK.

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). DataCube: A relational aggregation operator generalizing group by, cross-tab, and sub-totals. In Proceedings of the 12th IEEE ICDE. New Orleans, LA: IEEE Computer Society.

IBM. (2004, June). Data warehousing center administration guide [Computer software manual]. Retrieved from http://www-306.ibm.com/software/data/db2/datawarehouse/

Inmon, W. H. (2002). Building the data warehouse (3rd ed.). New York: Wiley & Sons.


Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (2000). Fundamentals of data warehouses. Springer-Verlag.

Kambayashi, Y., Mohania, M., & Wöß, W. (Eds.). (2003). Data warehousing and knowledge discovery: Proceedings of the 5th International Conference (DAWAK), Prague, Czech Republic.

Kimball, R., & Ross, M. (2002). The data warehouse toolkit (2nd Ed.). New York: Wiley & Sons.

Kimball, R., & Strehlo, K. (1995). Why decision support fails and how to fix it. SIGMOD Record, 24(3), 92-97.

Lenz, H.-J., Vassiliadis, P., Jeusfeld, M., & Staudt, M. (Eds.). (2003). Design and management of data warehouses: Proceedings of the 5th International Workshop (DMDW), Berlin, Germany.

Melton, J., & Simon, A. R. (2002). SQL:1999: Understanding relational language components. San Francisco: Morgan Kaufmann.

Oracle. (2004, June). Oracle9i Warehouse Builder Release 9.2 [Computer software manual]. Retrieved from http://otn.oracle.com/documentation/warehouse.html

Shasha, D., & Bonnet, P. (2002). Database tuning: Principles, experiments and troubleshooting techniques. San Francisco: Morgan Kaufmann.

Theodoratos, D. (2003). Exploiting hierarchical clustering in evaluating multidimensional aggregation queries. DOLAP.

Tsur, S., Ullman, J. D., Abiteboul, S., Clifton, C., Motwani, R., Nestorov, S., et al. (1998). Query flocks: A generalization of association-rule mining. In Proceedings of the ACM SIGMOD Conference. Seattle, WA: ACM Press.

Vrdoljak, B., Banek, M., & Rizzi, S. (2003). Designing Web warehouses from XML schemas. In Kambayashi et al. (Eds.).

Wang, H., Wang, W., Yang, J., & Yu, P. S. (2002). Clustering by pattern similarity in large data sets. Proceedings of the ACM SIGMOD Conference. Madison, WI: ACM Press.

Weir, R., Peng, T., & Kerridge, J. (2003). Best practice for implementing a data warehouse: A review for strategic alignment. DMDW.

Hümmer, W., Bauer, A., & Harde, G. (2003). XCube: XML for data warehouses. DOLAP.


KEY TERMS

Dimension: A property of a fact that specifies or explains one aspect of said fact. Usually a dimension has information associated with it.

ETL: Extraction-transformation-load is a process by which the DW is (re)populated from data in the data sources. During the process, relevant data is extracted from the source, adequately processed for integration and loaded into the DW.

Fact: Basic, irreducible data item that is stored in the DW. It represents the basic unit of business analysis and therefore must represent the basic activity of the enterprise under consideration.

MOLAP: Multidimensional OLAP is a method of implementing a DW that relies on a multidimensional view of data. Information is seen as a cube in n dimensions; tailored structures are used to store it, and tailored query languages are used to access it.

OLAP: On-line analytical processing is an analysis of data characterized by complex queries that aim at uncovering important patterns and trends in the data in order to answer business questions.

ROLAP: Relational OLAP; method of implementing a DW that relies on relational technology. In ROLAP, the warehouse is implemented as a relational database, using tables to deposit the information and relational technology (SQL) to access it.

Snowflake Schema: Star schema in which the dimension tables have been normalized, possibly leading to the creation of several tables per dimension.

Star Schema: Relational database schema that results from ROLAP; facts and dimensions are stored in tables. The fact table is considered the core of the resulting database, connected to each and every dimension table. When visualized with the fact table in the center and the dimension tables surrounding it, they create the figure after which the schema is named.


 


Data Warehousing and OLAP

 

 

 


 

 

 

 

 

Jose Hernandez-Orallo

Technical University of Valencia, Spain

INTRODUCTION

Information systems provide organizations with the necessary information to achieve their goals. Relevant information is gathered and stored to allow decision makers to obtain quick and elaborate reports from the data.

A data warehouse is a specially designed database that allows large amounts of historical and contextual information to be stored and accessed through complex, analytical, highly aggregated, yet efficient queries. These queries capture strategic information, possibly presented in the form of reports, and support management decision making. As we will see, data warehouses differ from general, transactional databases in many ways.

BACKGROUND

The original motivation for building an information system was to integrate the information necessary to support decision making. With the advent of databases and transactional applications, the main goal of information systems drifted towards organizing data in such a way that software applications could work. This drift triggered a specialization of database technology to serve transactional information systems, evolving database design and implementation, as well as database management systems (DBMSs), towards this end. The kind of work performed on these databases was called on-line transactional processing (OLTP).

However, the need for decision-making support did not fade away. On the contrary, more and more organizations required further analytical tools to support decision making: reporting tools, EIS (executive information systems), and other DSS (decision support systems) tools, which many DBMS vendors began to call “business intelligence.” These tools were able to exploit the database in a more powerful way than traditional query tools: they made it easier to aggregate information from many different tables, to summarize data, and to construct textual and graphical reports full of condensed, graphical, and statistical information. This other kind of work performed on the database was called on-line analytical processing (OLAP).

Although this scenario, with both kinds of work on the same physical database, can still be found in many organizations all over the world, there are three important problems associated with the approach.

First, and most important, is the use of the same database for both OLTP and OLAP. OLAP queries are usually complex, considering much historical data and interweaving many tables with several levels of aggregation. Many OLAP queries are “killer” queries that require many resources, and they may severely disturb or even halt the transactional work. Because of this, many reports and complex queries were run at night or during the weekends.

Second, the data stored in the transactional database are only the data required by the applications. Data that are not used by any application are not stored. Additionally, many transactional databases remove, or simply do not include, historical data. On the other hand, adding all the historical data and other sources of information for OLAP work to the same transactional database would impose significant storage overhead on the transactional work.

Third, the design and organization of the database are specialized for OLTP: Normalization is common, indexes are created to improve transactions, and so forth. However, these choices might not be good for OLAP operations. This means that even with a separate, dedicated database (i.e., a replica), OLAP will not be efficient for large databases.

The previous problems stimulated the construction of separate data repositories, specialized for analytical purposes. In the early nineties, these repositories were called data warehouses and their associated technology data warehousing (Inmon, 1992). Attention widened from enterprises and vendors, but it was not until the late 1990s that the academic world took notice. All this, together with the appearance of more mature tools, turned data warehousing into a new database discipline in its own right.


DATA WAREHOUSES

The establishment of data warehousing as a new discipline has much to do with the fact that, once one database (the transactional) has been clearly separated from the other (the historical/analytical), they are quite different kinds of “beasts.” Table 1 shows the strong differences between them. The special characteristics of the data and operations in the data warehouse case have led to a specialization of data warehousing technology, establishing new design paradigms and adopting other data models, operators, and implementation tools.

Having seen the advantages of constructing a separate data repository, the first question is to determine the data that will be included in the repository. This will surely depend on the analysis of the data requirements for the business intelligence applications that will use the data warehouse, such as the OLAP tools, the DSS systems, and, possibly, data mining. Frequently, this means that part of the information existing in the transactional databases (but not all and not at the same level of detail) has to be loaded into the data warehouse, as well as any external data that could be useful for the analytical processing. Figure 1 illustrates this integration.

Integrating data from several sources, with different formats, data models, and metadata, is not an easy task. Integration from different sources always has a negative effect on data quality: missing data are created because some features exist in some sources but not in others, and formats and units may be incompatible. It is extremely important that data are transformed into unified formats and schemas, and cleansed of missing and anomalous data. We will review these issues, but next we must address another crucial issue: the way in which information is organized in a data warehouse.

Table 1. Differences between transactional databases and data warehouses

                          Transactional Database                Data Warehouse
  Purpose                 Daily operations; support to the      Information retrieval, reports,
                          software applications.                data analysis.
  Data characteristics    Data about the organization's inner   Historical data, internal and
                          working; changing, internal,          external data, descriptive data.
                          incomplete data.
  Data models             Relational, object-relational,        Multidimensional, snowflake,
                          normalized.                           partially denormalized.
  Users                   Hundreds/thousands: applications,     Dozens: managers, executives,
                          operators, administrator, ...         analysts (farmers and explorers).
  Access                  SQL. Read and write.                  SQL and specific OLAP operators
                                                                (slice & dice, roll, pivot, ...).
                                                                Read-only.

[Figure 1. Data warehouses integrate data from internal and external sources: internal sources (e.g., the transactional database, with entities such as SALE, PRODUCT, MEETING, TRAINING, PROTOTYPE) and external sources (e.g., demographic and climate data, with entities such as COUNTRY and CATEGORY) supply the needed information to the data warehouse.]



Data Warehouse Organization: The Multidimensional Model

One idea for constructing a data warehouse would be to apply a traditional design, using an entity-relationship or UML-like conceptual model, and to transform it into a relational or object-relational database, integrating all the sources of information into one highly normalized, off-the-shelf database. However, this has been shown to be a bad way to organize data in a data warehouse. The reason is that classical data models, and especially the relational model, are well suited for transactional work but do not deal well with complex queries involving a high degree of aggregation.

As a consequence, the most widespread model for data warehouses is a different data model: the multidimensional model (Golfarelli, Maio, & Rizzi, 1998). Data under the multidimensional model are organized around facts, which have related measures and can be seen in more or less detail according to certain dimensions. As an example, consider a multinational European supermarket chain. Its basic facts are the sales, which can have several associated measures (total income amount, quantity, number of customers, etc.) and can be detailed along many dimensions: the time of purchase, the product sold, the place of the sale, and so forth. It is enlightening to see that measures usually correspond to the questions how much? and how many?, whereas the dimensions correspond to the questions when?, what?, and where?

The interesting thing about the dimensional model is that it eases the construction of queries about facts at several levels of aggregation. For instance, the fact “the sales summed up 22,000 products in the subcategory pet accessories in all the supermarkets in Spain in the first quarter of 2005” represents a measure (quantity = 22,000) of a sale with granularity quarter for the time dimension (first quarter of 2005), granularity country for the location dimension (Spain), and granularity subcategory for the product dimension (pet accessories).
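Under a relational (ROLAP) implementation, this fact would be the result of an aggregation query. The following sketch assumes hypothetical fact and dimension tables for the supermarket example:

    -- Quantity sold at granularity quarter / country / subcategory.
    SELECT SUM(s.quantity) AS quantity_sold
    FROM   Sales s
    JOIN   Product_Dim  p ON s.product_id  = p.product_id
    JOIN   Location_Dim l ON s.location_id = l.location_id
    JOIN   Time_Dim     t ON s.time_id     = t.time_id
    WHERE  p.subcategory = 'Pet Accessories'
    AND    l.country     = 'Spain'
    AND    t.year_no     = 2005
    AND    t.quarter_no  = 1;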

The structure of possible aggregations on each dimension forms a hierarchy from which we can select any granularity. For instance, as shown in Figure 2, the dimension time is hierarchical, with several paths from the most fine-grained resolution (hour) to the most coarse-grained resolution (year). The kinds of hierarchies that we may have give several names to the whole: simple star, hierarchical star, or snowflake. The example in Figure 2 is a snowflake, because there are many levels in each dimension and there are alternative paths.


[Figure 2. A fact shown in a data cube selected from a hierarchy of dimensions. The cube shows sales in thousands of units at resolution country (LOCATION: Spain, France, Austria, Germany, Italy, Portugal), subcategory (PRODUCT: e.g., parfums, pet meat, fresh vegetables, pet accessories, alcoholic drinks, garden and furniture), and quarter (TIME). The hierarchies are: PRODUCT: Category - Subcategory - Article; LOCATION: Country - Province - City - Supermarket; TIME: Year - Trimester - Month/Week - Day - Hour. The highlighted fact reads: “The first quarter of 2005, the company sold in Spain 22,000 articles in the subcategory of pet accessories.”]


The previous aggregations on the dimensions do not fix the condition on each dimension; they just settle the “data cube” resolution. As we can see in Figure 2, a data cube just specifies the level of resolution at which the selection can be made. In the example of Figure 2, we have a data cube with resolution country for the location dimension, subcategory for the product dimension, and quarter for the time dimension. It is on this data cube that we can make a selection for a specific location (Spain), product (pet accessories), and time (first quarter of 2005).

Finally, for many data warehouses it is not possible to organize all the information around one fact and several dimensions. For instance, a company might have one kind of fact and dimensions for sales, another for invoicing, and another for personnel. The solution is to create similar substructures for each of these areas. Each substructure is called a data mart. A data warehouse under the multidimensional model is then a collection of data marts.

Exploitation: OLAP Operators and Tools

A data model comprises a set of data structures and a set of operators over these structures. In the previous section we saw that the data structures in the dimensional model are the facts, with their measures, and the dimensions, with their hierarchies and attributes for each level. We also saw that a single “operator” can be defined by choosing a measure from the fact and a level for each dimension, forming a data cube, and then selecting the values of one or more dimensions.

Query tools on data warehouses under the multidimensional model usually have a graphical interface, which allows the user to select the data mart, to pick one or more measures for the facts (the aggregated projection), to choose the resolution in the dimension hierarchies, and to express additional conditions.

As we have discussed, there can be more than three dimensions. This would prevent the graphical representation of data cubes in many situations. In contrast, the relational model (as well as other classical models) has the advantage that the result of a query is a table, which can always be represented on a screen or on paper. The typical solution taken by query tools for the dimensional model is a hybrid between tables and cubes. For instance, the cube of Figure 2 can be represented in two dimensions, as shown in Table 2.

The idea is that two or more dimensions can be “dragged” to the left or to the top, so we can have an arbitrary number of dimensions in a two-dimensional table. This is feasible precisely because we are obtaining highly summarized (i.e., aggregated) information, so the number of rows and columns is manageable. These hybrid tables are usually called “reports” in OLAP tools.

With the multidimensional model and this hybrid data cube/table representation in mind, it is easier to understand some additional operators that are more and more common in OLAP tools, called OLAP operators (a SQL sketch follows the list):

Drill: de-aggregates the data (more fine-grained data) following the paths of one or more dimensions.

Roll: aggregates the data (more coarse-grained data) following the paths of one or more dimensions.

Slice & dice: selects and projects the data onto one or both sides of the report.

Pivot: changes one dimension from one side of the report to the other (rows by columns).
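In a ROLAP setting, these operators translate into systematic modifications of an aggregation query. A sketch, reusing the hypothetical supermarket tables from above:

    -- Starting report: units sold by country and quarter.
    SELECT l.country, t.quarter_no, SUM(s.quantity) AS units
    FROM   Sales s
    JOIN   Location_Dim l ON s.location_id = l.location_id
    JOIN   Time_Dim     t ON s.time_id     = t.time_id
    GROUP  BY l.country, t.quarter_no;

    -- Drill on LOCATION: group by the finer level l.city instead of l.country.
    -- Roll on TIME: group by the coarser level t.year_no instead of t.quarter_no.
    -- Slice & dice: add conditions such as WHERE l.country = 'Spain'.
    -- Pivot: swaps rows and columns of the report; only the presentation
    -- changes, not the underlying query.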

The most important trait of these operators is that they are “query modifiers” (i.e., they are used to “refine”
