Virtualization and Cloud Computing

The verdict is in and Data Virtualization (DV) is here to stay. This is driven by many factors, such as the maturation of vendor technologies (e.g., heterogeneous join optimization, row-level security, and high availability), increases in computing power, and a decrease in the time end users are willing to wait for access to data. Over the past few years, companies like Informatica, Composite Software, Denodo Technologies, and IBM have led the industry charge to:

  • Expand data virtualization reach into new industries
  • Enable Agile Business Intelligence (BI) with data virtualization
  • Integrate with Cloud applications, such as Salesforce.com, Workday and MS Dynamics
  • Integrate with Cloud data services, such as Windows Azure, LexisNexis, and data.gov
• Provide access to a larger variety of data, including video, email, JSON documents, and other Big Data sources often stored in platforms such as Hadoop
  • Shift from departmental to enterprise-wide deployments

Data Virtualization promises to address many of the challenges that have beleaguered data organizations for decades. Organizations may be attracted to DV for its ability to:

  • Provide a single data access point for structured and unstructured data
  • Unify data security across the organization (and outside the organization)
  • Enhance development team agility when embarking on data integration projects
  • Decouple applications and analytics from the physical data structures, allowing data infrastructure changes while minimizing end-user impact
  • Deliver real-time operational data and processed/cleansed data to support up-to-the-minute data requirements
  • Enable federated/heterogeneous joins of data residing in disparate locations/sources
  • Create a bridge between "big data" sources (e.g., those residing in Hadoop) and relational data sources (e.g., those residing in Databases)
Any one of the above capabilities may be compelling enough to justify Data Virtualization, but when combined, the potential benefits are undeniable.

However, if data virtualization is deployed without thoughtful planning, it can create challenges with manageability, usability, data quality, and performance. Think about how MS Access databases and MS Excel spread-marts have proliferated in many organizations. Without an eye on data architecture and data governance, such technologies can propagate unchecked until organizations second-guess the validity of their data and spend considerable time resolving discrepancies between sources. With DV, this risk is similar, and arguably greater.

    With data virtualization, an organization can expose departmental data assets (such as a file or a table) to a much broader audience with greater ease than ever before. For this reason, and others, we are forced to think about data virtualization from an enterprise perspective. Fortunately, this is a challenge that has been explored for decades and many of the concepts of Enterprise Data Management (EDM) are directly transferable to DV.

    ".we are forced to think about data virtualization from an enterprise perspective."

    If you are considering implementing data virtualization, we encourage you to consider the following 8 steps:

    • 1. Architect from an Enterprise Perspective

Data virtualization solutions need to meet evolving and often dichotomous requirements of users across the organization. Data virtualization development can become less agile, less performant, and harder to manage as more layers and objects are added. The more duplicate business logic and dependencies that exist, the longer the testing cycles become and the harder it is to troubleshoot performance issues. To mitigate such challenges, work with your data architecture teams and consider approaches such as the following (a sketch of the layered-view approach appears after this list):

      • Implement a layered view approach to isolate business logic
      • Consider allowing user access to multiple layers of your data virtualization model
• Create development standards that include naming conventions and common rules for reusability and layer isolation
• Maintain your data virtualization models in a CASE tool, such as ERwin, PowerDesigner, or ER/Studio
      • Push down as much of the processing as possible to the source system
      • Implement caching strategies to reduce performance bottlenecks
      • Involve your data architecture team to assess anticipated performance challenges, particularly for data sources that are sizable and for scenarios where relatively costly heterogeneous joins are required.
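To make the layered-view idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the DV engine. In a real deployment the same pattern would be expressed as virtual views defined in the DV platform over remote sources; every table and view name below is hypothetical.

```python
# A minimal sketch of the layered-view approach, with sqlite3 standing
# in for the DV engine. Business logic is isolated in an integration
# layer; end users are granted access only to the consumer layer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source layer: a one-to-one image of the physical source table.
    CREATE TABLE src_crm_customer (id INTEGER, full_name TEXT, region TEXT);

    -- Integration layer: renaming and standardization live here, so
    -- consumers never depend directly on the source structure.
    CREATE VIEW int_customer AS
        SELECT id            AS customer_id,
               full_name     AS customer_name,
               UPPER(region) AS region_code
        FROM src_crm_customer;

    -- Consumer layer: the only layer exposed to end users.
    CREATE VIEW cons_customer_by_region AS
        SELECT region_code, COUNT(*) AS customer_count
        FROM int_customer
        GROUP BY region_code;
""")
conn.execute("INSERT INTO src_crm_customer VALUES (1, 'Ada Lovelace', 'emea')")
print(conn.execute("SELECT * FROM cons_customer_by_region").fetchall())
# -> [('EMEA', 1)]
```

Because the consumer layer depends only on the integration layer, a change to the physical source (say, renaming a column) is absorbed by updating one integration view, leaving end-user queries untouched.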
    • 2. Coordinate with your Data Governance organization

If you currently have a data governance organization (formal or otherwise), it is highly recommended that you socialize data virtualization concepts and capabilities before you get started, and that you leverage any standards, processes, data definitions, and business rules that have already been defined. Since data virtualization provides a gateway into corporate data assets, it should be deployed with the cooperation of such an organization and should be governed as a data asset.

    • 3. Establish usage guidelines and train Development teams

Organizations are wise to establish guidelines as to when data virtualization technologies, rather than traditional methods, should be used to access data. Since every organization (and its data sources) is unique, there is no single approach that works for all. One reasonable approach is to start the first data virtualization implementation/project by leveraging tool-specific best practices and then incrementally refine the approach over time to best suit the needs of the organization.

      Data developers should be aware of the data virtualization capabilities and, in many cases, should have basic training on the technology. Data virtualization should be considered during any data development initiative. At a minimum, coordination should occur to consider exposing new data assets through the data virtualization platform.

    • 4. Determine organizational responsibilities for the Data Virtualization platform

Since data virtualization provides the capability to deploy web services, query operational systems, and provide integrated data for analysis, organizations often struggle to determine who is responsible for supporting the platform. Rather than trying to assign ownership to one group, consider a matrixed approach across the following groups:

      • Data architects
• DBAs
      • ETL administrators
      • Application developers
      • Production support
      • System administrators

      With such a matrixed approach, it is helpful to identify one group that has primary ownership and accountability for the administration of the data virtualization software. It is also helpful to identify one group that has primary responsibility for development standards and production migration of data virtualization objects.

    • 5. Coordinate with Information Security

Your information security team should have a strong voice in how data virtualization security is managed (LDAP/AD, table-driven, basic, etc.), because data virtualization makes it much easier to expose a greater breadth of data and data sources to more users.

      • If data is being exposed to new user types, Data Security should determine which types of regulations (e.g., HIPAA, SOX, etc.) might apply.
• Where applicable, Data Security should also determine whether specific columns (e.g., personally identifiable information (PII) such as SSNs) should be masked within the data virtualization tool.
• Data virtualization provides the ability to limit which rows a specific user or user group can see within a table. Information security may leverage this capability (a sketch combining masking and row-level filtering appears after this list).
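As an illustration, the sketch below combines column masking and table-driven row-level security in a single secured query, again using sqlite3 as a stand-in for the DV engine. The table, view, and entitlement names are all hypothetical.

```python
# A hedged sketch: mask PII columns and filter rows against an
# entitlement table, the way a DV platform's secured views typically do.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_hr_employee (ssn TEXT, name TEXT, dept TEXT);

    -- Entitlement table: which user group may see which department.
    CREATE TABLE sec_entitlement (user_group TEXT, dept TEXT);

    INSERT INTO src_hr_employee VALUES ('123-45-6789', 'Ada', 'FINANCE');
    INSERT INTO src_hr_employee VALUES ('987-65-4321', 'Grace', 'HR');
    INSERT INTO sec_entitlement VALUES ('finance_analysts', 'FINANCE');
""")

def secured_employees(user_group: str):
    """Mask the SSN column and restrict rows to the caller's entitlements."""
    return conn.execute(
        """
        SELECT 'XXX-XX-' || substr(e.ssn, -4) AS ssn_masked,  -- column masking
               e.name, e.dept
        FROM src_hr_employee e
        JOIN sec_entitlement s
          ON s.dept = e.dept AND s.user_group = ?              -- row-level security
        """,
        (user_group,),
    ).fetchall()

print(secured_employees("finance_analysts"))
# -> [('XXX-XX-6789', 'Ada', 'FINANCE')]
```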
    • 6. Collaborate with your Data Warehouse/Business Intelligence department

      Data Warehouse/Business Intelligence (DW/BI) teams should be aware of the DV capabilities and, in many cases, should receive training on the technology.

      Data Warehouse/Business Intelligence organizations are wise to generate guidelines as to when DV technologies should be used versus when more traditional methods like ETL should be used. By clarifying when one tool should be used versus another, we reduce/eliminate turf conflicts between organizations and improve uptake of the data virtualization technology.

      Two novel approaches to using data virtualization to increase agility and reduce cost for DW/BI applications include:

• a) Federated, multi-source data environment: A DV technology may access a data warehouse, in essence mirroring all of the existing consumable tables. DV can then extend this view to include other data sources (see figure below) to create a federated data capability, as sketched in the code example after this list. This can increase team agility and reduce the costs associated with physically moving data. Such an architecture should be carefully vetted, with a particular focus on performance, join paths, data cleanliness, and data completeness across the disparate sources.
• b) Virtual Data Marts: Another way in which data virtualization may augment DW/BI is by replacing some of the data marts with data virtualization objects (views). The following figure shows how a traditional DW might feed virtual data marts, all within the DV platform; in the figure, the Enterprise Data Warehouse remains essential. Again, such an approach can increase team agility and reduce the costs associated with physically moving data, and it should be carefully vetted with a particular focus on performance.
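The sketch below illustrates the federated pattern from (a): two independent sources are simulated with separate in-memory databases, the aggregation is pushed down to the warehouse source (per the pushdown guidance in step 1), and the final heterogeneous join happens in the federation layer. All table and column names are hypothetical.

```python
# A minimal sketch of a heterogeneous (federated) join. When a join
# cannot be pushed down, the DV engine fetches from each source and
# joins the results itself; two sqlite3 databases simulate the sources.
import sqlite3

dw = sqlite3.connect(":memory:")       # stands in for the data warehouse
dw.execute("CREATE TABLE fact_sales (customer_id INTEGER, amount REAL)")
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)",
               [(1, 250.0), (2, 75.5), (1, 10.0)])

crm = sqlite3.connect(":memory:")      # stands in for an operational CRM
crm.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])

# Pushdown: the aggregation runs inside the warehouse source, so only
# small, pre-aggregated rows cross the wire to the federation layer.
totals = dw.execute(
    "SELECT customer_id, SUM(amount) FROM fact_sales GROUP BY customer_id"
).fetchall()

# Federation step: join the aggregated rows to CRM names in the DV layer.
names = dict(crm.execute("SELECT id, name FROM customer").fetchall())
for customer_id, total in totals:
    print(names.get(customer_id, "<unknown>"), total)
# e.g. -> Ada 260.0 / Grace 75.5
```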
    • 7. Expose Data Virtualization metadata to users

Most data virtualization tools can display and export data lineage information. This is extremely useful to data developers and business users who need to determine where a specific piece of data came from. Such lineage information is a key piece of metadata that can provide value to an organization. If your organization leverages metadata standards and/or applications, consider how the DV metadata will fit into the overall strategy.
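As a simple illustration of what exported lineage metadata enables, the sketch below walks a hypothetical view-dependency graph from a consumer view back to its physical sources; real DV tools export comparable structures through their own metadata interfaces and reports.

```python
# A small sketch of lineage metadata: a dependency graph mapping each
# DV object to the objects it is built from. The contents here are
# hypothetical and reuse the layer names from the step 1 sketch.
dependencies = {
    "cons_customer_by_region": ["int_customer"],
    "int_customer": ["src_crm_customer"],
    "src_crm_customer": [],            # physical source: no parents
}

def lineage(view: str, graph: dict) -> list:
    """Return every upstream object the given view depends on, in order."""
    upstream = []
    for parent in graph.get(view, []):
        upstream.append(parent)
        upstream.extend(lineage(parent, graph))
    return upstream

print(lineage("cons_customer_by_region", dependencies))
# -> ['int_customer', 'src_crm_customer']
```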

    • 8. Consider how Data Quality fits

      In some instances, data virtualization can be leveraged to provide controlled and secured access to operational data. This provides an opportunity for Data Quality teams.

• By providing controlled source system access, data virtualization may be leveraged by data quality teams to analyze and resolve data quality issues (see the profiling sketch after this list).
• If an organization has chosen to manage data quality downstream from the data source (not in the source system), exposing operational data may resurface previously resolved data quality issues. Therefore, when implementing data virtualization, one must also consider where data quality processing occurs and what tolerance the user base has for uncleansed data.
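As a sketch of the first point, the profiling queries below compute simple null and format counts against a virtualized operational view, with sqlite3 again standing in for the DV endpoint; the view and column names are hypothetical.

```python
# A hedged sketch of data quality profiling through the DV layer:
# completeness and format checks expressed as ordinary queries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE v_operational_customer (id INTEGER, email TEXT);
    INSERT INTO v_operational_customer VALUES (1, 'ada@example.com');
    INSERT INTO v_operational_customer VALUES (2, NULL);
    INSERT INTO v_operational_customer VALUES (3, 'not-an-email');
""")

row = conn.execute("""
    SELECT COUNT(*)                                            AS total_rows,
           SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)      AS null_emails,
           SUM(CASE WHEN email NOT LIKE '%@%' THEN 1 ELSE 0 END) AS bad_format
    FROM v_operational_customer
""").fetchone()

print(dict(zip(("total_rows", "null_emails", "bad_format"), row)))
# -> {'total_rows': 3, 'null_emails': 1, 'bad_format': 1}
```

Because the same queries can run against any source the DV layer exposes, a data quality team can profile operational systems without requesting direct database credentials for each one.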

It should now be clear that an enterprise perspective can greatly improve the success of a data virtualization implementation. Data virtualization is an enabling technology, but left unchecked it can become yet another runaway IT challenge, particularly when it is being considered for enterprise-wide deployment. The trick is to provide enough architectural and data management structure without stifling data virtualization uptake and the innovation that comes with it. In most cases, the long-term success of DV hinges on the organization's ability to address each of the 8 steps listed above.