Storing and Harvesting Massive Data Sets

Can We Have It All?

Three-D seismic and the explosion of processing techniques that have continually made the data more useful are a vital part of today’s exploration and production projects — but this proliferation has created a problem for oil companies: What to do with all that data?

Companies are increasingly struggling with this costly issue.


As a result, new storage management techniques have been developed, and those are now evolving into knowledge management for large databases.

Hierarchical storage management allows companies to move data between online disk systems and near-line tape systems, depending on how frequently a data set is used.

"This is an automated system that ensures your most recent and most accurate data is always on line at any given time," explains Jess Kozman, service delivery manager for Schlumberger Information Solutions.

However, that concept is now moving to the next level, into what Kozman calls hierarchical knowledge management.
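
The core of such a system is a simple policy, what Kozman calls a business rule, that scans project storage and releases files that have not been accessed recently to a cheaper near-line tier. The short Python sketch below is only an illustration of that logic, assuming hypothetical directory paths, a 30-day idle threshold and SEG-Y trace files; a real installation would call the tape library's own migration interface rather than moving files between directories.

# A minimal sketch of a hierarchical storage management "business rule."
# The paths, the 30-day threshold and the .sgy file pattern are assumptions
# for illustration; an actual system would hand migration off to the tape
# library's API instead of using shutil.move.
import shutil
import time
from pathlib import Path

ONLINE_DISK = Path("/projects/seismic")      # hypothetical online (disk) tier
NEARLINE_POOL = Path("/nearline/staging")    # hypothetical near-line (tape) tier
DAYS_IDLE_THRESHOLD = 30                     # release files not accessed in 30 days

def migrate_idle_trace_files() -> None:
    """Move trace files that have not been accessed recently to near-line storage."""
    cutoff = time.time() - DAYS_IDLE_THRESHOLD * 86400
    for trace_file in ONLINE_DISK.rglob("*.sgy"):          # assumed SEG-Y trace files
        if trace_file.stat().st_atime < cutoff:            # last access time from the filesystem
            destination = NEARLINE_POOL / trace_file.relative_to(ONLINE_DISK)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(trace_file), str(destination)) # stand-in for a tape-library call
            print(f"Migrated {trace_file} -> {destination}")

if __name__ == "__main__":
    migrate_idle_trace_files()

Run on a schedule, a rule like this keeps the most recently used data on disk while quietly releasing idle volumes to tape, which is the behavior Kozman describes.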

Kozman, who presented a paper titled "Where Has All My Data Gone? Case Histories of Hierarchical Knowledge Management for Large Datasets" at the 2003 AAPG international meeting in Barcelona, cites several projects where hierarchical knowledge management was successfully implemented to solve data management issues, including:

  • One of the first successful implementations of hierarchical content management, at Chevron Overseas Petroleum Inc. in San Ramon, Calif., at what is now the ChevronTexaco Upstream Technical Computing Center.

    The management of large static trace files in the interpretation environment began in 2000 with the installation of a five-terabyte automated tape library to manage application trace files stored on network-attached storage for six worldwide business units.

    "Business rules were put in place to allow the release to near-line media of trace files not accessed in 32 days from over 500 seismic projects containing up to 30,000 physical files," he said. "In addition, a second copy of the archived tapes was used to provide backup and disaster recovery capabilities."

    The system has grown to over 11 terabytes and over 100,000 physical files.

    "According to the ChevronTexaco project manager, seven terabytes of the data exists only on near-line tape, and she recently wrote, ‘I shudder to think of how we would have handled all that inactive data’ without the near line systems," he said. "She indicated the previous system involved hours of work creating offline tapes ‘destined to be lost in desk drawers,’ and would have eventually required the purchase of more network attached storage at substantially higher cost than that of tape."

    The near-line system also saves money, reduces pressure on the backup system, provides a comfort level for disaster recovery and makes it easier to retrieve files into a project than a traditional UNIX backup system does.

    "Standard backup tapes at ChevronTexaco are kept for only three months, so projects have been saved by having the data on near line tapes after discovering it had been deleted completely from disk at some unknown time, he said. "Since the technical center is billed internally for backup services, taking the managed disks out of the backup schedule also saves money."

  • At Conoco in Lafayette, La., a near-line robotic tape system was installed to manage the deepwater Gulf of Mexico exploration division’s seismic project storage from 1999 to 2003. Online storage had grown to approximately 2.2 terabytes, driven by up to a terabyte per year in newly delivered data and the addition of reprocessed and specialty-processed volumes.

    "Conoco had purchased over 30,000 square kilometers of 3-D seismic surveys in five years to evaluate over 300 leases in the deepwater Gulf of Mexico," he said. "Manual intervention was required to manage the data volumes by physically backing up and deleting projects, taking interpretation time away from geoscientists."

    The cost of managing this growth simply by buying additional network-attached storage was projected to reach approximately $10 million by 2000, an unacceptable figure in a cost-control environment.

    The solution: a 25-terabyte automated tape library and a hierarchical project content management system that used business rules to move seismic trace volumes not accessed in more than 30 days to near-line media, where they could be recovered when needed using four high-speed, high-reliability drives working in parallel.

    "Conoco determined independently that the implementation of a hierarchical near line storage strategy saved over $7 million over three years compared with the cost of buying additional network attached storage," Kozman said.

  • The latest development takes the hierarchical management of project content beyond data and information to knowledge.

    A hierarchical knowledge management system manages large static data files, dynamic user files containing project information and archive files. Such a system was installed by German Oil & Gas Egypt (GEOGE) in Cairo, where approximately 2.6 terabytes of interpretation data is moved automatically between online and network-attached storage devices and a robotic automated tape library, he said.

    Files from all three categories are identified by access patterns and segregated onto separate storage partitions and pools of tape media. File usage patterns are continuously monitored to gauge the effectiveness of the background processes and allow tuning of the system; a simplified sketch of this sort of classification follows these case histories.

    The system provides effective storage management, backup and disaster recovery capabilities and a method to capture and archive the knowledge contained in projects at key milestones in the project life cycle, including preparation for application upgrades, according to Kozman.

    "Prior to the installation, GEOGE had more than one terabyte of data on disk with only manual backups of project seismic data and no scheduled backups for data outside of application projects," he said. "There was no in-place disaster recovery system. Ninety software application system projects were spread over 36 disk partitions and 30,000 physical files. Most of the disks were 95 to 100 percent full."

    Some seismic trace files in interpretation projects had not been accessed by users in up to nine months, and manually created backup files were occupying more than 200 gigabytes on a high-end, network-attached storage device.

    "Corrupted disks required system administrators to physically move data, one project at a time, and reload projects from tape — resulting in days of lost work," he said. "Plus, a 15 percent growth in seismic data over three quarters was predicted and disk usage had begun to grow exponentially."
