Storing and Harvesting Massive Data Sets
Can We Have It All?
Three-D seismic and the explosion of processing techniques that have continually made the data more useful are a vital part of today’s exploration and production projects — but this proliferation has created a problem for oil companies: What to do with all that data?

Companies are increasingly struggling with this costly issue.
As a result, new storage management techniques have been developed, and those are now evolving into knowledge management for large databases.
Hierarchical storage management basically allows companies to move data between online disk systems and near-line tape systems, depending on how frequently a dataset is used.
"This
is an automated system that ensures your most recent and most accurate
data is always on line at any given time," explains Jess Kozman,
service delivery manager for Schlumberger Information Solutions.
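In practice, such a policy reduces to a simple rule over each file's last-access time. The Python sketch below is only an illustration of the idea, not any vendor's implementation; the directory paths, the idle-time threshold and the plain file move standing in for a tape-library transfer are all assumptions.

```python
import os
import shutil
import time

# Hypothetical locations: online project disk and a near-line staging area
# that an automated tape library would sweep onto tape (assumption for illustration).
ONLINE_ROOT = "/projects/seismic"
NEARLINE_ROOT = "/nearline/seismic"

# Business rule (assumed threshold): release files not accessed within this many days.
MAX_IDLE_DAYS = 32


def migrate_idle_files(online_root=ONLINE_ROOT, nearline_root=NEARLINE_ROOT,
                       max_idle_days=MAX_IDLE_DAYS):
    """Move files whose last access is older than the threshold to near-line storage."""
    cutoff = time.time() - max_idle_days * 86400
    for dirpath, _, filenames in os.walk(online_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.getatime(src) < cutoff:
                # Preserve the project-relative path on the near-line side
                # so the file can be recalled to its original location later.
                rel = os.path.relpath(src, online_root)
                dst = os.path.join(nearline_root, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)
                print(f"released {rel} to near-line storage")


if __name__ == "__main__":
    migrate_idle_files()
```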
However, that concept is now moving to the next level, into what Kozman calls hierarchical knowledge management.
Kozman, who presented a paper titled "Where Has All My Data Gone? Case Histories of Hierarchical Knowledge Management for Large Datasets" at the 2003 AAPG international meeting in Barcelona, cites several projects where hierarchical knowledge management was successfully implemented to solve data management issues, including:
- One of the first successful implementations of hierarchical content management, at Chevron Overseas Petroleum Inc. in San Ramon, Calif., at what is now the ChevronTexaco Upstream Technical Computing Center.
The management of large static trace files in the interpretation environment began in 2000 with the installation of a five terabyte automated tape library to manage application trace files stored on network-attached storage for six worldwide business units.
"Business
rules were put in place to allow the release to near-line media
of trace files not accessed in 32 days from over 500 seismic projects
containing up to 30,000 physical files," he said. "In
addition, a second copy of the archived tapes was used to provide
backup and disaster recovery capabilities."
The system has grown to over 11 terabytes and over 100,000 physical files.
"According
to the ChevronTexaco project manager, seven terabytes of the data
exists only on near-line tape, and she recently wrote, ‘I shudder
to think of how we would have handled all that inactive data’
without the near line systems," he said. "She indicated
the previous system involved hours of work creating offline tapes
‘destined to be lost in desk drawers,’ and would have
eventually required the purchase of more network attached storage
at substantially higher cost than that of tape."
The near-line system also saves money, reduces pressure on the backup system, provides a comfort level for disaster recovery and makes it easier to retrieve files to a project than from a traditional UNIX backup system.
"Standard
backup tapes at ChevronTexaco are kept for only three months, so
projects have been saved by having the data on near line tapes after
discovering it had been deleted completely from disk at some unknown
time, he said. "Since the technical center is billed internally
for backup services, taking the managed disks out of the backup
schedule also saves money."
- At Conoco in Lafayette, La., a near-line tape robotic system was installed to manage the deepwater Gulf of Mexico exploration division’s seismic project storage from 1999 to 2003. Online storage had grown to approximately 2.2 terabytes, driven by up to a terabyte per year in new delivered data and the addition of reprocessed and specialty processed volumes.
"Conoco
had purchased over 30,000 square kilometers of 3-D seismic surveys
in five years to evaluate over 300 leases in the deepwater Gulf
of Mexico," he said. "Manual intervention was required
to manage the data volumes by physically backing up and deleting
projects, taking interpretation time away from geoscientists."
The projected cost of managing this growth by simply buying additional network-attached storage was calculated to reach approximately $10 million by 2000 — an unacceptable figure in a cost control environment.
The solution: a 25 terabyte automated tape library and a hierarchical project content management system that used business rules to move seismic trace volumes that had not been accessed in over 30 days to near-line media, where they could be recovered when needed using four high-speed, high-reliability drives working in parallel.
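The other half of such a workflow is the recall step: when an interpreter needs a released trace volume, it is staged back from near-line media to online disk. A minimal sketch of that step follows, again with assumed paths and file names, and with an ordinary file copy standing in for the tape robot.

```python
import os
import shutil

ONLINE_ROOT = "/projects/seismic"     # assumed online disk location
NEARLINE_ROOT = "/nearline/seismic"   # assumed near-line (tape-backed) location


def recall_file(rel_path):
    """Stage a released file back from near-line storage to online disk."""
    src = os.path.join(NEARLINE_ROOT, rel_path)
    dst = os.path.join(ONLINE_ROOT, rel_path)
    if os.path.exists(dst):
        return dst                      # already online, nothing to do
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)              # a real system would drive the tape library here
    return dst


# Hypothetical usage: recall the trace volumes a project needs before interpretation starts.
for volume in ["gom_block_42/stack.sgy", "gom_block_42/migrated.sgy"]:
    print("online copy at", recall_file(volume))
```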
"Conoco
determined independently that the implementation of a hierarchical
near line storage strategy saved over $7 million over three years
compared with the cost of buying additional network attached storage,"
Kozman said.
- The latest development in the hierarchical storage of project content is the move from data and information to knowledge.
A hierarchical knowledge management system manages large static data files, dynamic user files containing project information and archive files. Such a system was installed by German Oil & Gas Egypt (GEOGE) in Cairo, where approximately 2.6 terabytes of interpretation data is moved automatically between online and network-attached storage devices and a robotic automated tape library, he said.
Files from all three categories are identified by access patterns and segregated onto separate storage partitions and pools of tape media. File usage patterns are continuously monitored to gauge the effectiveness of the background processes and allow tuning of the system.
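One way to picture this segregation is a small classifier that buckets files by idle time, plus a report that tallies the pools so the thresholds can be tuned over time. The sketch below is illustrative only; the seven-day and 90-day cutoffs and the project root are assumptions, not figures from the GEOGE installation.

```python
import os
import time
from collections import Counter

# Assumed idle-time thresholds (in days) separating the three categories the
# article describes: dynamic user files, static data files and archive candidates.
DYNAMIC_MAX_IDLE = 7
STATIC_MAX_IDLE = 90


def classify(path, now=None):
    """Assign a file to a storage pool based on how recently it was accessed."""
    now = now or time.time()
    idle_days = (now - os.path.getatime(path)) / 86400
    if idle_days <= DYNAMIC_MAX_IDLE:
        return "dynamic"    # active user/project files, kept on fast online disk
    if idle_days <= STATIC_MAX_IDLE:
        return "static"     # large, rarely changing trace data, near-line candidates
    return "archive"        # project-milestone snapshots bound for the tape pool


def usage_report(root):
    """Count files per pool so the migration thresholds can be tuned."""
    counts = Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            counts[classify(os.path.join(dirpath, name))] += 1
    return counts


if __name__ == "__main__":
    print(usage_report("/projects/seismic"))   # assumed project root
```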
The system provides effective storage management, backup and disaster recovery capabilities and a method to capture and archive the knowledge contained in projects at key milestones in the project life cycle, including preparation for application upgrades, according to Kozman.
"Prior
to the installation, GEOGE had more than one terabyte of data on
disk with only manual backups of project seismic data and no scheduled
backups for data outside of application projects," he said.
"There was no in-place disaster recovery system. Ninety software
application system projects were spread over 36 disk partitions
and 30,000 physical files. Most of the disks were 95 to 100 percent
full."
Some seismic trace files in interpretation projects had not been accessed by users in up to nine months, and manually created backup files were occupying more than 200 gigabytes on a high-end, network-attached storage device.
"Corrupted
disks required system administrators to physically move data, one
project at a time, and reload projects from tape — resulting in
days of lost work," he said. "Plus, a 15 percent growth
in seismic data over three quarters was predicted and disk usage
had begun to grow exponentially."