Usage of RVA and Snapshot Copy
Jaqui Lynch, Rita Leonard & Richard Milner
The purpose of this paper is to discuss the migration from 3390s to RVA2 at the University and the performance changes seen after doing so. The main focus of the paper is on the performance changes seen as a result of using Snapshot Copy on the new configuration.
In late 1996/early 1997 it became necessary for the University to upgrade its current disk configuration from 3390s to a newer technology with considerably more disk space. At the time, the RVA2 subsystem was chosen for a number of reasons. This paper covers the migration process and some of the performance changes seen after the migration, some of which can be attributed to the RVA2 itself and others which are a direct result of the implementation of Snapshot Copy.
The starting configuration prior to the upgrade was:
9672-R31 with 256mb, 15 escon channels and 18 parallel channels, running MVS/ESA v5.2 with OE and the Web Server
LCU 2 – 3990-3 with 64mb
4 x escon chpids
2 x 3390-B2C
LCU F – 3990-3 with 64mb
4 x escon chpids
This meant that there was about 101gb of DASD available in the subsystem. Given the added workload expected from tremendous growth in web-based applications (both the MVS Web server and UNIX servers), plus the additional disk needed, it was decided to grow the subsystem by at least 45gb.
After the upgrade the configuration was:
160gb RVA2 with 1gb cache
LCU 16 and 17
6 x escon chpids
64 x 3390-3 per controller
The configuration is referred to as a 2 by 5+2+1 array: each of the two arrays contains 5 data disks, 2 parity disks and 1 spare, for a total of 10 data disks. This provides 45.339gb of usable disk which, with compression and a maximum of 75% utilization, equates to around 160gb of effective DASD storage.
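The capacity arithmetic above can be checked with a short sketch. This is only a back-of-the-envelope illustration: the 3.5:1 figure is the low end of the compression/compaction range quoted later in this paper, and the per-array disk counts come from the 5+2+1 description.

```python
# Back-of-the-envelope check of the RVA2 capacity figures quoted above.
ARRAYS = 2
DATA_PER_ARRAY, PARITY, SPARE = 5, 2, 1   # the "5+2+1" layout per array
data_disks = ARRAYS * DATA_PER_ARRAY      # 10 disks actually hold user data

USABLE_GB = 45.339        # usable space quoted for the subsystem
COMPRESSION_RATIO = 3.5   # low end of the 3.5:1 to 4.5:1 range quoted later

effective_gb = USABLE_GB * COMPRESSION_RATIO
print(data_disks, round(effective_gb))   # 10 159 -- i.e. the marketed ~160gb
```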
There were several reasons why RVA2 was chosen when the decision was made to migrate. The 3390s were starting to show their age, and there was concern that when an HDA was lost, two volumes had to be recovered. Although not many HDAs had been lost, it was expected that failures would become more frequent as the technology aged, and several temporary errors had already shown up, adding to the concern. Also causing concern was the number of performance problems experienced relating to IOSQ time, which suggested a need to split data across more actuators to reduce queuing. Of less concern to Systems Programming, but crucial to Operations, was the need for floor space in the computer room. With all of this in view, it seemed like a good time to upgrade the technology rather than just adding more 3390 disk.
The efficiency of the 3390 technology was also called into question. There was a great deal of wasted space in the subsystem. There were five volumes dedicated to paging and several more with extremely busy catalogs and crucial files on them. In order to keep IOSQ at a minimum, no other data could be put on these volumes.
Apart from space, there was also a problem with the batch production window. Many of the jobs that were running had built in backups and restores that could not be eliminated. The batch window was extending into the online production day, negatively impacting our online users. This was deemed totally unacceptable and had to be addressed.
The RVA2 technology addressed the problems above in several ways. First, it allowed the systems programmer to define many more virtual volumes than there are physical HDAs; data is only written out to physical disk when it is actually stored. This allows the user to take advantage of what is termed Elastic Capacity. So, in the case of the paging volume, only the 100mb actually written consumes physical space, while the rest of that volume's defined capacity remains in the pool from which all volumes draw.
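The Elastic Capacity behaviour can be pictured as simple thin provisioning: volumes are defined at full size, but only tracks actually written consume space from the shared pool. The sketch below is a toy illustration; the volume name, track granularity and class names are invented, and 2838mb is simply the nominal size of a 3390-3.

```python
# Thin-provisioning sketch: many virtual volumes share one physical pool,
# and a volume only consumes pool space for tracks actually written.
class Pool:
    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = 0

class VirtualVolume:
    def __init__(self, name, defined_mb, pool):
        self.name = name
        self.defined_mb = defined_mb   # what MVS sees (a full 3390-3 here)
        self.pool = pool
        self.written = {}              # track number -> data

    def write_track(self, track, data, track_mb=1):
        if track not in self.written:  # a new track draws from the pool
            self.pool.used_mb += track_mb
        self.written[track] = data

pool = Pool(capacity_mb=45_339)
page_vol = VirtualVolume("PAGE01", defined_mb=2838, pool=pool)
for t in range(100):                   # only ~100mb of paging ever written
    page_vol.write_track(t, b"page data")
print(pool.used_mb)   # 100 -- the other 2738mb stays in the shared pool
```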
RVA2 also addressed the issue of floorspace due to its small size and low weight. It is also a RAID-6 box, able to survive the loss of 2 HDAs before disaster strikes: the two parity disks ensure data can still be recovered should a second disk die before the first is fixed. This was felt to be vital for availability reasons.
More importantly the University was able to address the issues relating to performance, specifically IOSQ and the batch window, as well as the additional storage space needed.
Migration to the new technology was very straightforward and only one problem was experienced, which was unrelated to RVA2: one of our products did not support 4-digit device addresses and had to be upgraded. Where required, moves were done in standalone time with no users on the system. A three-part methodology was chosen for the migration as follows:
1. This was used to move data where the volume name did not have to remain the same and where the data could be copied while the system was still running. This was used for nonsystem volumes wherever possible.
2. This was used to move data where a one-pack system had to be used (i.e. critical system datasets such as the spool) or where the volume name could not be changed.
3. This method was used to move individual datasets to different volumes, allowing us to spread the I/O and IOSQ around a little better.
After the migration was complete (the whole process took about 2 months), performance data was reviewed for the system. This data was taken during our peak period during the day. The following changes were noted:
              Pre RVA2   Post RVA2
CPU               65         85
I/O int. rate    150        200
Response        15ms        5ms
IOSQ             3ms        0ms
Pend            .1ms       .1ms
Disconnect       9ms        1ms
Connect          3ms        4ms
The above values are averages across the subsystem, but it is obvious from both the numbers above and the drop in response time complaints that there was a significant improvement in performance. It is difficult to know how much of the increase in CPU was directly related to the RVA reducing I/O wait, because the tail end of the migration coincided with the start of one of the University's peak production cycles. It should also be noted that connect time is suspected to have increased because of the change from 4+4 escon channels to 6 escon channels.
When using RVA it is important to monitor the statistics shown on the operator console. It is recommended that the subsystem never run with more than 75% of the usable disk in use, as freespace collection will run at a higher priority at this point and this will degrade the performance of the subsystem. On the console this is the %net load which is also referred to as net capacity load. Net capacity load is affected not just by how much physical disk is available, but also by compression and compaction ratios, the frequency and location of track updates, and the time copies are retained or refreshed.
Compression is performed on all data, while compaction involves the removal of the inter-record gaps that exist for CKD data; since the data is stored in FBA format on the disk, these gaps are not necessary. The combination of the two typically runs from 3.5:1 to 4.5:1.
Freespace collection is a microcode recycling process that gathers partially empty array cylinders and writes all the valid data to a new array cylinder until it is full. This reduces fragmentation and regains freespace for writing. Disk space is in one of three states:
a. Used - referred to on the panel as %net load
b. Collected - freespace available for write (%coll free sp)
c. Uncollected - free tracks marked for collection (%uncoll free)
Current utilization statistics on the subsystem as reported on the console are:
DA capacity 45.339gb
%net load 46.473%
%coll free sp 50.905%
%uncoll free 2.622%
Thus it can be seen that there is still plenty of disk space left (about 28 percentage points of net capacity load below the 75% threshold, or roughly 45gb of effective storage) before the next upgrade is due.
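Those console figures can be sanity-checked directly: the three percentages should account for essentially the whole array, and the gap to the 75% net-load threshold gives the remaining headroom. The arithmetic below simply reuses the numbers quoted above, with 160gb taken as the effective (compressed) capacity.

```python
# Console percentages reported for the subsystem.
net_load, coll_free, uncoll_free = 46.473, 50.905, 2.622

# The three states (used / collected / uncollected) cover the array.
total = net_load + coll_free + uncoll_free
assert abs(total - 100.0) < 0.1

THRESHOLD = 75.0  # stay below this to keep freespace collection cheap
headroom_pts = THRESHOLD - net_load        # ~28.5 points of net capacity load
effective_gb = 160 * headroom_pts / 100    # headroom in effective storage
print(f"{headroom_pts:.1f} points, ~{effective_gb:.0f}gb effective headroom")
```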
The next step was to implement Snapshot Copy to see if it would help with the Production backlog. In simplest terms, Snapshot allows instant ‘backups’ of files or full disk volumes without the use of tape or, to a large degree, physical disk space. The version of Snapshot available at the time did not handle VSAM datasets, so a methodology was decided on (more later) to circumvent that.
Briefly, RVA2 records the physical location of disk data tracks in a directory similar to a VTOC. Snapshot, when invoked, creates a duplicate copy of that directory. When data is modified, the original track is kept intact and a new track is written with the modified data; only the live directory is updated with the location of the new track, while the Snapshot copy is not changed. In the event that a restore of the original data is required, the Snapshot copy is copied back over the live directory. The pointer to the original unmodified track is thereby restored, and the modified track is released during idle space recovery.
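The pointer mechanics described above can be sketched in a few lines. This is a toy model, not the RVA2 microcode: the "directory" is just a mapping from logical tracks to physical track slots, and a snap duplicates only that mapping, never the data.

```python
import copy

# Toy model of Snapshot: a directory maps logical tracks to physical
# track slots; a snap copies only the directory, never the data.
backend = {0: "orig track 0", 1: "orig track 1"}   # physical track slots
directory = {0: 0, 1: 1}                           # logical -> physical
next_slot = 2

snap = copy.copy(directory)    # the Snapshot: just duplicated pointers

def write(logical, data):
    """An update writes modified data to a NEW physical track and
    repoints only the live directory; the snap still sees the old track."""
    global next_slot
    backend[next_slot] = data
    directory[logical] = next_slot
    next_slot += 1

write(0, "modified track 0")
print(backend[directory[0]])   # "modified track 0" (live view)
print(backend[snap[0]])        # "orig track 0" (snapshot view)

# A restore copies the snapped pointers back; the modified track is
# simply left for idle space recovery to reclaim.
directory.update(snap)
print(backend[directory[0]])   # "orig track 0" again
```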
Snapshot is a vast improvement over traditional backups for many reasons. The first is that there are not two full copies of the data – new tracks are only written out when data changes – so the DASD requirements for testing, recovery, etc. are vastly reduced. It also uses less time (both elapsed and CPU) and performs fewer I/Os, which yields substantial reductions in job run times. This means that enqueues and locks are held on datasets being backed up for much shorter periods, dramatically reducing the impact of those backups on production.
Snapshot can be used to snap a copy of data during a batch cycle; batch can then continue while the snapped copy is being copied to tape. Release 1.2 can also be used to back up VSAM and DB2 files, which improves the time taken to take image copies, and DB2 is down for a much shorter period for this process than previously.
Early discussions regarding the nature and level of testing resulted in a decision to use the Student Record System backups as a test. The reasons for this were numerous but all came down to the fact that this was where the largest impact from the product would be seen. This was a system where it was not unusual to run between 6 and 20 backups during the batch cycle and the batch window was no longer long enough for these jobs. All testing for the pilot project was done at the volume level.
To prepare for the pilot project, the following steps were taken:
At this point, it appeared that there were a couple of issues regarding the System Catalog that needed to be addressed. During one of the tests, an active test file was deleted from one of the volumes, which caused its entry to be removed from the Catalog. When the ‘snapped’ copy of this volume was restored, the file was restored but its catalog entry, of course, was not. This gave rise to speculation that things such as ‘high used RBAs’ might be out of sync and that the restored volume might not be usable. Research and further testing proved that the volume was fully usable, though some manual intervention is needed if a file is deleted and then restored from a snapped image.
The "manual intervention" is actually an IDCAMS job that checks for the existence a catalog entry and "recatalogs" if it does not find it. The issue is not so much manual effort during the restore, but the need to keep control cards (which note specific dsn's, volsers) in sync with reality.
The concerns about RBAs, etc. are a non-issue as this information is kept in the volume VVDS and not in the catalog itself.
Next, the Production files used in the Student System were moved to the three isolated volumes and the test copies were deleted. The Snapshot jobstream was then put into Production and has been used nightly since then.
The time to run a Student System backup has been reduced from approximately 20 minutes to less than 2 minutes with no loss of control information or reporting. In an average night there are five or six backups taken for this system. Based on an average of 6 backups per night, what was 2 hours of backup time is now 12 minutes, a saving of 1.8 hours of batch processing per night. During periods of heavy batch processing for this system there can be as many as 12 backups in a night, giving more than 3.5 hours of saved time.
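The savings quoted above follow directly from the per-backup times:

```python
# Per-backup elapsed times for the Student System stream.
tape_min, snap_min = 20, 2       # minutes per backup: tape vs Snapshot
for backups in (6, 12):          # a typical night and a heavy night
    saved = backups * (tape_min - snap_min)
    print(f"{backups} backups: {saved} min saved (~{saved / 60:.1f} hours)")
# 6 backups save 108 min (~1.8 hours); 12 backups save 216 min (~3.6 hours)
```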
However, the major savings in time come in those instances where a restore becomes necessary. The time to restore files from a Student System backup has been reduced from 60 minutes to less than 2 minutes. The restore process also retains all of the control information and reporting of the tape process. (The forty-minute difference between the tape backup and the tape restore is accounted for by the fact that the traditional tape restore replaces only the base files for this system, after which several alternate index files need to be rebuilt; with the Snapshot version, the alternate index files are restored just as the base files are.) Obviously, this puts the University in position to continue with batch Production nearly a full hour sooner than with the tape method.
One slight negative is the loss of the "built-in reorg" that came with the older-style IDCAMS restores: the Snapshot restore brings back CI splits, extents, etc. exactly as they were at the time the Snap was taken.
The hidden saving in the above is the number of tape volumes (3 per execution) which are not being mounted, unmounted, and filed away in the tape library after a night’s production. It also reduces the load on the tape robot at the busiest time of the night. Further, the operator is free to perform other tasks while this stream is running since they do not have to respond to any tape mount requests for it.
Once the appropriate steps were outlined, the time to set up a backup job to use Snapshot instead of tape was minimal. Since the pilot at least two more Production systems have been converted and have provided similar timesavings in the batch window.
Snapshot is playing a major role in the effort to expand Boston College’s on-line service availability (specifically UIS and QUEST) from its current schedule to one that exceeds 20 hours per day.
Use of Snapshot also removes many of the restrictions seen today on backup times and scheduling. It can be used at any time that there is a need to capture a file or volume image in a ‘static’ condition for use in any application. Full volume backups of the system are run with a number of restrictions regarding times and CICS status, etc. With Snapshot, the images of the volumes can be captured at a designated time and the backups then run against the images at more convenient times. If there is a need to do extensive reporting ‘between updates’ in a system, updating does not need to be suspended. The reporting can be done from a ‘snapped’ file or volume.
From the data above it is clear that RVA2 and Snapshot have been a very positive experience for the University. In particular, the time for backups and restores in the production systems was substantially reduced and peaks in the batch processing were smoothed out. The associated savings in tape mounts and tape volumes were also extremely promising. Additional performance improvements were seen, with a dramatic reduction in subsystem response time due to the reduction in IOSQ and disconnect times.
Additional benefits realized are the reduction in floor space, better real utilization of disk space (no wasted space) and the freeing up of the operator’s time. The University is currently in the process of installing the VSAM version of Snapshot in order to take advantage of the facilities it offers at the VSAM dataset level.