OpenZGY/C++ API and Internals (ALPHA)
Access seismic data stored in ZGY format.
Algorithms for decimation
OpenZGY multi resolution vs. compression

Executive summary

Handling compressed multi resolution ZGY files is one of the items on my TODO-list that will take some time. There are problems both in the old ZGY library (needs a full resolution temporary file) and in OpenZGY (higher LOD levels are useless due to noise).

Copyright

Copyright 2017-2020, Schlumberger

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Intended Audience

This document is written for developers that need to understand some of the finer points about ZGY. It contains notes on generating multi resolution (LOD) data. This is not the place to start if you just want an introduction to OpenZGY.

Also, Wow, I am actually writing a design document before starting to code. I may need to tweak it later though.

OpenZGY multi resolution vs. compression

Combining the features of multi resolution and compression creates some very non trivial issues. This document tries to explain them, and describe 4 approaches to solving them.

  • Plan A (currently used in ZGY-Classic) handles compression by first storing the file uncompressed and then reading back every uncompressed brick (both full resolution and low resolution) and compressing then. The compression stage can be done either by invoking a separate application or as a library call.

  • Plan B creates each low resolution level LOD n by reading and decimating LOD n-1. This happens automatically when the file is closed. Noise gets accumulated in higher LOD levels to the point where very high levels become useless. The statistics and histogram will also contain one (but only one) level of noise. For uncompressed files this method is slightly better than plan C because there will be less fragmentation in the file, so it will read slightly faster if on the cloud. Also the complete histogram will be available once the lod 2 generation start up. The decimation algorithm might want that information. I am unsure whether the benefits outweigh the cost of maintaining and testing two algorithms.

  • Plan C (currently used in OpenZGY) is a smarter version of B where LOD 2 and higher is generated from an in-memory cache. All low resolution bricks get two levels of noise.

  • Plan C1 (no current plans) is like C in that it keeps data in memory until entire 643 bricks can be written out. But the decimation algorithms are run for all levels as early as possible. This means that LOD1 bricks will be as least 643 (can be larger in Z). The LOD2 bricks will be at least 323 and 8 of these have to be collected in memory before being written out. For LOD3, size at least 163 and 64 need to be collected. LOD8 and above will need to revert to a simpler algorithm because there will not be the 163 mini-brick needed to do anything fancier. The benefit of this plan is that generating all LOD levels in parallel allows for fancier algorithms where e.g. LOD2 can be computed directly from LOD0 data. Also it uses less memory. The drawback is that the Python version will run slower due to the need to operate on smaller numpy arrays. Also it is more work to implement.

  • Plan D (no current plans) is more efficient than C and also creates LOD 1 from the in-memory cache. And statistics from the uncompressed LOD 0. All bricks, both full and full resolution, have only one level of noise. However, Plan D places an additional burden on application code. Possibly tje zgy library must ask for data in the precise order it wants instead of having the application write in am arbitrary order. No plans to implement this one unless we are sure that some applications will use it.

Application considerations

This applies to both uncompressed and compressed data.

With plan A, B, and C it is recommended but not required that the application will write the entire survey with bricks sorted so the last dimension varies fastest and the first slowest. Other orderings will work but the file will then take more time to read if placed on the cloud.

Plan D requires that the ZGY library is in control of the ordering. ZGY will ask for data by invoking a callback instead of application data pushing data to ZGY. In practice this will be Z-order or Hilbert order.

Cloud block size optimization

This applies to both uncompressed and compressed data.

The speed with which data can be read from the cloud depends somewhat on choices made when the file was written. Which in turn may depend on which method is used to compute the low resolution data.

Assuming that the application writes in the recommended order, the full resolution bricks can normally be read back as contiguous if the caller requests (64, n*64, all) sections. This allows for faster reads from the cloud. Low resolution bricks are normally slightly less efficient with requests of (64, 64, all), (64, 128, all) and (128, 128, all) read contiguously and everything else needing multiple blocks. Note that (128, 128, all) is actually more efficient here than at LOD 0.

Plan A is optimal at every LOD level. Plan B and C work as described above; optimal at LOD 0 and slightly less so for low resolution. Plan D has the slightly less efficient layout also for full resolution.

Amount of I/O for compression

All alternatives will write each each low resolution brick exactly once. Plan A will also write each uncompressed full resolution and low resolution brick to a temporary file. Plan A, B, and C need to read back some or all the data that has been written, but only once. The cost of those reads are probably low compared to the write cost.

Comparison of plans

Plan Summary I/O Brick noise Stats noise Cost Ordering LOD 0 Ordering LOD >0 Memory consumption Main drawback
A Separate compression. Writes twice, reads once. 1 level (best) No Fairly simple. Optimal. Optimal. Max 4.5 brick columns @ LOD 0. Lots of temporary disk space.
B Read LOD n-1 to make LOD n. Writes once, Reads once. LOD >1 gets very bad 1 level. Fairly simple. Optimal. slightly slower. Max 4.5 brick columns @ LOD 0. Noise in LOD > 1
C More in memory cache. Writes once, Reads LOD 0 once. 2 levels in all LODs. 1 level. Tricky. Optimal. slightly slower. ~10 brick columns @ LOD 0 Fragmentation. Deferred histogram.
C1 All LODs made from LOD 0. Writes once, Reads LOD 0 once. 2 levels in all LODs. 1 level. Tricky. Optimal. slightly slower. ~5 brick columns @ LOD 0 Slow in Python. Slightly slow in C++.
D ZGY pulls data from app. Writes once, never reads. 1 level (best). No. Tricky. Slightly slower. Slightly slower. ~10 brick columns @ LOD 0 More work for apps.

Memory requirements

The memory requirement for plan C and D is higher because the algorithm is recursive and 5 brick columns need to be held at every recursion level. But, the brick column for LOD n will be half the size of the brick column at LOD n-1. Rounded up to a multiple of 64 (or 2 ?)

TODO lists

The implementation of plan C is in openzgy/impl/genlod.py. Several tweaks are possible but might not offer much value. The suggestions are documented below.

Custom Buffer class to replace numpy (C++)

Possible optimization for C++ but not for Python: Slightly over-allocate data buffers, maintaining both a reserved size and an actual size. This saves time if padding needs to be added to the data. Padding is needed because many decimation algorithms require the input data to have an even size. To implement this feature I need a specialized Buffer class, which I am not planning for the Python implementation where I just use np.ndarray. But in C++ I need to write a class Buffer anyway. So I might as well add that feature. Technically the logic to keep track of reserved and used data could also be done inside the _calculate() method. But that would complicate it ridiculously.

My custom buffer class should also be able to represent a constant-value buffer. In Python a scalar represents constant value (which annoyingly requires the size to be passed along as well). A regular brick is passed as an np.ndarray. C++ has stronger typing so this won't work as well there. Note that there is another way of handling this: Use virtual padding. All access to the buffered data wold go via a buffer::get(i,j,k) method which knows how to add padding. However, the performance implications of reading just one sample at a time are worrying.

More re-use of buffers. (C++)

Possible optimization which might happen: More re-use of buffers. Especially if plan B gets implemented since this does sequential access. In this case the "over-allocate data buffers" will be needed as well. So this is probably only feasible for the C++ version.

Scatter/gather buffers.

Possible optimization which probably won't happen: In a custom Buffer class the actual contents might be stored as a list of 64^3 bricks. Or whatever the basic brick size is in this ZGY file. This will reduce memory fragmentation. It can also make reading and writing data more efficient since the buffer won't need to be reshaped. On the other hand the decimation algorithm normally used for lod 1, which is the one that handles the most data, requires reshaping to get full traces. Reshaping is also needed when pasting 4 brick-columns into one. Because the brick size in the I direction is usually file.bricksize[0]. Which means a decimated version of that data is half the size it needs to be. So I am unsure whether this helps performance at all. Another issue is that the api would need separate read and write methods accepting data already broken down into bricks.

Larger chunks.

Possible optimization which I am considering: Read and write larger chunks. The initial implementation has each invocation of _calculate(lod=0) read 4 brick columns. Only two of these will normally be adjacent in the file. So the block size for reading, assuming the default brick size, is 64*64*N samples. With 1024 int8 samples per trace this means reading just 8 MB at a time. Writing at higher lod levels have even smaller block sizes because the size of one brick column gets halved for each successive lod level. On write this isn't actually a problem because data is buffered if it is going to be written to the cloud. But the low resolution data for different lod levels end up interleaved in the file. So this smaller block size becomes the largest block size that can be read from the file later. One solution is to pretend the file has larger bricks, by multiplying the size in the J direction with a power of 2. And if this reaches the survey size and we can still afford to allocate more memory than the brick size in I can also be multiplied by a power of 2. This should work (I think) without much change since the algorithm doesn't need to know the real file brick size. It just needs to trust that the size is chosen such that no writes will write just a part of a brick with more data being added later. Another option is to choose a new block size at each lod level, keeping the buffer size in MB roughly the same at each level. Note that both cases has some drawbacks with constant bricks not always being detected as such. See the next bullet. So while larger buffers are better this is only true up to a point. Try to aim for max 1 GB at each level which ends up writing 256 MB at a time. Another reason to not go overboard with the block size is that this makes the progress bar less useful. And, the current implementation hard codes that exactly 4 bricks of lod-1 are needed for one block of lod n. See _choose_blocksizes().

Smaller granularity of all-constant checks.

Possible optimization that probably won't happen: Maintain an is_constant flag for each brick column, not just for the entire buffer. Only needed when the optimization in the previous bullet is in force. When is_constant=True the decimation algorithm of that buffer doesn't need to be run at all. Larger buffers mean that many brick columns with only dead traces end up being decimated anyway because a neighboring brick column did have real traces. Fortunately this only makes a difference for surveys that are very oddly shaped. Such as an L-shape. Or in the case where the survey boundary is set much larger than the bounding box needed to contain it.

Implement plan B.

Possible optimization which probably won't happen: Implement an alternative non recursive algorithm (plan B) that reads bricks at level N to generate bricks at level N+1. This would be more efficient because I/O block size can be chosen freely and can also be larger since the code won't need to hold on to one set of buffers for each lod layer. The ordering of bricks in the output file would be optimal. The problem with this approach is that it won't work for compressed data. Because noise would accumulate for each level. I would then need to write maintain both algorithms. A single algorithm is bad enough, thank you.

Reduce memory footprint.

Possible optimization which probably won't happen: When making the 4 recursive calls, do a partial paste into the result buffer as soon as the data is returned. This reduced the memory footprint by 3/8 because the code only needs to hold on to one, not four, of the smaller buffers. But it complicates the code. Not least because the size of the result buffer might not be known until all 4 calls have completed.

Better use of the all-constant flag.

Possible optimization which probably won't happen: For generating lod 2 and higher the decimation algorithm can be run 4 times, one for each block from the recursive call. Both the readlod and readlod+1 data then needs to be pasted later. The benefit is saving time in the case where some but not all 4 brick columns get flagged as all constant. The drawback is slightly more code. And the decimation algorithm gets called more often with smaller sizes. Which can be inefficient in Python but probably doesn't make any difference for C++.