Last updated: 2020-07-24
If not using compression and not reading from the cloud, multi-threading requests won't have much effect on performance because most of the cost is file I/O and (at least on my test machine) parallel reads don't speed things up.
A small speedup might be seen by parallelizing value scaling, value type conversion, and/or buffer reshaping but this might not be worth the trouble.
A small speedup might be seen by local parallelizing the decimation algorithm. I suspect the added overhead of parallelizing will kill that benefit. But if it doesn't, this tweak is actually the simplest one to implement.
This is relevant when compression and/or cloud I/O is involved.
The current situation is that application code is allowed to read from multiple threads if it wants to. Provided the file is opened read only. This by itself can speed up cloud reads and decompression. Files open for write must be accessed from just one thread at a time. Generating low resolution bricks is done sequentially.
If more parallel execution is desirable this can be done internally in the OpenZGY library (providing that the caller reads or writes large regions at a time) or by the application (sending read or write requests from multiple threads).
| Read | Write | |
|---|---|---|
| Large requests, MT in library | Can be implemented best with a wrapper. |
Can be implemented but needs refactoring. |
| Parallel requests from app | Already works | Can be implemented by locking. Higher risk. |
Handling parallel write requests from the application layer will require setting locks several places. See details later in this document. This approach has higher risk than the alternatives.
Internal parallelizing of large write requests can be handled by locally parallelizing the compression, reshape, and convert loops. Some refactoring is needed but I had this scenario in mind when I wrote the code. So I know what needs to change. The modified code will have larger memory usage as it will need to make another copy of the write buffer.
Handling parallel read requests from the application layer is supported.
Internal parallelizing of large read requests is probably best achieved by wrapping the read function with code that splits, executes, and joins the result. Which means this would just be a convenience so the application doesn't need to write that code itself. The main drawback, both for this and for the application doing its own split/read/merge. is that reading from the cloud can be more efficient with lather bricks.
Locally parallelizing selected code, such as doing only the decompress algorithm, may require significant refactoring. So if you want to do a single threaded large read (can in some cases be more efficient for cloud access) but also want to parallelize decompression one brick at a time then this will be expensive to code. Local parallelizing of reshape and convert is possible but will probably not have much effect on performance.
| Read | Compute | Write | |
|---|---|---|---|
| Issue larger requests | Moderate work | Not useful. | Moderate work. |
| Issue parallel requests | Simple, for 4 requests. | Simple, if it helps. | Probably not feasible. |
When the library calculates and writes low resolution bricks the code that does so acts like an application. It has several opportunities for multi threading.
Locally parallelize the cpu bound decimation algorithm. Simple to do, but it is uncertain whether the overhead negates the benefits. TODO: measure first, implement later.
Issue larger requests both for read and write. This might help with increasing the block size for cloud read access. If the bulk layer is changed to do internal parallelization then there may be an additional benefit there. The low resolution code is designed to issue as large requests as possible given the memory constraints. Some extra work might be required to enable this.
Issue bulk read requests in parallel. It looks like I can fairly easily issue 4 read requests but still just one write request in parallel. More is technically possible but looks neither feasible nor useful. Because it negates the "larger request" bullet above. The reason 4 requests is possible is that the current code will send 4 sequential requests every time it needs data. Those can easily be made in parallel. Consolidating them into one is in this particular case not trivial. See comments in the source code.
To make the write() method thread safe for application code it will need to place a lock whenever it needs to access a shared resource. The risk of forgetting a lock somewhere is high. I believe it is safer (but still not risk free) is to by default a per-file lock should be held whenever doing any bulk access. The lock can then be temporarily dropped then re-acquired when doing simple operations that clearly do not access shared data. The risk now is to make sure that general lock is set in the first place.
The general rule will be to set the lock when transitioning from the api layer to the implementation layer. The lock is optional when accessing metadata that is immutable once the file has been opened. A more detailed design is shown in the following figure.
To avoid deadlocks, all locks will be assigned a rank. Code that is holding a lock with rank N is forbidden to call code that directly or indirectly might cause a lock with the same or higher rank to be acquired. Or to put it simpler: If you want multiple locks then you need to acquire them in a specific order.
Note that I won't try to check this at runtime. That is rather difficult. Also not worth he trouble when there are just a few locks.
The API layer will not have any locks.
The implementation layer will have one lock per open file in ZgyInternalBulk::_bulk_lock (rank 10) to protect reading and writing data bricks and for reading and changing the lookup tables.
The implementation layer can also use ZgyInternalBulk::_misc_lock (rank 0) as a very short lived protection of some data structures. Rank 0 implies that no other locks may be acquired while holding _misc_lock, while it is safe to hold any other lock already.
The meta part of the implementation layer doesn't in general need any locking because most the information is immutable after the file has been opened. The exceptions are handled by finding out which API functions might trigger changing the value, and then declaring that as thread safe. Lookup tables (sort of part of the meta data) are protected by _bulk_lock.
This means it is safe to access metadata both with and without holding _bulk_lock.
Since _bulk_lock is owned by ZgyInternalBulk but protects data in two of the meta headers (lookup tables), it is the responsibility of ZgyInternalBulk members to make sure the lock is held when accessing those. And, conversely, no other code than members of ZgyInternalBulk are allowed to access them. TODO verify.
Any code that implements caching is responsible for protecting the cache. If a lock is needed then this will have a lower rank than _bulk_lock. This means that _bulk_lock may or may not be held when the cache is accessed. AND, it means that while the cache lock is held, the code is forbidden to call any method that could potentially cause _bulk_lock to be acquired. Since the file layer sits below the bulk- and meta layer this ought to be doable.
A file back-end may declare itself as either thread safe or not. If not thread safe then _bulk_lock will probably be held when doing reads or writes. But don't count on it. Related: xx_threadsafe() applies to both read and write. Do I needs more granularity?
Several methods in ZgyReader, ZgyWriter, ZgyUtil will be declared as not thread safe. No other operations, even those methods normally considered thread safe, may be pending when these are executed. The code will not set a lock because that could easily deadlock. There might be a check though. Set a flag saying that one of these operations are in progress and by whom. The flag is a simple variable because it is protected by the main lock. Anybody that sets or re-acquires the main lock should then immediately check whether the "owner" flag is set and is not this thread. Throw an error if the check fails.
The following methods might need a private lock.
These methods will set the file-level lock: MAKE SURE they are not accidentally called from below with the lock already held. Yes I have heard about recursive locks. But those are a major code smell and suggests your call graph looks like spaghetti.
For purposes of locking, the following will be treated as part of the API layer. I.e. they will not lock, and it is not allowed to call them with the lock already set.
These methods are at a lower level. Unless otherwise noted the lock should already be held when they are invoked.
One important exception: Inside ZgyInternalBulk::readToExistingBuffer() the lock will be dropped before _file->xx_readv() is invoked, if the backend is thread safe. If not then the lock should be dropped inside the deliverance lambda when the read returns. Either way the ZgyInternalBulk::_deliverOneBrick() method expects to be called without the lock.
If the application uses multi threaded writes of data that is not aligned to brick boundaries then the library needs to process these as read/modify/write. This is problematic with respect to multi threading because the read part might need data from a brick waiting to be flushed from another thread. There are ways around this but all are somewhat unpalatable.
Do locking at block level, so a read from a brick that is pending write is blocked. This adds both overhead and complexity.
Do locking of rectangular regions, done at the same place where the regular lock is acquired. This locking is coarser and thus has less overhead but more expensive to code.
Only allow muli threaded write when the output is a compressed local file. Compressed files are not allowed to (or at least really should not) write misaligned data. This is what the old ZGY-Public library does. And in the old code the stand alone compress application was the only one allowed to do parallel writes. Incidentally, am I sure this was respected by Petrel? I guess that question is getting rather academic now.
Only allow writing misaligned data if the application explicitly promises not to use multi threading. Throw an error if the app tries to multi-thread without making that promise. I don't like this solution because it makes the application even more aware of the ZGY brick size, Which ought to have been an implementation detail.
Whenever a misaligned request is seen, place a super-lock that eventually causes all the current writes to complete. Then perform the r/m/w in a single thread, then open up for regular traffic. This is simpler than "lock rectangular regions" but does more coarse grained locking. The main caveat with this approach is that if the application makes one misaligned write request it is reasonable to assume that most subsequent requests will also be misaligned. Which means the execution ends up completely serialized anyway.
When a zgy file is on the cloud or meant to be moved to the cloud it needs to have its bricks written in a specific order for performance. This gets tricky to handle if doing writes in parallel. A compromise may be needed. Application must write full brick-columns (ensuring that columns stay together but giving less opportunities for parallelization. The file on the other hand allows individual columns to be stored at random locations. Giving somewhat but not terribly worse performance.
Another possibility is to have the application explicitly state which regions will have non-constant data. If it kows. All bricks can then be pre-allocated in the right order before any real writes start. This does make the api more difficult to use, though. This is generally considered a bad thing.