The File Service allows users to manage files on the data platform. File Service provides features like upload, download, creation and retrieval of metadata records for files.
As part of metadata input, the APIs have the provision to describe the information associated with the file as well as describe the content of the file. The content description could be of use for any workflow that would like to extract or process the contents of the file.
The File service defines the following workflows:
- File Upload
- File Metadata
- File Download
Required roles: service.file.editors or service.file.viewers.
These endpoints generate a signed URL that a user or an application uses to upload a file for ingestion:
/v2/getLocation
This is a POST endpoint that creates a new location in the landing zone to upload a file. If a FileID isn't provided in the request, the File Service generates a Universally Unique Identifier (UUID) to be stored in FileID. If a FileID is provided and is already registered in the system, an error is returned. The generated signed URL has a maximum duration of 7 days.
The response body contains the FileID along with a Location object, which contains the signed URL. The user can upload the file using this signed URL.
Note:
- The FileID must correspond to the regular expression: ^[\w,\s-]+(.\w+)?$.
- This endpoint will be deprecated and should not be used. Use /v2/files/uploadURL to generate a signed URL for uploading a file.
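A client can check a FileID against the documented pattern before sending the request. The following is a minimal sketch, not the service's own validation code; the helper name `is_valid_file_id` is hypothetical. The pattern is copied verbatim from the note above, so the unescaped `.` matches any single character rather than only a literal dot.

```python
import re

# Pattern copied verbatim from the note above. The "." is unescaped,
# so it matches any single character, not only a literal dot.
FILE_ID_PATTERN = re.compile(r"^[\w,\s-]+(.\w+)?$")


def is_valid_file_id(file_id: str) -> bool:
    """Return True if file_id matches the documented FileID pattern."""
    return FILE_ID_PATTERN.match(file_id) is not None
```

For example, `is_valid_file_id("my-file.csv")` returns `True`, while `is_valid_file_id("file.tar.gz")` returns `False` because the pattern allows at most one trailing extension group.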
/v2/files/uploadURL
This is a GET endpoint that generates a temporary signed URL for uploading a file.
Note:
- File service only provides the URL, signifying the location, where the file can be uploaded. It is the responsibility of the user to upload the file using this URL.
- When uploading with the generated URL, the cloud provider may limit the maximum file size that can be uploaded at once. Users must check the limit set by the cloud provider they are uploading to.
- The signed URL expires after a set time that varies by environment. For example, the Azure implementation sets the expiry limit to 7 days.
- Once the generated URL expires, it can no longer be used to upload a file. The user should request a new signed URL.
- If the generated URL expires in the middle of a file upload, the upload continues and the file is uploaded.
The user receives a FileSource in the response. This is the relative path where the uploaded file will persist. Once the file is uploaded successfully, FileSource can be used to post the metadata of the file. The uploaded file goes to the landing zone and is automatically deleted if the metadata is not posted within 24 hours of uploading the file.
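The first two steps of the upload workflow can be sketched as request builders. This is an illustration under stated assumptions, not a reference client: `base_url`, `token`, and `partition_id` are placeholders, the `data-partition-id` header name is an assumption about how the partition ID is passed, and `PUT` is assumed as the upload method for the signed URL. The cloud provider may require a different method or extra headers. The requests are only built here, not sent.

```python
import urllib.request


def build_upload_url_request(
    base_url: str, token: str, partition_id: str
) -> urllib.request.Request:
    """Build (but do not send) the GET request for /v2/files/uploadURL."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v2/files/uploadURL",
        method="GET",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumed header name for the partition ID.
            "data-partition-id": partition_id,
        },
    )


def build_file_upload_request(signed_url: str, content: bytes) -> urllib.request.Request:
    """Build the request that sends the file bytes to the signed URL.

    PUT is an assumption; check the cloud provider's requirements.
    """
    return urllib.request.Request(url=signed_url, data=content, method="PUT")
```

Either request can then be sent with `urllib.request.urlopen(request)`. The FileSource returned by the first call is what the metadata request later refers to.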
The metadata schema allows users not only to define the attributes of the file, such as name and size, but also to define and describe the contents of the file. This is done using the FileContentDetails part of ExtensionProperties.
The File Service includes the content details in the file metadata records. The main consumers of this information are workflows that get triggered after a file is uploaded and is discoverable.
The schema for providing the metadata information for a file can be found here: Generic File Metadata Schema
This is the sample metadata for a CSV file: Sample Generic File Metadata
These endpoints are used to perform create, read and delete operations on file metadata:
/v2/files/metadata
This is a POST endpoint that creates a metadata record for a file that is already uploaded. The metadata is linked to the file via the FileSource provided in the request body.
If the FileSource attribute is missing in the request body, a "FileSource cannot be empty" error is returned. If no file is present at the given FileSource, the request fails with an error such as "Invalid source file path to copy from /osdu-user/1614784413120-2021-03-03-15-13-33-120/da92f52401dc4d1cb93515f159c110d4".
When the metadata record is successfully created in the system, the file is copied to the persistent zone and then deleted from the landing zone. The success response returns the Id of the file metadata record.
If the Name field of fileSource is given in the metadata payload, the user can download the file with the same name and content type using the downloadURL API.
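The file-specific part of the request body can be assembled as in the sketch below. The nesting under `data.DatasetProperties.FileSourceInfo` is an assumption and should be checked against the Generic File Metadata Schema. The helper name `build_metadata_payload` is hypothetical, and other fields the schema may require are omitted. The client-side check mirrors the "FileSource cannot be empty" error described above so the problem surfaces before the request is sent.

```python
from typing import Optional


def build_metadata_payload(file_source: str, name: Optional[str] = None) -> dict:
    """Build the file-specific part of a metadata request body."""
    if not file_source:
        # Mirrors the service-side error for a missing FileSource.
        raise ValueError("FileSource cannot be empty")
    source_info = {"FileSource": file_source}
    if name:
        # Optional: lets the file be downloaded later under this name.
        source_info["Name"] = name
    # Assumed nesting; verify against the Generic File Metadata Schema.
    return {"data": {"DatasetProperties": {"FileSourceInfo": source_info}}}
```

The `file_source` argument is the FileSource value returned by the upload URL request.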
/v2/files/{Id}/metadata
This is a GET endpoint that returns the latest version of the File metadata record identified by the given Id.
/v2/files/{Id}/metadata
This is a DELETE endpoint that deletes the File metadata record identified by the given Id and the file associated with that metadata record.
Note:
The endpoint does not modify storage records that hold a reference to the deleted File Metadata record.
The below endpoints are used to generate the signed URL used to download and access the already uploaded file content.
/v2/files/{Id}/downloadURL
This is a GET endpoint that generates a signed download URL for files that were already uploaded and whose metadata was also created. For all such files, users should provide the unique fileId. By default, the download URL is valid for 7 days. The user can override this value with the expiryTime query parameter when requesting a download URL. If the value provided for this parameter is more than 7 days, the expiration time is reset to 7 days. Accepted regex patterns for this parameter are "^[0-9]+M$", "^[0-9]+H$", and "^[0-9]+D$", denoting integer values in minutes, hours, and days respectively. The signed download URL allows the user to download and access the content of the file.
Note:
- Once the generated URL expires, it can no longer be used to download the file. The user should request a new signed URL.
- If the generated URL expires in the middle of a file download, the download continues and the file is downloaded successfully.
- If the Name field of fileSource is given at the time of metadata creation, this signed download URL allows the user to download the file with the same name and content type.
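The expiryTime rules above (three accepted patterns and a 7-day cap) can be mirrored on the client side to predict the effective lifetime of a download URL. This is a minimal sketch of the documented behavior, not the service's implementation; the function name `effective_expiry` is hypothetical.

```python
import re
from datetime import timedelta

MAX_EXPIRY = timedelta(days=7)
_UNITS = {"M": "minutes", "H": "hours", "D": "days"}
# Combines the three documented patterns: digits followed by M, H, or D.
_EXPIRY_PATTERN = re.compile(r"([0-9]+)([MHD])")


def effective_expiry(expiry_time: str) -> timedelta:
    """Parse an expiryTime value and cap it at the documented 7 days."""
    match = _EXPIRY_PATTERN.fullmatch(expiry_time)
    if match is None:
        raise ValueError(f"unsupported expiryTime: {expiry_time!r}")
    amount, unit = int(match.group(1)), match.group(2)
    try:
        requested = timedelta(**{_UNITS[unit]: amount})
    except OverflowError:
        # Values too large for timedelta are certainly above the cap.
        return MAX_EXPIRY
    return min(requested, MAX_EXPIRY)
```

For example, `effective_expiry("30M")` yields 30 minutes, while `effective_expiry("10D")` is capped to 7 days. Lowercase units such as `"5h"` are rejected because the documented patterns are case-sensitive.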
/v2/getFileLocation
This is a POST endpoint that returns the Location (signed URL) and Driver (vendor name) for the FileId given in the request body.
Note:
- This endpoint will be deprecated and should not be used. Use /v2/files/{Id}/downloadURL to generate a signed URL for downloading a file.
The File Service implementation performs a general check of the validity of the authorization token and partition ID before the service starts generation of a location. For accessing the file metadata, legal tags and ACL associated with the file are validated.
However, File Service does not validate the attribute values passed in the payload. It is the user's responsibility to pass the right value by checking the description and pattern of each attribute in the schema (the schema reference can be found in the reference section below). File Service also doesn't verify whether a file upload happened or whether the user started ingestion after uploading a file.
The File service doesn't look inside the file to validate the content within.
File Service publishes the status of File Metadata creation to statuschangedtopic, which can be used to track the File dataset details and the status of the operation. statuschangedtopic is a pub/sub topic configured in the cloud. Consumer services can subscribe to this topic to receive status change events.
Status data is distributed across two tables that track whether a dataflow has finished and whether it succeeded or failed. One holds the dataset details and the other holds the overall status of that dataflow journey.
- DataSet Details - A dataset can be anything that contains data. For example, a File is one type of dataset.
- Status - Status holds the status of File Metadata creation.
The following are the types of status and dataset details that File service publishes:
Below is an example of Data Set Details for a record that was created successfully. Here, datasetId is taken from the file metadata record id and datasetVersionId is taken from the file metadata record version.
| correlation-id | datasetId | datasetVersionId | datasetType | recordCount | timestamp |
|---|---|---|---|---|---|
| 123xxx456 | osdu:dataset--File.Generic:123 | osdu:dataset--File.Generic:123:123456 | FILE | 1 | 176890000 |
Below is the sample status that gets generated against the correlation id. Here the Status table contains recordId and recordIdVersion where 'record' represents the file metadata record.
| correlation-id | recordId | recordIdVersion | stage | status | message | errorCode | userEmail | timestamp |
|---|---|---|---|---|---|---|---|---|
| 123xxx456 | | | DATASET_SYNC | IN_PROGRESS | Metadata store started | 0 | abc@xyz.com | 176890000 |
| 123xxx456 | osdu:dataset--File.Generic:123 | osdu:dataset--File.Generic:123:123456 | DATASET_SYNC | SUCCESS | Metadata store completed successfully | 0 | abc@xyz.com | 176890000 |
Data Set Details contains data such as datasetId and datasetVersionId, fetched from the file metadata record. If file metadata creation fails, the information required to create the Data Set Details cannot be determined. Hence, Data Set Details cannot be published when metadata creation has failed.
When metadata creation fails in the Storage service, File service propagates the error message it receives from the Storage service.
When the failure occurs in File service itself, File service publishes its own error message in the failure status.
| correlation-id | recordId | recordIdVersion | stage | status | message | errorCode | userEmail | timestamp |
|---|---|---|---|---|---|---|---|---|
| 123xxx456 | | | DATASET_SYNC | IN_PROGRESS | Metadata store started | 0 | abc@xyz.com | 176890000 |
| 123xxx456 | | | DATASET_SYNC | FAILED | createOrUpdateRecords.records[0].data: Record data cannot be empty | 500 | abc@xyz.com | 176890000 |
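A consumer subscribed to the status topic typically needs the final outcome for each correlation-id. The sketch below assumes each event has already been decoded into a mapping whose keys match the Status table columns (`correlation-id`, `status`). The actual message format on the topic is not specified here, and the function name `summarize_status` is hypothetical.

```python
from typing import Dict, Iterable, Mapping

TERMINAL_STATUSES = {"SUCCESS", "FAILED"}


def summarize_status(events: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map each correlation-id to its terminal status, or to the last
    non-terminal status seen if no terminal status has arrived."""
    summary: Dict[str, str] = {}
    for event in events:
        correlation_id = event["correlation-id"]
        # A terminal status is never overwritten by a late IN_PROGRESS event.
        if summary.get(correlation_id) in TERMINAL_STATUSES:
            continue
        summary[correlation_id] = event["status"]
    return summary
```

The terminal-status check is used instead of sorting by timestamp because, as in the sample rows above, the IN_PROGRESS and final events can carry the same timestamp.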
All available File service APIs are listed in the following: Open API Specification