Documentation update WIP

draft_specifications
meerkat 2021-10-10 13:07:46 +11:00
parent dfdfd716dd
commit d873cf3abb
11 changed files with 140 additions and 64 deletions

@ -2,33 +2,46 @@
**marti** stands for metadata reconciliation for transfer information. **marti** stands for metadata reconciliation for transfer information.
The objective is the provide transfer information for high volume data such as The objective is to provide transfer information for high volume data such as
in files. The document (files) can be transferred via HTTPS, SFTP, message queue, in files. The files can be transferred via HTTPS, SFTP, message queue,
network share or other. The transfer information being described here does not network share or other. The transfer information being described here does not
need to arrive via the same channel and could be received via email or need to arrive via the same channel and could be received via email or
even synchronous / asynchronous API. The transfer information does not dictate or even synchronous / asynchronous API. The transfer information does not dictate or
determine how the data is formatted. determine how the data is formatted.
The transfer information can provide details on the document format, but in itself The transfer information can provide details on the file format, but in itself
it does not understand the data fomrat. it does not understand the data format.
**Note**: The terms file and document are intended to be interchangeable
throughout this documentation.
**marti** is intended to provide minimum basic information on the transfer with **marti** is intended to provide minimum basic information on the transfer with
ability to include additional optional information. The metadata reconciliation ability to include optional information. The metadata reconciliation
transfer document being described here will be referred to as the [Marti](Marti.md) transfer document being described here will be referred to as the [marti document](Marti.md)
document throughout this documentation. throughout this documentation.
The information is supplied as a separate document which could be another file The transfer information is supplied as a separate document which could be another file
or supplied via API by the publisher notifying the consumer(s). or supplied via API by the publisher notifying the consumer(s).
## Tools and Scenarios ## Tools and Scenarios
Tools and code snippets are provided to generate the information and then Tools and code snippets are provided to generate the transfer information and then
assist in reconcile the document contents once received. Refer to the assist in reconciling the document contents once received. Refer to the
programming folders for more details or [Tools](tools.md) for more general [source programming folders](source/) for more details or [Tools](tools.md) for more general
information information
[!div class="op_single_selector"]
- [Java](source/java/README.md)
- [golang](source/golang/README.md)
- [python](source/python/README.md)
- [powershell](source/powershell/README.md)
- [docker](source/docker/README.md)
## Transfer information ## Transfer information
The information in the **marti** document is summarised below. For more detailed
information see [marti definition](marti.md)
### Mandatory information ### Mandatory information
The mandatory information is: The mandatory information is:
@ -37,10 +50,9 @@ The mandatory information is:
* Unique identifier * Unique identifier
* Distribution list - See Distribution section summary below or detailed document [Distribution](docs/distribution.md) * Distribution list - See Distribution section summary below or detailed document [Distribution](docs/distribution.md)
### Optional information ### Optional information
The option information is: The optional information is:
* Description * Description
* Modified * Modified
@ -59,18 +71,20 @@ The option information is:
### Information extension ### Information extension
The information supplied can be extended by agreeing parties and there The information supplied can be extended by party agreement and there
are placeholders in the definition. are placeholders in the definition.
### Distribution ### Distribution
The distribution section can be repeated, but at least one must be included. The distribution section is intended to allow multiple data files to be
If the distribution is repeated it will comonly be for definiting grouped together. The distribution section can be repeated, but at least
multiple formats of the same data. one must be included. If the distribution is repeated it will commonly
be for defining multiple formats of the same data or batching of
different data together from the same extract process.
* Title * Title
* Unique identifier * Unique identifier
* Document name - If no download URL, then this will be the document name * Document name
* Issued date - When the document was made available. The date can include time * Issued date - When the document was made available. The date can include time
* Modified - When the document was created or modified. This is the date and time * Modified - When the document was created or modified. This is the date and time
* Size of document - The document size in bytes * Size of document - The document size in bytes
@ -79,8 +93,8 @@ multiple formats of the same data.
### Distribution optional ### Distribution optional
The following are some of the optional items in the distribution section. See [Distribution](dstribution.md) The following are some of the optional items in the distribution section. See [Distribution](docs/distribution.md)
for more items and details for more details
* Description * Description
* Download URL * Download URL
@ -88,5 +102,3 @@ for more items and details
* Format * Format
* Compression * Compression
* Encryption * Encryption
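
The mandatory and optional items listed in the README hunks above can be pictured as a small data structure. The following is a minimal sketch in Python, not the project's definition: the class names (`MartiDocument`, `Distribution`), the field names, and the use of plain strings for dates are all assumptions made for illustration. The diff elides part of each list, so only the items visible above are modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Distribution:
    """One distribution entry (hypothetical field names)."""
    title: str
    identifier: str
    document_name: str
    issued: str        # assumed to be an ISO 8601 date, optionally with time
    modified: str      # assumed to be an ISO 8601 date and time
    size_bytes: int    # document size in bytes
    # Optional items visible in the README hunk
    description: Optional[str] = None
    download_url: Optional[str] = None
    data_format: Optional[str] = None
    compression: Optional[str] = None
    encryption: Optional[str] = None


@dataclass
class MartiDocument:
    """Top-level transfer information (only the items visible in the diff)."""
    identifier: str
    distribution: List[Distribution]
    description: Optional[str] = None
    modified: Optional[str] = None

    def __post_init__(self) -> None:
        # The README states the distribution section can be repeated,
        # but at least one must be included.
        if not self.distribution:
            raise ValueError("at least one distribution entry is required")
```

Rejecting an empty distribution list at construction time means a publisher-side tool fails early rather than emitting transfer information that a consumer cannot reconcile against.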

@ -20,3 +20,6 @@ can adjust if they resonate with your circumstances,
5. [Load quality metrics support](quality.md) 5. [Load quality metrics support](quality.md)
6. [Comparison of marti definition](comparison.md) 6. [Comparison of marti definition](comparison.md)
7. [References](references.md) 7. [References](references.md)
[!INCLUDE [marti High Level Definition](../marti.md)]

@ -1,4 +1,42 @@
# Comparison of marti definition # Comparison of marti document definition
The use of metadata definitions is not unique and examples
exist in many different situations. Some are standard and open
while others are closed.
Some open standards are EXIF data for pictures, SQL DDL definitions
for databases, the XMP definition and web header responses before the
web content.
The **marti** document definition is intended to cover the situation
where data files are being transferred and reconciliation is required.
The **marti** document definition is modelled on the [CKAN API metadata](https://docs.ckan.org/en/2.9/api/index.html)
which has been adapted to include additional elements relevant to when
you are exchanging data files. This includes the reconciliation elements
such as number of records and file hash.
As the definition is based on the CKAN API, there are tools to import
a CKAN source into a **marti** document definition and then process the data
through the pipeline as you would for any other data file that had a
**marti** document definition.
## Benefit of CKAN and marti
CKAN is excellent at defining the data source details but it lacks information
for load quality. If you have CKAN deployed in your organisation and wish to
exchange or process the data referenced in CKAN, then there are synergies between
CKAN and marti.
Samples exist on CKAN integration.
## Magda and marti
Another source of data is [Magda](https://magda.io/) which has API metadata
definitions. Magda is more about data federation and as such provides
functionality on finding data sources and describing the contents.
The Magda software is able to generate APIs and data content. This does not
address the needs of a data processing pipeline when reconciliation is required.
If you have Magda data sources then synergies exist between Magda and marti.

@ -1,67 +1,72 @@
# Distribution definition # Distribution definition
The distribution definition describes a single document, though The distribution section defines the files that are grouped
some documents may expand to multiple documents if they are together by association. This association is not defined but can
compressed with a utility such as WinZIP or 7ZIP include different formats of the same data or a common batch extract
such as end of day.
Some files may expand to multiple files if they are
compressed with a utility such as WinZIP or 7ZIP. In the situation
where a ZIP file expands to multiple documents, then the expectation is
that the ZIP file contains a **marti** document describing its contents.
The elements in the distribution section are:
* Title * Title
* Document name - Commonly being absolute or relative file name. * Document name - Commonly being absolute or relative file name.
This value could also be a URL address or network path This value could also be a URL address or network path
* Issued date - When the document was made available. The date can include time * Issued date - When the document was made available. The date can include time
* Modified - When the document was created or modified. This is the date and time * Modified - When the document was created or modified. This is the date and time
* Size of document - The document size in bytes * Size of file - The file size in bytes
* Hash of document - The hash of the document, which can be blank especially for large documents * Hash of file - The hash of the file, which can be blank especially for large files
* Hash algorithm * Hash algorithm
The following are optional in the distribution section. The following are optional in the distribution section.
* Identifier * Identifier
* Description * Description
* Download URL * Download URL
* Version - Document version. The same document coudl be updated or this might denote the next version * Version - File version. The same file could be updated or this might denote the next version
of a regular report. For example a daily extract will have the version number incremented of a regular report. For example a daily extract will have the version number incremented
every day and provide a new URL. The previous document can be retained. every day and provide a new URL. The previous file can be retained.
* Format - if not specified then the consumer will in all likelihood use the document extension / mime type * Format - if not specified then the consumer will in all likelihood use the file extension / mime type
* Media Type * Media Type
* Expiry Date - The date and time that this document expires and can be removed from the download URL * Expiry Date - The date and time that this file expires and can be removed from the download URL
location. This is not the document retention period as might be required for archiving. location. This is not the file retention period as might be required for archiving.
* Described By - A link to the metadata describing this document data and format * Described By - A link to the metadata describing this file data and format
* Compression - Type of compression used if any * Compression - Type of compression used if any
* Encryption - Type of encryption used if any * Encryption - Type of encryption used if any
## Compression ## Compression
Documents can be compressed using a utility. A single compressed document can contain Files can be compressed using a utility. A single compressed file can contain
multiple documents. The Marti definition document applies to the compressed document multiple files. The **marti** definition document applies to the compressed file
and not to the contents, which could be multiple documents. and not to the contents, which could be multiple files.
In the case of a compressed document, there should be a Marti definition document in the In the case of a compressed file, there should be a **marti** definition document in the
compressed document to match the data document. That is the number of the records in a compressed file.
compressed document should always be an even number.
Compression of documents always occurs before encryption. Compression of files always occurs before encryption.
### Marti definition for Compressed Document ### Marti definition for Compressed File
For a compressed document that is not encrypted, the distribution definition will be: For a compressed file that is not encrypted, the distribution definition will be:
* Title - The compressed document title which could be a group name * Title - The compressed file title which could be a group name
* Document name - Commonly being absolute or relative file name. * Document name - Commonly being absolute or relative file name.
This value could also be a URL address or network path This value could also be a URL address or network path
* Issued date - When the compressed document was made available. * Issued date - When the compressed file was made available.
* Modified - When the compressed document was created or modified. This is the data and time * Modified - When the compressed file was created or modified. This is the date and time
and is not the modified date of the document in the compressed document. and is not the modified date of the file in the compressed file.
* Size of document - The compressed document size in bytes * Size of file - The compressed file size in bytes
* Hash of document - The hash of the compressed document, which can be * Hash of file - The hash of the compressed file, which can be
blank especially for large documents blank especially for large files
* Hash algorithm * Hash algorithm
The reason for this approach is it allows a generic tool to be deployed to The reason for this approach is it allows a generic tool to be deployed to
check the validity of the contents without unpacking the received /fetched check the validity of the contents without unpacking the received /fetched
document. That is you can perform load quality pipeline processing. file. That is you can perform load quality pipeline processing.
## Encryption ## Encryption
@ -72,22 +77,22 @@ provide encryption within the tool execution.
If the compression is TAR or GZIP then you may consider applying a GPG If the compression is TAR or GZIP then you may consider applying a GPG
or other encryption algorithm to the compressed file. or other encryption algorithm to the compressed file.
* Title - The encrypted document title * Title - The encrypted file title
* Document name - Commonly being absolute or relative file name. * Document name - Commonly being absolute or relative file name.
This value could also be a URL address or network path This value could also be a URL address or network path
* Issued date - When the **encrypted** document was made available. * Issued date - When the **encrypted** file was made available.
* Modified - When the **encrypted** document was created or modified. * Modified - When the **encrypted** file was created or modified.
This is the date and time and is not the modified date of the encrypted document. This is the date and time and is not the modified date of the encrypted file.
* Size of document - The **decrypted** document size in bytes * Size of file - The **decrypted** file size in bytes
* Hash of document - The hash of the **decrypted** document, which can be * Hash of file - The hash of the **decrypted** file, which can be
blank especially for large documents blank especially for large files
* Hash algorithm * Hash algorithm
The rationale for using the decrypted document attributes is that an encrypted The rationale for using the decrypted file attributes is that an encrypted
document is unlikely to be able to be modified without knowing encryption keys. file is unlikely to be able to be modified without knowing encryption keys.
Checking the decrypted document attributes is a better check where appropriate. Checking the decrypted file attributes is a better check.
The reason for this approach is it allows a generic tool to be deployed to The reason for this approach is it allows a generic tool to be deployed to
decrypt and check the validity of the received / fetched document without decrypt and check the validity of the received / fetched file without
needing to understand the contents. That is you can perform load quality needing to understand the contents. That is you can perform load quality
pipeline processing. pipeline processing.
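
The size and hash checks described above for compressed and encrypted files can be sketched as one generic routine that never interprets the file contents. This is an illustrative sketch only: the function name `reconcile`, its parameters, the hex-encoded digest, and the default `sha256` algorithm are assumptions, not part of the marti definition. For a compressed file the stream would be the compressed bytes as received; for an encrypted file it would be the bytes after decryption, since the text above says size and hash describe the decrypted file.

```python
import hashlib
import io
from typing import BinaryIO, List


def reconcile(stream: BinaryIO, expected_size: int, expected_hash: str = "",
              algorithm: str = "sha256", chunk_size: int = 1 << 20) -> List[str]:
    """Return a list of mismatches between a byte stream and its expected
    size and hash; an empty list means the stream reconciles."""
    digest = hashlib.new(algorithm)
    size = 0
    # Read in chunks so large files are never held in memory as a whole.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        size += len(chunk)
        digest.update(chunk)

    problems = []
    if size != expected_size:
        problems.append(f"size mismatch: expected {expected_size}, got {size}")
    # The hash may be left blank, especially for large files, so skip it then.
    if expected_hash and digest.hexdigest().lower() != expected_hash.lower():
        problems.append(f"{algorithm} hash mismatch")
    return problems


# Usage with an in-memory payload standing in for a received file.
payload = b"id,amount\n1,10\n"
expected = hashlib.sha256(payload).hexdigest()
print(reconcile(io.BytesIO(payload), len(payload), expected))  # prints []
```

Returning a list of problems rather than raising on the first mismatch lets a load quality pipeline report every discrepancy for a file in one pass.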

@ -1,5 +1,5 @@
# Contents # Contents
This is the parent directory for **mati** tools written in various languages. This is the parent directory for **marti** tools written in various languages.
Please browse the folder for the programming language of interest. Please browse the folder for the programming language of interest.

@ -0,0 +1,3 @@
# Place maker

@ -0,0 +1,3 @@
# Place maker

@ -0,0 +1,3 @@
# Place maker

@ -0,0 +1,3 @@
# Place maker

@ -0,0 +1,3 @@
# Place maker

@ -0,0 +1,3 @@
# Place maker