From d873cf3abb8c66b2dc4a17417d26f50662332ccf Mon Sep 17 00:00:00 2001 From: meerkat Date: Sun, 10 Oct 2021 13:07:46 +1100 Subject: [PATCH] Documentation update WIP --- README.md | 56 ++++++++++++++---------- docs/README.md | 3 ++ docs/comparison.md | 40 ++++++++++++++++- docs/distribution.md | 85 ++++++++++++++++++++----------------- source/README.md | 2 +- source/docker/README.md | 3 ++ source/dotnet/README.md | 3 ++ source/golang/README.md | 3 ++ source/java/README.md | 3 ++ source/powershell/README.md | 3 ++ source/python/README.md | 3 ++ 11 files changed, 140 insertions(+), 64 deletions(-) create mode 100644 source/docker/README.md create mode 100644 source/dotnet/README.md create mode 100644 source/golang/README.md create mode 100644 source/java/README.md create mode 100644 source/powershell/README.md create mode 100644 source/python/README.md diff --git a/README.md b/README.md index e6020f3..681472e 100644 --- a/README.md +++ b/README.md @@ -2,33 +2,46 @@ **marti** stands for metadata reconcilation for transfer information. -The objective is the provide transfer information for high volume data such as -in files. The document (files) can be transferred via HTTPS, SFTP, message queue, +The objective is to provide transfer information for high volume data such as +in files. The files can be transferred via HTTPS, SFTP, message queue, network share or other. The transfer information being described here does not need to arrive via the same channel and could be received via email or even synchronous / asynchronous API. The transfer information does not dictate or determine how the data is formatted. -The transfer information can provide details on the document format, but in itself -it does not understand the data fomrat. +The transfer information can provide details on the file format, but in itself +it does not understand the data format. + +**Note**: The terms file and document are intended to be interchangeable +throughout this documentation. 
**marti** is intended to provide minimum basic information on the transfer with -ability to include additional optional information. The metadata reconcilation -transfer document being decscribed here wil be referred to as the [Marti](Marti.md) -document throughout this documentation. +ability to include optional information. The metadata reconciliation +transfer document being described here will be referred to as the [marti document](Marti.md) +throughout this documentation. -The information is supplied as a separate document which could be another file +The transfer information is supplied as a separate document which could be another file or supplied via API by the publisher notifying the consumer(s). ## Tools and Scenarios -Tools and code snippets are provided to generate the information and then -assist in reconcile the document contents once received. Refer to the -programming folders for more details or [Tools](tools.md) for more general +Tools and code snippets are provided to generate the transfer information and then +assist in reconciling the document contents once received. Refer to the +[source programming folders](source/) for more details or [Tools](tools.md) for more general information +[!div class="op_single_selector"] +- [Java](source/java/README.md) +- [golang](source/golang/README.md) +- [python](source/python/README.md) +- [powershell](source/powershell/README.md) +- [docker](source/docker/README.md) + ## Transfer information +The information in the **marti** document is summarised below. 
For more detailed +information see [marti definition](marti.md) + ### Mandatory information The mandatory information is: @@ -37,10 +50,9 @@ The mandatory information is: * Unique identifier * Distribution list - See Distribution section summary below or detailed document [Distribution](docs/distribution.md) - ### Optional information -The option information is: +The optional information is: * Description * Modified @@ -59,18 +71,20 @@ The option information is: ### Information extension -The information supplied can be extended by agreeing parties and there +The information supplied can be extended by party agreement and there are place holders in the defintion. ### Distribution -The distribution section can be repeated, but at least one must be included. -If the distribution is repeated it will comonly be for definiting -multiple formats of the same data. +The distribution section is intended to allow multiple data files to be +grouped together. The distribution section can be repeated, but at least +one must be included. If the distribution is repeated it will commonly +be for defining multiple formats of the same data or batching of +different data together from the same extract process. * Title * Unique identifier -* Document name - If no download URL, then this will be the document name +* Document name * Issued date - When the document was made available. The date can include time * Modified - When the document was created or modified. This is the data and time * Size of document - The document size in bytes @@ -79,8 +93,8 @@ multiple formats of the same data. ### Distribution optional -The following are some of the optional items in the distribution section. See [Distribution](dstribution.md) -for more items and details +The following are some of the optional items in the distribution section. 
See [Distribution](docs/distribution.md) +for more details * Description * Download URL @@ -88,5 +102,3 @@ for more items and details * Format * Compression * Encryption - - diff --git a/docs/README.md b/docs/README.md index 55ba573..ff54607 100644 --- a/docs/README.md +++ b/docs/README.md @@ -20,3 +20,6 @@ can adjust if they resonate with your circumstances, 5. [Load quality metrics support](quality.md) 6. [Comparison of marti definition](comparison.md) 7. [References](references.md) + + +[!INCLUDE [marti High Level Definition](../marti.md)] diff --git a/docs/comparison.md b/docs/comparison.md index 0b35ae8..f90489c 100644 --- a/docs/comparison.md +++ b/docs/comparison.md @@ -1,4 +1,42 @@ -# Comparison of marti definition +# Comparison of marti document definition +The use of metadata definitions is not unique and examples +exist in many different situations. Some are standard and open +while others are closed. +Some open standards are EXIF data for pictures, SQL DDL definitions +for databases, the XMP definition and web header responses before the +web content. +The **marti** document definition is intended to cover the situation +where data files are being transferred and reconciliation is required. + +The **marti** document definition is modelled on the [CKAN API metadata](https://docs.ckan.org/en/2.9/api/index.html) +which has been adapted to include additional elements relevant to when +you are exchanging data files. This includes the reconciliation elements +such as number of records and file hash. + +As the definition is based on the CKAN API, there are tools to import +a CKAN source into a **marti** document definition and then process the data +through the pipeline as you would for any other data file that had a +**marti** document definition. + +## Benefit of CKAN and marti + +The CKAN is excellent at defining the data source details but it lacks information +for load quality. 
If you have CKAN deployed in your organisation and wish to +exchange or process the data referenced in CKAN, then there are synergies between +CKAN and marti. + +Samples exist on CKAN integration. + +## Magda and marti + +Another source of data is [Magda](https://magda.io/) which has API metadata +definitions. Magda is more about data federation and as such provides +functionality on finding data sources and describing the contents. + +The Magda software is able to generate APIs and data content. This does not +address the needs of a data processing pipeline when reconciliation is required. + +If you have Magda data sources then synergies exist between Magda and marti. diff --git a/docs/distribution.md b/docs/distribution.md index dd627b0..eec2284 100644 --- a/docs/distribution.md +++ b/docs/distribution.md @@ -1,67 +1,72 @@ # Distribution definition -The distrubution definition describes a single document, though -some documents may expand to multiple documents if they are -compressed with a utility such as WinZIP or 7ZIP +The distribution section defines the files that are grouped +together by association. This association is not defined but can +include different formats of the same data or a common batch extract +such as end of day. +Some files may expand to multiple files if they are +compressed with a utility such as WinZIP or 7ZIP. In the situation +where a ZIP file expands to multiple documents, then the expectation is +that the ZIP file contains a **marti** document describing its contents. + +The elements in the distribution section are: * Title * Document name - Commonly being absolute or relative file name. This value could also be an URL address or network path * Issued date - When the document was made available. The date can include time * Modified - When the document was created or modified. 
This is the data and time -* Size of document - The document size in bytes -* Hash of document - The hash of the document, which can be blank especially for large documents +* Size of file - The file size in bytes +* Hash of file - The hash of the file, which can be blank especially for large files * Hash algorithm - The following are optional in the distribution section. * Identifier * Description * Download URL -* Version - Document version. The same document coudl be updated or this might denote the next version +* Version - File version. The same file could be updated or this might denote the next version of a regular report. For example a daily extract will have the version number incremented - every day and provide a new URL. The previous document can be retained. -* Format - if not specified then the consumer will in all likelihood use the document extension / mime type + every day and provide a new URL. The previous file can be retained. +* Format - if not specified then the consumer will in all likelihood use the file extension / mime type * Media Type -* Expiry Date - The date and time that this document expires and can be removed from the download URL - location. This is not the document retention period as might be required for archiving. -* Described By - A link to the metadata describing this document data and format +* Expiry Date - The date and time that this file expires and can be removed from the download URL + location. This is not the file retention period as might be required for archiving. +* Described By - A link to the metadata describing this file data and format * Compression - Type of compression used if any * Encryption - Type of encryption used if any ## Compression -Documents can be compressed using a utility. A single compressed document can contain -multiple documents. The Marti definition document applies to the compressed document -and not to the contents, which could be multiple documents. 
+Files can be compressed using a utility. A single compressed file can contain +multiple files. The **marti** definition document applies to the compressed file +and not to the contents, which could be multiple files. -In the case of a compressed document, there should be a Marti definition document in the -compressed document to match the data document. That is the number of the records in a -compressed document should always be an even number. +In the case of a compressed file, there should be a **marti** definition document in the +compressed file. -Compression of documents always occur before encryption. +Compression of files always occurs before encryption. -### Marti definition for Compressed Document +### Marti definition for Compressed File -For a compressed document that is not encrypted, the distribution definition will be: +For a compressed file that is not encrypted, the distribution definition will be: -* Title - The compressed document title which could be a group name +* Title - The compressed file title which could be a group name * Document name - Commonly being absolute or relative file name. This value could also be an URL address or network path -* Issued date - When the compressed document was made available. -* Modified - When the compressed document was created or modified. This is the data and time - and is not the modified date of the document in the compressed document. -* Size of document - The compressed document size in bytes -* Hash of document - The hash of the compressed document, which can be - blank especially for large documents +* Issued date - When the compressed file was made available. +* Modified - When the compressed file was created or modified. This is the date and time + and is not the modified date of the file in the compressed file. 
+* Size of file - The compressed file size in bytes +* Hash of file - The hash of the compressed file, which can be + blank especially for large files * Hash algorithm The reason for this approach is it allows a generic tool to be deployed to check the validity of the contents without unpacking the received /fetched -document. That is you can perform load quality pipeline processing. +file. That is you can perform load quality pipeline processing. ## Encryption @@ -72,22 +77,22 @@ provide encryption within the tool execution. If the compression is TAR or GZIP then you may consider applying a GPG or other encryption algorithm to the compressed file. -* Title - The encrypted document title +* Title - The encrypted file title * Document name - Commonly being absolute or relative file name. This value could also be an URL address or network path -* Issued date - When the **encrypted** document was made available. -* Modified - When the **encrypted** document was created or modified. - This is the data and time and is not the modified date of the encrypted document. -* Size of document - The **decrypted** document size in bytes -* Hash of document - The hash of the **decrypted** document, which can be - blank especially for large documents +* Issued date - When the **encrypted** file was made available. +* Modified - When the **encrypted** file was created or modified. + This is the date and time and is not the modified date of the encrypted file. +* Size of file - The **decrypted** file size in bytes +* Hash of file - The hash of the **decrypted** file, which can be + blank especially for large files * Hash algorithm -The rational for using the decrypted document attributes is that an ecrypted -document is unlikely to be able to be modified without knowing encryption keys. -Checking the decrypted document attributes is a better check wheer appropriate. 
+The rationale for using the decrypted file attributes is that an encrypted +file is unlikely to be able to be modified without knowing encryption keys. +Checking the decrypted file attributes is a better check. The reason for this approach is it allows a generic tool to be deployed to -decrypt and check the validity of the received / fetched document without +decrypt and check the validity of the received / fetched file without needing to understand the contents. That is you can perform load quality pipeline processing. diff --git a/source/README.md b/source/README.md index 654b98c..50645e1 100644 --- a/source/README.md +++ b/source/README.md @@ -1,5 +1,5 @@ # Contents -This is the parent directory for **mati** tools written in various languages. +This is the parent directory for **marti** tools written in various languages. Please browse the folder for the programming language of interest. diff --git a/source/docker/README.md b/source/docker/README.md new file mode 100644 index 0000000..4536e7d --- /dev/null +++ b/source/docker/README.md @@ -0,0 +1,3 @@ +# Place maker + + diff --git a/source/dotnet/README.md b/source/dotnet/README.md new file mode 100644 index 0000000..4536e7d --- /dev/null +++ b/source/dotnet/README.md @@ -0,0 +1,3 @@ +# Place maker + + diff --git a/source/golang/README.md b/source/golang/README.md new file mode 100644 index 0000000..4536e7d --- /dev/null +++ b/source/golang/README.md @@ -0,0 +1,3 @@ +# Place maker + + diff --git a/source/java/README.md b/source/java/README.md new file mode 100644 index 0000000..4536e7d --- /dev/null +++ b/source/java/README.md @@ -0,0 +1,3 @@ +# Place maker + + diff --git a/source/powershell/README.md b/source/powershell/README.md new file mode 100644 index 0000000..4536e7d --- /dev/null +++ b/source/powershell/README.md @@ -0,0 +1,3 @@ +# Place maker + + diff --git a/source/python/README.md b/source/python/README.md new file mode 100644 index 0000000..4536e7d --- /dev/null +++ b/source/python/README.md @@ -0,0 
+1,3 @@ +# Place maker + +