Documentatiion update WIP
parent dfdfd716dd
commit d873cf3abb
56
README.md
|
|
@ -2,33 +2,46 @@
|
||||||
|
|
||||||
**marti** stands for metadata reconciliation for transfer information.
|
**marti** stands for metadata reconciliation for transfer information.
|
||||||
|
|
||||||
The objective is the provide transfer information for high volume data such as
|
The objective is to provide transfer information for high volume data such as
|
||||||
in files. The document (files) can be transferred via HTTPS, SFTP, message queue,
|
in files. The files can be transferred via HTTPS, SFTP, message queue,
|
||||||
network share or other. The transfer information being described here does not
|
network share or other. The transfer information being described here does not
|
||||||
need to arrive via the same channel and could be received via email or
|
need to arrive via the same channel and could be received via email or
|
||||||
even synchronous / asynchronous API. The transfer information does not dictate or
|
even synchronous / asynchronous API. The transfer information does not dictate or
|
||||||
determine how the data is formatted.
|
determine how the data is formatted.
|
||||||
|
|
||||||
The transfer information can provide details on the document format, but in itself
|
The transfer information can provide details on the file format, but in itself
|
||||||
it does not understand the data fomrat.
|
it does not understand the data format.
|
||||||
|
|
||||||
|
**Note**: The terms file and document are intended to be interchangeable
|
||||||
|
throughout this documentation.
|
||||||
|
|
||||||
**marti** is intended to provide minimum basic information on the transfer with
|
**marti** is intended to provide minimum basic information on the transfer with
|
||||||
ability to include additional optional information. The metadata reconciliation
|
ability to include optional information. The metadata reconciliation
|
||||||
transfer document being described here will be referred to as the [Marti](Marti.md)
|
transfer document being described here will be referred to as the [marti document](Marti.md)
|
||||||
document throughout this documentation.
|
throughout this documentation.
|
||||||
|
|
||||||
The information is supplied as a separate document which could be another file
|
The transfer information is supplied as a separate document which could be another file
|
||||||
or supplied via API by the publisher notifying the consumer(s).
|
or supplied via API by the publisher notifying the consumer(s).
|
||||||
|
|
||||||
## Tools and Scenarios
|
## Tools and Scenarios
|
||||||
|
|
||||||
Tools and code snippets are provided to generate the information and then
|
Tools and code snippets are provided to generate the transfer information and then
|
||||||
assist in reconcile the document contents once received. Refer to the
|
assist in reconciling the document contents once received. Refer to the
|
||||||
programming folders for more details or [Tools](tools.md) for more general
|
[source programming folders](source/) for more details or [Tools](tools.md) for more general
|
||||||
information
|
information
|
||||||
|
|
||||||
|
[!div class="op_single_selector"]
|
||||||
|
- [Java](source/java/README.md)
|
||||||
|
- [golang](source/golang/README.md)
|
||||||
|
- [python](source/python/README.md)
|
||||||
|
- [powershell](source/powershell/README.md)
|
||||||
|
- [docker](source/docker/README.md)
|
||||||
|
|
||||||
## Transfer information
|
## Transfer information
|
||||||
|
|
||||||
|
The information in the **marti** document is summarised below. For more detailed
|
||||||
|
information see [marti definition](marti.md)
|
||||||
|
|
||||||
### Mandatory information
|
### Mandatory information
|
||||||
|
|
||||||
The mandatory information is:
|
The mandatory information is:
|
||||||
|
|
@ -37,10 +50,9 @@ The mandatory information is:
|
||||||
* Unique identifier
|
* Unique identifier
|
||||||
* Distribution list - See Distribution section summary below or detailed document [Distribution](docs/distribution.md)
|
* Distribution list - See Distribution section summary below or detailed document [Distribution](docs/distribution.md)
|
||||||
|
|
||||||
|
|
||||||
### Optional information
|
### Optional information
|
||||||
|
|
||||||
The option information is:
|
The optional information is:
|
||||||
|
|
||||||
* Description
|
* Description
|
||||||
* Modified
|
* Modified
|
||||||
|
|
@ -59,18 +71,20 @@ The option information is:
|
||||||
|
|
||||||
### Information extension
|
### Information extension
|
||||||
|
|
||||||
The information supplied can be extended by agreeing parties and there
|
The information supplied can be extended by party agreement and there
|
||||||
are placeholders in the definition.
|
are placeholders in the definition.
|
||||||
|
|
||||||
### Distribution
|
### Distribution
|
||||||
|
|
||||||
The distribution section can be repeated, but at least one must be included.
|
The distribution section is intended to allow multiple data files to be
|
||||||
If the distribution is repeated it will comonly be for definiting
|
grouped together. The distribution section can be repeated, but at least
|
||||||
multiple formats of the same data.
|
one must be included. If the distribution is repeated it will commonly
|
||||||
|
be for defining multiple formats of the same data or batching of
|
||||||
|
different data together from the same extract process.
|
||||||
|
|
||||||
* Title
|
* Title
|
||||||
* Unique identifier
|
* Unique identifier
|
||||||
* Document name - If no download URL, then this will be the document name
|
* Document name
|
||||||
* Issued date - When the document was made available. The date can include time
|
* Issued date - When the document was made available. The date can include time
|
||||||
* Modified - When the document was created or modified. This is the date and time
|
* Modified - When the document was created or modified. This is the date and time
|
||||||
* Size of document - The document size in bytes
|
* Size of document - The document size in bytes
|
||||||
|
|
@ -79,8 +93,8 @@ multiple formats of the same data.
|
||||||
|
|
||||||
### Distribution optional
|
### Distribution optional
|
||||||
|
|
||||||
The following are some of the optional items in the distribution section. See [Distribution](dstribution.md)
|
The following are some of the optional items in the distribution section. See [Distribution](docs/distribution.md)
|
||||||
for more items and details
|
for more details
|
||||||
|
|
||||||
* Description
|
* Description
|
||||||
* Download URL
|
* Download URL
|
||||||
|
|
@ -88,5 +102,3 @@ for more items and details
|
||||||
* Format
|
* Format
|
||||||
* Compression
|
* Compression
|
||||||
* Encryption
|
* Encryption
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
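The README summary in the diff above lists the items a **marti** document carries. As a rough illustration only, the sketch below models the items visible in this diff as Python dataclasses. All class and field names are assumptions made for this example and are not taken from the marti definition. Parts of the mandatory and optional lists fall outside the diff hunks, so only the visible items are modelled.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Distribution:
    # Items listed under "Distribution" in the README summary.
    title: str
    identifier: str
    document_name: str
    issued: str        # when the document was made available (date, optionally with time)
    modified: str      # when the document was created or modified
    size_bytes: int    # document size in bytes
    # Some of the optional items; None means "not supplied".
    description: Optional[str] = None
    download_url: Optional[str] = None
    format: Optional[str] = None
    compression: Optional[str] = None
    encryption: Optional[str] = None


@dataclass
class MartiDocument:
    # Hypothetical container for the items visible in the diff.
    identifier: str
    distribution: List[Distribution]
    description: Optional[str] = None
    modified: Optional[str] = None

    def __post_init__(self):
        # The README states that at least one distribution must be included.
        if not self.distribution:
            raise ValueError("at least one distribution entry is required")


def _prune(value):
    """Recursively drop optional items that were not supplied."""
    if isinstance(value, dict):
        return {k: _prune(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [_prune(v) for v in value]
    return value


def to_json(doc: MartiDocument) -> str:
    """Serialise the document, omitting unset optional items."""
    return json.dumps(_prune(asdict(doc)), indent=2)
```

A publisher could build one `Distribution` per file in a batch and serialise the result with `to_json` before sending it alongside, or separately from, the data files.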
@ -20,3 +20,6 @@ can adjust if they resonate with your circumstances,
|
||||||
5. [Load quality metrics support](quality.md)
|
5. [Load quality metrics support](quality.md)
|
||||||
6. [Comparison of marti definition](comparison.md)
|
6. [Comparison of marti definition](comparison.md)
|
||||||
7. [References](references.md)
|
7. [References](references.md)
|
||||||
|
|
||||||
|
|
||||||
|
[!INCLUDE [marti High Level Definition](../marti.md)]
|
||||||
|
|
|
||||||
|
|
@ -1,4 +1,42 @@
|
||||||
# Comparison of marti definition
|
# Comparison of marti document definition
|
||||||
|
|
||||||
|
The use of metadata definitions is not unique and examples
|
||||||
|
exist in many different situations. Some are standard and open
|
||||||
|
while others are closed.
|
||||||
|
|
||||||
|
Some open standards are EXIF data for pictures, SQL DDL definitions
|
||||||
|
for databases, the XMP definition and web header responses before the
|
||||||
|
web content.
|
||||||
|
|
||||||
|
The **marti** document definition is intended to cover the situation
|
||||||
|
where data files are being transferred and reconciliation is required.
|
||||||
|
|
||||||
|
The **marti** document definition is modelled on the [CKAN API metadata](https://docs.ckan.org/en/2.9/api/index.html)
|
||||||
|
which has been adapted to include additional elements relevant to when
|
||||||
|
you are exchanging data files. This includes the reconciliation elements
|
||||||
|
such as number of records and file hash.
|
||||||
|
|
||||||
|
As the definition is based on the CKAN API, there are tools to import
|
||||||
|
a CKAN source into a **marti** document definition and then process the data
|
||||||
|
through the pipeline as you would for any other data file that had a
|
||||||
|
**marti** document definition.
|
||||||
|
|
||||||
|
## Benefit of CKAN and marti
|
||||||
|
|
||||||
|
The CKAN is excellent at defining the data source details but it lacks information
|
||||||
|
for load quality. If you have CKAN deployed in your organisation and wish
|
||||||
|
to exchange or process the data referenced in CKAN, then there are synergies between
|
||||||
|
CKAN and marti.
|
||||||
|
|
||||||
|
Samples exist on CKAN integration.
|
||||||
|
|
||||||
|
## Magda and marti
|
||||||
|
|
||||||
|
Another source of data is [Magda](https://magda.io/) which has API metadata
|
||||||
|
definitions. Magda is more about data federation and as such provides
|
||||||
|
functionality on finding data sources and describing the contents.
|
||||||
|
|
||||||
|
The Magda software is able to generate APIs and data content. This does not
|
||||||
|
address the needs of data processing pipeline when reconciliation is required.
|
||||||
|
|
||||||
|
If you have Magda data sources then synergies exist between Magda and marti.
|
||||||
|
|
|
||||||
|
|
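The comparison text above mentions importing a CKAN source into a **marti** document definition. The sketch below shows one way such a mapping might look, assuming a CKAN package dictionary shaped like the result of the CKAN `package_show` action (with `id`, `title`, `notes` and a `resources` list). The output key names are hypothetical and chosen for this example. The tools referred to in the documentation may work differently.

```python
def ckan_package_to_marti(package: dict) -> dict:
    """Map a CKAN package dictionary to a marti-like dictionary (illustrative only)."""
    distribution = []
    for res in package.get("resources", []):
        distribution.append({
            "title": res.get("name") or res.get("id"),
            "identifier": res.get("id"),
            "document_name": res.get("url"),
            "issued": res.get("created"),
            # Fall back to the creation time when no modification time is recorded.
            "modified": res.get("last_modified") or res.get("created"),
            "size": res.get("size"),
            "hash": res.get("hash") or "",
            "format": res.get("format"),
            # CKAN lacks load quality information, so reconciliation
            # elements such as the record count start out empty.
            "record_count": None,
        })
    if not distribution:
        raise ValueError("a marti document needs at least one distribution entry")
    return {
        "identifier": package.get("id"),
        "title": package.get("title"),
        "description": package.get("notes"),
        "distribution": distribution,
    }
```

The empty `record_count` reflects the point made above: CKAN describes the data source well, but the reconciliation elements have to be supplied by whoever produces the data files.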
@ -1,67 +1,72 @@
|
||||||
# Distribution definition
|
# Distribution definition
|
||||||
|
|
||||||
The distribution definition describes a single document, though
|
The distribution section defines the files that are grouped
|
||||||
some documents may expand to multiple documents if they are
|
together by association. This association is not defined but can
|
||||||
compressed with a utility such as WinZIP or 7ZIP
|
include different formats of the same data or a common batch extract
|
||||||
|
such as end of day.
|
||||||
|
|
||||||
|
Some files may expand to multiple files if they are
|
||||||
|
compressed with a utility such as WinZIP or 7ZIP. In the situation
|
||||||
|
where a ZIP file expands to multiple documents, then the expectation is
|
||||||
|
that the ZIP file contains a **marti** document describing its contents.
|
||||||
|
|
||||||
|
The elements in the distribution section are:
|
||||||
|
|
||||||
* Title
|
* Title
|
||||||
* Document name - Commonly being absolute or relative file name.
|
* Document name - Commonly being absolute or relative file name.
|
||||||
This value could also be an URL address or network path
|
This value could also be an URL address or network path
|
||||||
* Issued date - When the document was made available. The date can include time
|
* Issued date - When the document was made available. The date can include time
|
||||||
* Modified - When the document was created or modified. This is the date and time
|
* Modified - When the document was created or modified. This is the date and time
|
||||||
* Size of document - The document size in bytes
|
* Size of file - The file size in bytes
|
||||||
* Hash of document - The hash of the document, which can be blank especially for large documents
|
* Hash of file - The hash of the file, which can be blank especially for large files
|
||||||
* Hash algorithm
|
* Hash algorithm
|
||||||
|
|
||||||
|
|
||||||
The following are optional in the distribution section.
|
The following are optional in the distribution section.
|
||||||
|
|
||||||
* Identifier
|
* Identifier
|
||||||
* Description
|
* Description
|
||||||
* Download URL
|
* Download URL
|
||||||
* Version - Document version. The same document coudl be updated or this might denote the next version
|
* Version - File version. The same file could be updated or this might denote the next version
|
||||||
of a regular report. For example a daily extract will have the version number incremented
|
of a regular report. For example a daily extract will have the version number incremented
|
||||||
every day and provide a new URL. The previous document can be retained.
|
every day and provide a new URL. The previous file can be retained.
|
||||||
* Format - if not specified then the consumer will in all likelihood use the document extension / mime type
|
* Format - if not specified then the consumer will in all likelihood use the file extension / mime type
|
||||||
* Media Type
|
* Media Type
|
||||||
* Expiry Date - The date and time that this document expires and can be removed from the download URL
|
* Expiry Date - The date and time that this file expires and can be removed from the download URL
|
||||||
location. This is not the document retention period as might be required for archiving.
|
location. This is not the file retention period as might be required for archiving.
|
||||||
* Described By - A link to the metadata describing this document data and format
|
* Described By - A link to the metadata describing this file data and format
|
||||||
* Compression - Type of compression used if any
|
* Compression - Type of compression used if any
|
||||||
* Encryption - Type of encryption used if any
|
* Encryption - Type of encryption used if any
|
||||||
|
|
||||||
|
|
||||||
## Compression
|
## Compression
|
||||||
|
|
||||||
Documents can be compressed using a utility. A single compressed document can contain
|
Files can be compressed using a utility. A single compressed file can contain
|
||||||
multiple documents. The Marti definition document applies to the compressed document
|
multiple files. The **marti** definition document applies to the compressed file
|
||||||
and not to the contents, which could be multiple documents.
|
and not to the contents, which could be multiple files.
|
||||||
|
|
||||||
In the case of a compressed document, there should be a Marti definition document in the
|
In the case of a compressed file, there should be a **marti** definition document in the
|
||||||
compressed document to match the data document. That is the number of the records in a
|
compressed file.
|
||||||
compressed document should always be an even number.
|
|
||||||
|
|
||||||
Compression of documents always occurs before encryption.
|
Compression of files always occurs before encryption.
|
||||||
|
|
||||||
### Marti definition for Compressed Document
|
### Marti definition for Compressed File
|
||||||
|
|
||||||
For a compressed document that is not encrypted, the distribution definition will be:
|
For a compressed file that is not encrypted, the distribution definition will be:
|
||||||
|
|
||||||
* Title - The compressed document title which could be a group name
|
* Title - The compressed file title which could be a group name
|
||||||
* Document name - Commonly being absolute or relative file name.
|
* Document name - Commonly being absolute or relative file name.
|
||||||
This value could also be an URL address or network path
|
This value could also be an URL address or network path
|
||||||
* Issued date - When the compressed document was made available.
|
* Issued date - When the compressed file was made available.
|
||||||
* Modified - When the compressed document was created or modified. This is the data and time
|
* Modified - When the compressed file was created or modified. This is the date and time
|
||||||
and is not the modified date of the document in the compressed document.
|
and is not the modified date of the file in the compressed file.
|
||||||
* Size of document - The compressed document size in bytes
|
* Size of file - The compressed file size in bytes
|
||||||
* Hash of document - The hash of the compressed document, which can be
|
* Hash of file - The hash of the compressed file, which can be
|
||||||
blank especially for large documents
|
blank especially for large files
|
||||||
* Hash algorithm
|
* Hash algorithm
|
||||||
|
|
||||||
The reason for this approach is it allows a generic tool to be deployed to
|
The reason for this approach is it allows a generic tool to be deployed to
|
||||||
check the validity of the contents without unpacking the received /fetched
|
check the validity of the contents without unpacking the received /fetched
|
||||||
document. That is you can perform load quality pipeline processing.
|
file. That is you can perform load quality pipeline processing.
|
||||||
|
|
||||||
## Encryption
|
## Encryption
|
||||||
|
|
||||||
|
|
@ -72,22 +77,22 @@ provide encryption within the tool execution.
|
||||||
If the compression is TAR or GZIP then you may consider applying a GPG
|
If the compression is TAR or GZIP then you may consider applying a GPG
|
||||||
or other encryption algorithm to the compressed file.
|
or other encryption algorithm to the compressed file.
|
||||||
|
|
||||||
* Title - The encrypted document title
|
* Title - The encrypted file title
|
||||||
* Document name - Commonly being absolute or relative file name.
|
* Document name - Commonly being absolute or relative file name.
|
||||||
This value could also be an URL address or network path
|
This value could also be an URL address or network path
|
||||||
* Issued date - When the **encrypted** document was made available.
|
* Issued date - When the **encrypted** file was made available.
|
||||||
* Modified - When the **encrypted** document was created or modified.
|
* Modified - When the **encrypted** file was created or modified.
|
||||||
This is the date and time and is not the modified date of the encrypted document.
|
This is the date and time and is not the modified date of the encrypted file.
|
||||||
* Size of document - The **decrypted** document size in bytes
|
* Size of file - The **decrypted** file size in bytes
|
||||||
* Hash of document - The hash of the **decrypted** document, which can be
|
* Hash of file - The hash of the **decrypted** file, which can be
|
||||||
blank especially for large documents
|
blank especially for large files
|
||||||
* Hash algorithm
|
* Hash algorithm
|
||||||
|
|
||||||
The rationale for using the decrypted document attributes is that an encrypted
|
The rationale for using the decrypted file attributes is that an encrypted
|
||||||
document is unlikely to be able to be modified without knowing encryption keys.
|
file is unlikely to be able to be modified without knowing encryption keys.
|
||||||
Checking the decrypted document attributes is a better check wheer appropriate.
|
Checking the decrypted file attributes is a better check.
|
||||||
|
|
||||||
The reason for this approach is it allows a generic tool to be deployed to
|
The reason for this approach is it allows a generic tool to be deployed to
|
||||||
decrypt and check the validity of the received / fetched document without
|
decrypt and check the validity of the received / fetched file without
|
||||||
needing to understand the contents. That is you can perform load quality
|
needing to understand the contents. That is you can perform load quality
|
||||||
pipeline processing.
|
pipeline processing.
|
||||||
|
|
|
||||||
|
|
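The distribution definition above describes a size, a hash and a hash algorithm that a generic tool can check without understanding the file contents. The sketch below is a minimal example of such a check, assuming a distribution entry held as a dictionary with the hypothetical keys `size`, `hash` and `hash_algorithm`. For a compressed file it would be run against the compressed file itself. For an encrypted file, per the text above, it would be run against the decrypted file.

```python
import hashlib
import os


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that large files are not loaded into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reconcile(path: str, entry: dict) -> list:
    """Return a list of problems found; an empty list means the file reconciles."""
    problems = []

    expected_size = entry.get("size")
    actual_size = os.path.getsize(path)
    if expected_size is not None and actual_size != expected_size:
        problems.append(f"size mismatch: expected {expected_size}, got {actual_size}")

    # The hash may be blank, especially for large files; skip the check then.
    expected_hash = entry.get("hash")
    if expected_hash:
        actual_hash = file_digest(path, entry.get("hash_algorithm", "sha256"))
        if actual_hash.lower() != expected_hash.lower():
            problems.append("hash mismatch")

    return problems
```

Reading in chunks keeps memory use flat for high volume files, and returning a list of problems rather than raising lets a pipeline report every mismatch for a batch in one pass.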
@ -1,5 +1,5 @@
|
||||||
# Contents
|
# Contents
|
||||||
|
|
||||||
This is the parent directory for **mati** tools written in various languages.
|
This is the parent directory for **marti** tools written in various languages.
|
||||||
|
|
||||||
Please browse the folder for the programming language of interest.
|
Please browse the folder for the programming language of interest.
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,3 @@
|
||||||
|
# Place maker
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -0,0 +1,3 @@
|
||||||
|
# Place maker
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -0,0 +1,3 @@
|
||||||
|
# Place maker
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -0,0 +1,3 @@
|
||||||
|
# Place maker
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -0,0 +1,3 @@
|
||||||
|
# Place maker
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -0,0 +1,3 @@
|
||||||
|
# Place maker
|
||||||
|
|
||||||
|
|
||||||