Documentation update WIP
parent
dfdfd716dd
commit
d873cf3abb
56
README.md
|
|
@@ -2,33 +2,46 @@
|
|||
|
||||
**marti** stands for metadata reconciliation for transfer information.
|
||||
|
||||
The objective is the provide transfer information for high volume data such as
|
||||
in files. The document (files) can be transferred via HTTPS, SFTP, message queue,
|
||||
The objective is to provide transfer information for high volume data such as
|
||||
in files. The files can be transferred via HTTPS, SFTP, message queue,
|
||||
network share or other. The transfer information being described here does not
|
||||
need to arrive via the same channel and could be received via email or
|
||||
even synchronous / asynchronous API. The transfer information does not dictate or
|
||||
determine how the data is formatted.
|
||||
|
||||
The transfer information can provide details on the document format, but in itself
|
||||
it does not understand the data fomrat.
|
||||
The transfer information can provide details on the file format, but in itself
|
||||
it does not understand the data format.
|
||||
|
||||
**Note**: The terms file and document are intended to be interchangeable
|
||||
throughout this documentation.
|
||||
|
||||
**marti** is intended to provide minimum basic information on the transfer with
|
||||
ability to include additional optional information. The metadata reconcilation
|
||||
transfer document being decscribed here wil be referred to as the [Marti](Marti.md)
|
||||
document throughout this documentation.
|
||||
ability to include optional information. The metadata reconciliation
|
||||
transfer document being described here will be referred to as the [marti document](Marti.md)
|
||||
throughout this documentation.
|
||||
|
||||
The information is supplied as a separate document which could be another file
|
||||
The transfer information is supplied as a separate document which could be another file
|
||||
or supplied via API by the publisher notifying the consumer(s).
|
||||
|
||||
## Tools and Scenarios
|
||||
|
||||
Tools and code snippets are provided to generate the information and then
|
||||
assist in reconcile the document contents once received. Refer to the
|
||||
programming folders for more details or [Tools](tools.md) for more general
|
||||
Tools and code snippets are provided to generate the transfer information and then
|
||||
assist in reconciling the document contents once received. Refer to the
|
||||
[source programming folders](source/) for more details or [Tools](tools.md) for more general
|
||||
information
|
||||
|
||||
[!div class="op_single_selector"]
|
||||
- [Java](source/java/README.md)
|
||||
- [golang](source/golang/README.md)
|
||||
- [python](source/python/README.md)
|
||||
- [powershell](source/powershell/README.md)
|
||||
- [docker](source/docker/README.md)
|
||||
|
||||
## Transfer information
|
||||
|
||||
The information in the **marti** document is summarised below. For more detailed
|
||||
information see [marti definition](marti.md)
|
||||
|
||||
### Mandatory information
|
||||
|
||||
The mandatory information is:
|
||||
|
|
@@ -37,10 +50,9 @@ The mandatory information is:
|
|||
* Unique identifier
|
||||
* Distribution list - See Distribution section summary below or detailed document [Distribution](docs/distribution.md)
|
||||
|
||||
|
||||
### Optional information
|
||||
|
||||
The option information is:
|
||||
The optional information is:
|
||||
|
||||
* Description
|
||||
* Modified
|
||||
|
|
@@ -59,18 +71,20 @@ The option information is:
|
|||
|
||||
### Information extension
|
||||
|
||||
The information supplied can be extended by agreeing parties and there
|
||||
The information supplied can be extended by party agreement and there
|
||||
are placeholders in the definition.
|
||||
|
||||
### Distribution
|
||||
|
||||
The distribution section can be repeated, but at least one must be included.
|
||||
If the distribution is repeated it will comonly be for definiting
|
||||
multiple formats of the same data.
|
||||
The distribution section is intended to allow multiple data files to be
|
||||
grouped together. The distribution section can be repeated, but at least
|
||||
one must be included. If the distribution is repeated it will commonly
|
||||
be for defining multiple formats of the same data or batching of
|
||||
different data together from the same extract process.
|
||||
|
||||
* Title
|
||||
* Unique identifier
|
||||
* Document name - If no download URL, then this will be the document name
|
||||
* Document name
|
||||
* Issued date - When the document was made available. The date can include time
|
||||
* Modified - When the document was created or modified. This is the date and time
|
||||
* Size of document - The document size in bytes
|
||||
|
|
@@ -79,8 +93,8 @@ multiple formats of the same data.
|
|||
|
||||
### Distribution optional
|
||||
|
||||
The following are some of the optional items in the distribution section. See [Distribution](dstribution.md)
|
||||
for more items and details
|
||||
The following are some of the optional items in the distribution section. See [Distribution](docs/distribution.md)
|
||||
for more details
|
||||
|
||||
* Description
|
||||
* Download URL
|
||||
|
|
@@ -88,5 +102,3 @@ for more items and details
|
|||
* Format
|
||||
* Compression
|
||||
* Encryption
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@@ -20,3 +20,6 @@ can adjust if they resonate with your circumstances,
|
|||
5. [Load quality metrics support](quality.md)
|
||||
6. [Comparison of marti definition](comparison.md)
|
||||
7. [References](references.md)
|
||||
|
||||
|
||||
[!INCLUDE [marti High Level Definition](../marti.md)]
|
||||
|
|
|
|||
|
|
@@ -1,4 +1,42 @@
|
|||
# Comparison of marti definition
|
||||
# Comparison of marti document definition
|
||||
|
||||
The use of metadata definitions is not unique and examples
|
||||
exist in many different situations. Some are standard and open
|
||||
while others are closed.
|
||||
|
||||
Some open standards are EXIF data for pictures, SQL DDL definitions
|
||||
for databases, the XMP definition and web header responses before the
|
||||
web content.
|
||||
|
||||
The **marti** document definition is intended to cover the situation
|
||||
where data files are being transferred and reconciliation is required.
|
||||
|
||||
The **marti** document definition is modelled on the [CKAN API metadata](https://docs.ckan.org/en/2.9/api/index.html)
|
||||
which has been adapted to include additional elements relevant to when
|
||||
you are exchanging data files. This includes the reconciliation elements
|
||||
such as number of records and file hash.
|
||||
|
||||
As the definition is based on the CKAN API, there are tools to import
|
||||
a CKAN source into a **marti** document definition and then process the data
|
||||
through the pipeline as you would for any other data file that had a
|
||||
**marti** document definition.
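
As an illustration of how a CKAN resource might map onto a distribution entry, the sketch below converts a CKAN resource dictionary into a plain Python dictionary. It is not the project's import tool. The input keys (`name`, `id`, `url`, `format`, `mimetype`, `size`, `hash`, `created`, `last_modified`) follow the CKAN resource schema; the output key names and the function name are assumptions made for this example and are not taken from the **marti** definition.

```python
def ckan_resource_to_distribution(resource):
    """Map a CKAN resource dict to a distribution entry.

    The output key names are hypothetical; adjust them to the
    agreed marti document definition.
    """
    return {
        "title": resource.get("name") or "",
        "identifier": resource.get("id"),
        "description": resource.get("description") or "",
        "downloadURL": resource.get("url"),
        "format": resource.get("format"),
        "mediaType": resource.get("mimetype"),
        "issued": resource.get("created"),
        # Fall back to the creation date when CKAN has no modified date.
        "modified": resource.get("last_modified") or resource.get("created"),
        "byteSize": resource.get("size"),
        # CKAN often leaves the hash empty; keep it blank in that case.
        "hash": resource.get("hash") or "",
    }
```

Reconciliation elements such as the number of records are not part of the CKAN resource and would have to be supplied by the publisher or calculated after download.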
|
||||
|
||||
## Benefit of CKAN and marti
|
||||
|
||||
CKAN is excellent at defining the data source details but it lacks information
|
||||
for load quality. If you have CKAN deployed in your organisation and wish
|
||||
to exchange or process the data referenced in CKAN, then there are synergies between
|
||||
CKAN and marti.
|
||||
|
||||
Samples exist on CKAN integration.
|
||||
|
||||
## Magda and marti
|
||||
|
||||
Another source of data is [Magda](https://magda.io/) which has API metadata
|
||||
definitions. Magda is more about data federation and as such provides
|
||||
functionality on finding data sources and describing the contents.
|
||||
|
||||
The Magda software is able to generate APIs and data content. This does not
|
||||
address the needs of a data processing pipeline when reconciliation is required.
|
||||
|
||||
If you have Magda data sources then synergies exist between Magda and marti.
|
||||
|
|
|
|||
|
|
@@ -1,67 +1,72 @@
|
|||
# Distribution definition
|
||||
|
||||
The distrubution definition describes a single document, though
|
||||
some documents may expand to multiple documents if they are
|
||||
compressed with a utility such as WinZIP or 7ZIP
|
||||
The distribution section defines the files that are grouped
|
||||
together by association. This association is not defined but can
|
||||
include different formats of the same data or a common batch extract
|
||||
such as end of day.
|
||||
|
||||
Some files may expand to multiple files if they are
|
||||
compressed with a utility such as WinZip or 7-Zip. In the situation
|
||||
where a ZIP file expands to multiple documents, then the expectation is
|
||||
that the ZIP file contains a **marti** document describing its contents.
|
||||
|
||||
The elements in the distribution section are:
|
||||
|
||||
* Title
|
||||
* Document name - Commonly being absolute or relative file name.
|
||||
This value could also be a URL address or network path
|
||||
* Issued date - When the document was made available. The date can include time
|
||||
* Modified - When the document was created or modified. This is the date and time
|
||||
* Size of document - The document size in bytes
|
||||
* Hash of document - The hash of the document, which can be blank especially for large documents
|
||||
* Size of file - The file size in bytes
|
||||
* Hash of file - The hash of the file, which can be blank especially for large files
|
||||
* Hash algorithm
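
The elements listed above could be collected for a single file along the following lines. This is a minimal sketch, not the project's tooling: the function name `describe_file` and the dictionary key names are assumptions made for illustration, and SHA-256 is only one possible hash algorithm.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path, title, algorithm="sha256", chunk_size=1024 * 1024):
    """Build a distribution entry for one file (hypothetical key names)."""
    file_path = Path(path)
    digest = hashlib.new(algorithm)
    with file_path.open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    stat = file_path.stat()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "title": title,
        "documentName": file_path.name,
        # Issued is taken as "now", assuming the entry is built at publish time.
        "issued": datetime.now(timezone.utc).isoformat(),
        "modified": modified.isoformat(),
        "byteSize": stat.st_size,
        "hash": digest.hexdigest(),
        "hashAlgorithm": algorithm,
    }
```

A publisher could call `describe_file("extract.csv", "Daily extract")` and serialise the result alongside the data file. For very large files the hash could be left blank, as the list above allows.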
|
||||
|
||||
|
||||
The following are optional in the distribution section.
|
||||
|
||||
* Identifier
|
||||
* Description
|
||||
* Download URL
|
||||
* Version - Document version. The same document coudl be updated or this might denote the next version
|
||||
* Version - File version. The same file could be updated or this might denote the next version
|
||||
of a regular report. For example a daily extract will have the version number incremented
|
||||
every day and provide a new URL. The previous document can be retained.
|
||||
* Format - if not specified then the consumer will in all likelihood use the document extension / mime type
|
||||
every day and provide a new URL. The previous file can be retained.
|
||||
* Format - if not specified then the consumer will in all likelihood use the file extension / mime type
|
||||
* Media Type
|
||||
* Expiry Date - The date and time that this document expires and can be removed from the download URL
|
||||
location. This is not the document retention period as might be required for archiving.
|
||||
* Described By - A link to the metadata describing this document data and format
|
||||
* Expiry Date - The date and time that this file expires and can be removed from the download URL
|
||||
location. This is not the file retention period as might be required for archiving.
|
||||
* Described By - A link to the metadata describing this file data and format
|
||||
* Compression - Type of compression used if any
|
||||
* Encryption - Type of encryption used if any
|
||||
|
||||
|
||||
## Compression
|
||||
|
||||
Documents can be compressed using a utility. A single compressed document can contain
|
||||
multiple documents. The Marti definition document applies to the compressed document
|
||||
and not to the contents, which could be multiple documents.
|
||||
Files can be compressed using a utility. A single compressed file can contain
|
||||
multiple files. The **marti** definition document applies to the compressed file
|
||||
and not to the contents, which could be multiple files.
|
||||
|
||||
In the case of a compressed document, there should be a Marti definition document in the
|
||||
compressed document to match the data document. That is the number of the records in a
|
||||
compressed document should always be an even number.
|
||||
In the case of a compressed file, there should be a **marti** definition document in the
|
||||
compressed file.
|
||||
|
||||
Compression of documents always occur before encryption.
|
||||
Compression of files always occurs before encryption.
|
||||
|
||||
### Marti definition for Compressed Document
|
||||
### Marti definition for Compressed File
|
||||
|
||||
For a compressed document that is not encrypted, the distribution definition will be:
|
||||
For a compressed file that is not encrypted, the distribution definition will be:
|
||||
|
||||
* Title - The compressed document title which could be a group name
|
||||
* Title - The compressed file title which could be a group name
|
||||
* Document name - Commonly being absolute or relative file name.
|
||||
This value could also be a URL address or network path
|
||||
* Issued date - When the compressed document was made available.
|
||||
* Modified - When the compressed document was created or modified. This is the data and time
|
||||
and is not the modified date of the document in the compressed document.
|
||||
* Size of document - The compressed document size in bytes
|
||||
* Hash of document - The hash of the compressed document, which can be
|
||||
blank especially for large documents
|
||||
* Issued date - When the compressed file was made available.
|
||||
* Modified - When the compressed file was created or modified. This is the date and time
|
||||
and is not the modified date of the file in the compressed file.
|
||||
* Size of file - The compressed file size in bytes
|
||||
* Hash of file - The hash of the compressed file, which can be
|
||||
blank especially for large files
|
||||
* Hash algorithm
|
||||
|
||||
The reason for this approach is it allows a generic tool to be deployed to
|
||||
check the validity of the contents without unpacking the received / fetched
|
||||
document. That is you can perform load quality pipeline processing.
|
||||
file. That is, you can perform load quality pipeline processing.
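
A generic check of this kind might look like the sketch below, which compares a received file against the size and hash recorded in its distribution entry without unpacking it. The function name `verify_file` and the key names `byteSize`, `hash` and `hashAlgorithm` are assumptions for illustration only; the defaulting to SHA-256 when no algorithm is given is also an assumption.

```python
import hashlib
from pathlib import Path


def verify_file(path, entry, chunk_size=1024 * 1024):
    """Return a list of mismatches; an empty list means the file reconciles."""
    problems = []
    file_path = Path(path)

    actual_size = file_path.stat().st_size
    if actual_size != entry["byteSize"]:
        problems.append(
            f"size mismatch: expected {entry['byteSize']}, got {actual_size}"
        )

    expected_hash = entry.get("hash")
    if expected_hash:  # The hash may be blank, especially for large files.
        digest = hashlib.new(entry.get("hashAlgorithm") or "sha256")
        with file_path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        if digest.hexdigest().lower() != expected_hash.lower():
            problems.append("hash mismatch")

    return problems
```

Because only the bytes of the compressed file are read, the same check applies to any format, which is what allows it to sit in a load quality pipeline ahead of format-specific processing.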
|
||||
|
||||
## Encryption
|
||||
|
||||
|
|
@@ -72,22 +77,22 @@ provide encryption within the tool execution.
|
|||
If the compression is TAR or GZIP then you may consider applying a GPG
|
||||
or other encryption algorithm to the compressed file.
|
||||
|
||||
* Title - The encrypted document title
|
||||
* Title - The encrypted file title
|
||||
* Document name - Commonly being absolute or relative file name.
|
||||
This value could also be a URL address or network path
|
||||
* Issued date - When the **encrypted** document was made available.
|
||||
* Modified - When the **encrypted** document was created or modified.
|
||||
This is the data and time and is not the modified date of the encrypted document.
|
||||
* Size of document - The **decrypted** document size in bytes
|
||||
* Hash of document - The hash of the **decrypted** document, which can be
|
||||
blank especially for large documents
|
||||
* Issued date - When the **encrypted** file was made available.
|
||||
* Modified - When the **encrypted** file was created or modified.
|
||||
This is the date and time and is not the modified date of the encrypted file.
|
||||
* Size of file - The **decrypted** file size in bytes
|
||||
* Hash of file - The hash of the **decrypted** file, which can be
|
||||
blank especially for large files
|
||||
* Hash algorithm
|
||||
|
||||
The rational for using the decrypted document attributes is that an ecrypted
|
||||
document is unlikely to be able to be modified without knowing encryption keys.
|
||||
Checking the decrypted document attributes is a better check wheer appropriate.
|
||||
The rationale for using the decrypted file attributes is that an encrypted
|
||||
file is unlikely to be able to be modified without knowing encryption keys.
|
||||
Checking the decrypted file attributes is a better check.
|
||||
|
||||
The reason for this approach is it allows a generic tool to be deployed to
|
||||
decrypt and check the validity of the received / fetched document without
|
||||
decrypt and check the validity of the received / fetched file without
|
||||
needing to understand the contents. That is, you can perform load quality
|
||||
pipeline processing.
|
||||
|
|
|
|||
|
|
@@ -1,5 +1,5 @@
|
|||
# Contents
|
||||
|
||||
This is the parent directory for **mati** tools written in various languages.
|
||||
This is the parent directory for **marti** tools written in various languages.
|
||||
|
||||
Please browse the folder for the programming language of interest.
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,3 @@
|
|||
# Place maker
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,3 @@
|
|||
# Place maker
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,3 @@
|
|||
# Place maker
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,3 @@
|
|||
# Place maker
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,3 @@
|
|||
# Place maker
|
||||
|
||||
|
||||
|
|
@@ -0,0 +1,3 @@
|
|||
# Place maker
|
||||
|
||||
|
||||