Uploading Data for a Data Resource

A Generic, General Purpose Research Data Repository Service.

After covering the basics of metadata creation and modification, it’s time to associate the first file with a data resource. Uploading data to a KIT Data Manager based repository is also done via the RESTful API. To tell the repository that you want to access data or data-related metadata, append the path element ‘data’ to the URL of the resource. Below this ‘data’ element you are free to organize and name your data as you wish, which allows you to build filesystem-like hierarchies for each data resource. In addition, each file uploaded to KIT Data Manager is linked to a metadata document called ‘ContentInformation’. This document contains internal information (e.g. the relative path, the depth in the hierarchy and the parent resource), automatically generated metadata (e.g. checksum, size and MIME type), and custom information added by the user (e.g. key-value metadata or tags for grouping content elements). For the time being, let’s stick with the simple case of uploading a single file, as shown in the following example.

$ curl 'http://localhost:8080/api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile.txt' -i -X POST \
    -H 'Content-Type: multipart/form-data' \
    -F 'file=@randomFile.txt;type=multipart/form-data'

Pay attention to three things. First, check the URL: it consists of the base URL of the resource we created before, the aforementioned ‘data’ element and the filename ‘randomFile.txt’. This is the URL the file is uploaded to and from which it can be downloaded later. Second, the Content-Type header is set to ‘multipart/form-data’. You always have to provide this content type when uploading data, otherwise the upload will fail. Third, there is the file to upload. With curl, it must be provided as shown in the example: the -F command line option, the argument name ‘file’ and a reference to the local file, here ‘@randomFile.txt’. How you provide this argument depends on the client or API you are using.


NOTE In the example, the local source file and the destination file on the server are both named ‘randomFile.txt’. This is not required: you can choose an arbitrary name for the file on the server, independent of the original file name.
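Because the server-side name and hierarchy are chosen freely, it can help to build the target URL programmatically. The following is a minimal sketch, not part of the repository's API: the helper name `content_url` is hypothetical, the base URL is the one used in the examples of this section, and it is an assumption that the server accepts percent-encoded path segments.

```python
from urllib.parse import quote

# Base URL taken from the examples in this section; adjust it to your instance.
BASE_URL = "http://localhost:8080/api/v1/dataresources"


def content_url(resource_id: str, relative_path: str, base_url: str = BASE_URL) -> str:
    """Build the URL of a content element below the 'data' path element.

    Hypothetical helper for illustration only.
    """
    # Encode every path segment on its own so that '/' keeps its meaning
    # as hierarchy separator while spaces etc. are percent-encoded.
    segments = [quote(s, safe="") for s in relative_path.strip("/").split("/")]
    return f"{base_url}/{quote(resource_id, safe='')}/data/{'/'.join(segments)}"


resource_id = "edbf964c-f215-4fc6-9ef1-2ff1ea5a811e"
# Same name on the server as the local file:
print(content_url(resource_id, "randomFile.txt"))
# A different server-side name inside a filesystem-like hierarchy:
print(content_url(resource_id, "measurements/run 1/raw.csv"))
```

The first call reproduces the URL used in the curl example above; the second shows a nested path whose space is encoded as `%20`.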


The HTTP request shows how the data is submitted: the file content is sent multipart-encoded, as a single part enclosed by the part boundary.

POST /api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile.txt HTTP/1.1
Content-Type: multipart/form-data; boundary=6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm
Host: localhost:8080

--6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm
Content-Disposition: form-data; name=file; filename=randomFile.txt
Content-Type: multipart/form-data

I7a7EvgrdmDPXVL0tWbqCY6iKwy0D7T0Wx4bvev5juu3Kf89VySewuvNLaUFgBB4
--6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm--
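For clients without built-in multipart support, a request body like the one above can be assembled by hand. The sketch below uses only the Python standard library; the function name `encode_multipart` is hypothetical, and the part content type `application/octet-stream` is an assumption (the curl example sends ‘multipart/form-data’ for the part instead).

```python
import uuid
from typing import Dict, Tuple


def encode_multipart(parts: Dict[str, Tuple[str, bytes, str]]) -> Tuple[bytes, str]:
    """Encode form parts as multipart/form-data.

    `parts` maps the form field name to (filename, content, content type).
    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex  # random string that delimits the parts
    lines = []
    for name, (filename, content, content_type) in parts.items():
        lines.append(f"--{boundary}".encode())
        lines.append(
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode()
        )
        lines.append(f"Content-Type: {content_type}".encode())
        lines.append(b"")  # empty line separates part headers from content
        lines.append(content)
    lines.append(f"--{boundary}--".encode())  # closing boundary
    lines.append(b"")
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart(
    {"file": ("randomFile.txt", b"some file content", "application/octet-stream")}
)
print(content_type)
print(body.decode())
```

The returned header value carries the boundary, so it has to be sent as the Content-Type of the POST request together with the body.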

Once the file has been uploaded, a response with HTTP status 201 (CREATED) is returned, with the file access URL in the Location header.

HTTP/1.1 201 Created
Location: http://localhost:8080/api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile.txt?version=1
Resource-Version: 1
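A client will usually want to keep the returned location and the version it points to. As a small sketch, assuming the Location header always has the form shown above (an optional ‘version’ query parameter), it can be split with the Python standard library; `parse_location` is a hypothetical helper name.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit


def parse_location(location: str) -> Tuple[str, Optional[int]]:
    """Split a Location header value into the plain content URL and the version."""
    parts = urlsplit(location)
    versions = parse_qs(parts.query).get("version", [])
    version = int(versions[0]) if versions else None
    # Drop query and fragment to obtain the URL without the version parameter.
    plain_url = parts._replace(query="", fragment="").geturl()
    return plain_url, version


location = (
    "http://localhost:8080/api/v1/dataresources/"
    "edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile.txt?version=1"
)
url, version = parse_location(location)
print(url)
print(version)
```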

Via the file location you can access both the bitstream of the file and the metadata collected during file ingest. Let’s first take a look at the metadata. To do so, send a simple HTTP GET to the data location and tell the server that you are requesting metadata by adding an Accept header with the value ‘application/vnd.datamanager.content-information+json’, as shown in the curl command below.

$ curl 'http://localhost:8080/api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile.txt' -i -X GET \
    -H 'Accept: application/vnd.datamanager.content-information+json'

The HTTP request holds no surprises: it contains the target URL as well as the Accept header for requesting ContentInformation instead of data.

GET /api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile.txt HTTP/1.1
Accept: application/vnd.datamanager.content-information+json
Host: localhost:8080
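The same request can be prepared from a script. The following sketch uses Python's `urllib.request`; the function names are hypothetical, and `fetch_content_information` assumes a repository is reachable at the given URL, so it is only defined here and not called.

```python
import json
import urllib.request

# Media type that selects the ContentInformation document instead of the data.
CONTENT_INFORMATION = "application/vnd.datamanager.content-information+json"


def metadata_request(url: str) -> urllib.request.Request:
    """Prepare a GET request asking for ContentInformation instead of the bitstream."""
    return urllib.request.Request(
        url, headers={"Accept": CONTENT_INFORMATION}, method="GET"
    )


def fetch_content_information(url: str) -> dict:
    """Send the request and decode the JSON document (requires a running server)."""
    with urllib.request.urlopen(metadata_request(url)) as response:
        return json.load(response)


request = metadata_request(
    "http://localhost:8080/api/v1/dataresources/"
    "edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile.txt"
)
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))
```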

The response contains the current ETag as well as the actual ContentInformation document for the addressed file. All information shown in the following response was added by the server during file ingest.

HTTP/1.1 200 OK
ETag: "-2010363665"
Resource-Version: 1
Content-Type: application/vnd.datamanager.content-information+json
Content-Length: 776

{
  "id" : 1,
  "parentResource" : {
    "id" : "edbf964c-f215-4fc6-9ef1-2ff1ea5a811e",
    "identifier" : {
      "value" : "(:tba)",
      "identifierType" : "DOI"
    },
    "alternateIdentifiers" : [ {
      "value" : "edbf964c-f215-4fc6-9ef1-2ff1ea5a811e",
      "identifierType" : "INTERNAL"
    } ]
  },
  "relativePath" : "randomFile.txt",
  "version" : 1,
  "fileVersion" : "1",
  "versioningService" : "simple",
  "depth" : 1,
  "contentUri" : "file:/tmp/repo-basepath/2022/2/25/edbf964cf2154fc69ef12ff1ea5a811e/randomFile.txt_1648212427612",
  "uploader" : "SELF",
  "mediaType" : "text/plain",
  "hash" : "sha1:4c29d7ed12eb7e0308a595ddaaf2b79a5a14bf2c",
  "size" : 64,
  "metadata" : { },
  "tags" : [ ],
  "filename" : "randomFile.txt"
}
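The ‘size’ and ‘hash’ elements can be used on the client side to check a local or downloaded copy of the file. The sketch below assumes that ‘hash’ always follows the `<algorithm>:<hex digest>` pattern seen in the response above and that the algorithm name is one known to Python's `hashlib`; both function names are hypothetical.

```python
import hashlib


def matches_hash(data: bytes, hash_field: str) -> bool:
    """Compare data against a hash value of the form '<algorithm>:<hex digest>'."""
    algorithm, _, expected = hash_field.partition(":")
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected.lower()


def verify_content(data: bytes, content_information: dict) -> bool:
    """Check size and checksum of data against a ContentInformation document."""
    return len(data) == content_information["size"] and matches_hash(
        data, content_information["hash"]
    )


# Illustrative values only: three bytes and their SHA-1 digest.
info = {"size": 3, "hash": "sha1:a9993e364706816aba3e25717850c26c9cd0d89d"}
print(verify_content(b"abc", info))
```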

Of course, you can also provide a ContentInformation document during upload, e.g. if you plan to add key-value metadata or tags to a file. In that case you do basically the same as before, but add a second argument named ‘metadata’ that contains a JSON file with the ContentInformation document.

$ curl 'http://localhost:8080/api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile2.txt' -i -X POST \
    -H 'Content-Type: multipart/form-data' \
    -F 'file=@randomFile2.txt;type=multipart/form-data' \
    -F 'metadata=@metadata.json;type=application/json'

The HTTP request now contains two content elements: the file ‘randomFile2.txt’ and the metadata ‘metadata.json’. You can choose an arbitrary name for the metadata file, but its content must match the ContentInformation model. The example shows the content of the metadata file. The only relevant element is ‘metadata’, which contains a single key ‘test’ with the value ‘ok’. The file also contains other elements, e.g. depth and size. Elements like these, as well as filename, cannot be set by the user: if a value is assigned, it is overwritten during ingest, with one exception explained later.

POST /api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile2.txt HTTP/1.1
Content-Type: multipart/form-data; boundary=6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm
Host: localhost:8080

--6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm
Content-Disposition: form-data; name=file; filename=randomFile2.txt
Content-Type: multipart/form-data

I7a7EvgrdmDPXVL0tWbqCY6iKwy0D7T0Wx4bvev5juu3Kf89VySewuvNLaUFgBB4
--6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm
Content-Disposition: form-data; name=metadata; filename=metadata.json
Content-Type: application/json

{"versioningService":"none","depth":0,"size":0,"metadata":{"test":"ok"},"tags":[]}
--6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm--
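The metadata document does not have to be written by hand. As a sketch, it can be generated with Python's `json` module. It is an assumption that the server accepts a document containing only the user-settable elements ‘metadata’ and ‘tags’; the function name is hypothetical, and tag entries are passed through unchanged because their structure is not shown in this section.

```python
import json
from typing import Optional


def content_information_document(
    metadata: Optional[dict] = None, tags: Optional[list] = None
) -> str:
    """Serialize the user-provided part of a ContentInformation document."""
    document = {
        "metadata": dict(metadata or {}),  # key-value metadata
        "tags": list(tags or []),          # tags for grouping content elements
    }
    return json.dumps(document)


# Write the document to the file referenced by the 'metadata' form argument.
with open("metadata.json", "w", encoding="utf-8") as handle:
    handle.write(content_information_document({"test": "ok"}))

print(content_information_document({"test": "ok"}))
```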

The response to the upload including content information has the same form as in the first upload example. If HTTP status 201 (CREATED) is returned, everything worked out fine, and the file as well as the ContentInformation document can be accessed via the Location provided in the header.

HTTP/1.1 201 Created
Location: http://localhost:8080/api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/randomFile2.txt?version=1
Resource-Version: 1

Finally, in some cases it might not be possible or desirable to upload a certain file to the repository, e.g. due to size limitations or licensing issues. For these cases it is possible to just register a remotely stored file at the local repository. In this scenario, the file argument is omitted during upload and only the ContentInformation metadata document is submitted, containing the remote URL in the ‘contentUri’ field, as shown in the following example.

$ curl 'http://localhost:8080/api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/referencedContent' -i -X POST \
    -H 'Content-Type: multipart/form-data' \
    -F 'metadata=@metadata.json;type=application/json'

This special scenario also involves the exception mentioned above. As there is no guarantee that the file referenced by ‘contentUri’ is accessible to the repository, and for performance reasons, the file is neither checked nor downloaded at ingest time. Therefore, the system assigns no size or checksum information. If you know these values and want to make them available, you may provide the size and checksum elements of the ContentInformation document at upload time. Only in this case they are not overwritten, and they will be available later regardless of whether they are correct.

POST /api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/referencedContent HTTP/1.1
Content-Type: multipart/form-data; boundary=6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm
Host: localhost:8080

--6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm
Content-Disposition: form-data; name=metadata; filename=metadata.json
Content-Type: application/json

{"versioningService":"none","depth":0,"contentUri":"https://www.google.com","size":0,"metadata":{},"tags":[]}
--6o2knFse3p53ty9dmcQvWAIx1zInP11uCfbm--
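If a local copy of the referenced file is at hand, size and checksum can be computed before registering the reference. The following sketch is written under two assumptions: that the checksum is submitted in the ‘hash’ element, and that it uses the `sha1:<hex digest>` form seen in the server-generated ContentInformation earlier. The function name and the example URL are hypothetical.

```python
import hashlib
import json
from typing import Optional


def reference_document(content_uri: str, local_copy: Optional[bytes] = None) -> str:
    """Build a ContentInformation document that registers remotely stored content."""
    document = {"contentUri": content_uri, "metadata": {}, "tags": []}
    if local_copy is not None:
        # The server does not check these values, so they must be correct here.
        document["size"] = len(local_copy)
        document["hash"] = "sha1:" + hashlib.sha1(local_copy).hexdigest()
    return json.dumps(document)


print(reference_document("https://example.org/data/large-file.bin", b"abc"))
```

Since the repository stores these values as provided, computing them from the actual bytes avoids publishing a size or checksum that does not match the remote file.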

The response to the ‘virtual upload’ looks as before. However, keep in mind that what is behind the contentUri is NOT CHECKED at upload time. This happens only at download time, which is covered in the next example.

HTTP/1.1 201 Created
Location: http://localhost:8080/api/v1/dataresources/edbf964c-f215-4fc6-9ef1-2ff1ea5a811e/data/referencedContent?version=1
Resource-Version: 1