Changes

Paul · 311a59eb
--- a/system/Import-data-using-the-RUB-importer.md
+++ b/system/Import-data-using-the-RUB-importer.md
@@ -2,4 +2,107 @@
 ⚠️ Note: This page is within the `system` namespace/directory because as of 2024-09-20, the RUB importer in ReSeeD only supports importing from a subdirectory of the same S3 share ("share" meaning "parent of an S3 bucket") that is also used by ReSeeD to store the imported data (configured using the `S3_ENDPOINT`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`, and `S3_REGION` variables in the `.env` file). The name of the S3 bucket from which the RUB importer imports the data is configured in the `S3_FILE_UPLOAD_BUCKET` variable in the `.env` file. Because access to this S3 share requires sysadmin access anyway, this page belongs into the `system` directory of the wiki for the time being.
-This is not documented yet, pending https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/-/issues/407.
\ No newline at end of file
+This process uses the Bulkrax *CSV from S3 parser* to do imports. Data is prepared in CSV files, with commas used to separate columns, and semi-colons used (in some cases) to separate values within a single column.
+## Prepare the data
+The data to be imported needs to have the following file structure:
+* There should be a file called `metadata.csv`
+	* The format of the columns in the `metadata.csv` file is explained in the *Metadata CSV Format* section (below)
+	* The CSV file should contain one row for each dataset to be imported
+		* Each row should give, in the `dataset_path` column, the path to the dataset relative to the directory containing `metadata.csv`.
+* Within each dataset path, there should be a directory named `data` where all the data for the dataset is placed.
+	An example data structure for 2 datasets is shown below
+	```
+	cl-reseed_import/set1/
+	├── dataset1
+	│   └── data
+	│       └── 1529
+	│           ├── folder_1
+	│           │   ├── another_file.exe
+	│           │   └── some_other_file.json
+	│           ├── my_software.exe
+	│           └── mydata.json
+	├── dataset2
+	│   └── data
+	│       ├── AV02CP07GI0
+	│       │   ├── anat
+	│       │   │   └── sub-AV02CP07GI0_T1w.nii
+	│       │   └── func
+	│       │       └── sub-AV02CP07GI0_task-rest_bold.nii
+	│       ├── CHANGES
+	│       ├── README
+	│       ├── dataset_description.json
+	│       ├── participants.json
+	│       └── participants.tsv
+	└── metadata.csv
+	```
+* The example zip file [Example_RUB_import_data.zip](https://gitlab.ruhr-uni-bochum.de/-/project/864/uploads/afa04d2c5f75c5e5a7d5e69248a1f58e/Example_RUB_import_data.zip) has the datasets and the `metadata.csv` structured as needed. 
+### Steps to run an import
+1. Upload the data you want to import (for example: the unzipped data in [Example_RUB_import_data.zip](https://gitlab.ruhr-uni-bochum.de/-/project/864/uploads/671860fcf003818d516cf4a6d11b8e20/Example_RUB_import_data.zip)) into the S3 bucket that ReSeeD has access to.
+	 For example: `cl-reseed_import`
+	 Enter this bucket name in the `Specify a bucket name with prefix` field of the importer form.
+2. Log into RDMS as an administrator.
+3. On the dashboard you should see the options Importers and Exporters. Click on Importers. 
+	 ![image](./Dashboard_showing_import_and_export.png)
+4. On the Importers page, click `New` in the top left corner. This opens the importer form.
+	 ![image](./Screenshot_2024-11-20_at_12.50.19.png)
+5. Fill in the importer form with the values shown in the screenshot above and listed in the table below, then click **Create and import**.
+	 | Field name                        | Value                                                        | Note                                                         |
+	 | --------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+	 | Name                              | Any identifiable name for the import                         |                                                              |
+	 | Administrative Set                | RUB publication workflow                                     | This will apply this workflow to all imported datasets       |
+	 | Frequency                         | Once                                                         | We are running a one-off import                              |
+	 | Limit                             | 0 or leave blank                                             | This will import all records in the metadata.csv file        |
+	 | Parser                            | CSV from S3 - ReSeed CSV parser for work (Datasets) from local S3 | This will choose the parser for ReSeed                       |
+	 | Visibility                        | Private                                                      | The workflow will need all datasets to be private until published |
+	 | Rights statement                  | Leave blank                                                  | It will pick up the rights statement from the csv file       |
+	 | Specify a bucket name with prefix | cl-reseed_import                                             | The bucket name with the prefix.<br />You could also add a path within the bucket, for example: <br />`cl-reseed_import/set1` |
+## The Metadata CSV Format
+| **Column header**           | **Cardinality** | **Format**                                                   | **Example 1**                                                | **Example 2**                                                |
+| --------------------------- | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| **title**                   | One             | String  The title of the dataset                             | Test dataset 1 for import                                    | Test dataset 2 for import                                    |
+| **dataset_path**            | One             | String  Folder path within the bucket                        | dataset1                                                     | dataset2                                                     |
+| **alternative_title**       | Zero or more    | String    The alternative title(s) of the dataset    Multiple values should be separated with a semi-colon. | The rhythms of old men who hit things with sticks            | The rhythms of old men who hit things with sticks; Huh?      |
+| **description**             | Zero or one     | String  Description of the dataset                           | A collection of rhythms from veteran rock drummers           | A collection of rhythms from veteran rock drummers           |
+| **contributor**             | Zero or more    | Names should be entered in the format: **LAST_NAME, FORENAME(S).**   Multiple contributors should be separated with a semi-colon.   The order of names is significant in relating them to: contributor_orcid contributor_affiliation | Starr, Ringo; Bonham, John; Densmore, John; Moon, Keith      | Starr, Ringo; Bonham, John; Densmore, John; Moon, Keith      |
+| **contributor_orcid**       | Zero or more    | ORCIDS should be entered in their full https format.   The order of ORCIDS is significant in relating them to contributor.   ORCIDS should be separated with a semi-colon. It should ideally have the same number of semi-colons as contributor. | ;;https://orcid.org/0000-0001-5109-3700;                     | https://orcid.org/0000-0001-0001-3700;;;                     |
+| **contributor_affiliation** | Zero or more    | String    The order of affiliations is significant in relating them to contributor.    Affiliations should be separated with a semi-colon. It should ideally have the same number of semi-colons as contributor. | The Beatles; Led Zeppelin; The Doors; The Who                | The Beatles;;The Doors;                                      |
+| **creator**                 | One or more     | Names should be entered in the format: LAST_NAME, FORENAME(S)  Multiple creators should be separated with a semi-colon.  The order of names is significant in relating them to: creator_orcid creator_affiliation | Lennon, John                                                 | Lennon, John; McCartney, Paul                                |
+| **creator_orcid**           | One or more     | ORCIDS should be entered in their full https format.  The order of ORCIDS is significant in relating them to creator.   ORCIDS should be separated with a semi-colon. It should ideally have the same number of semi-colons as creator. | https://orcid.org/0000-0001-5109-3700                        | https://orcid.org/0000-0001-5109-3700;https://orcid.org/0000-0001-5109-3701 |
+| **creator_affiliation**     | One or more     | String   The order of affiliations is significant in relating them to creator   Affiliations should be separated with a semi-colon. It should ideally have the same number of semi-colons as creator. | The Beatles                                                  | The Beatles;The Beatles                                      |
+| **keyword**                 | One or more     | String    Multiple keywords should be separated with a semi-colon | drumming                                                     | drumming; pop stars                                          |
+| **resource_type**           | One or more     | Must be one or more of: *Book* *BookChapter* *Collection* *ComputationalNotebook* *ConferencePaper* *DataPaper* *Dataset* *Dissertation* *Event* *Image* *InteractiveResource* *Journal* *JournalArticle* *Model* *OutputManagementPlan* *PeerReview* *PhysicalObject* *Preprint* *Report* *Service* *Software* *Sound* *Standard* *Text* *Workflow* *Other*   If the value is not one of the allowed values, we will set it to Dataset | Dataset                                                      | Dataset                                                      |
+| **license**                 | One             | Must be one of: http://rightsstatements.org/vocab/InC/1.0/ https://creativecommons.org/licenses/by/4.0/ https://creativecommons.org/licenses/by-sa/4.0/ https://creativecommons.org/licenses/by-nd/4.0/ https://creativecommons.org/licenses/by-nc/4.0/ https://creativecommons.org/licenses/by-nc-nd/4.0/ https://creativecommons.org/licenses/by-nc-sa/4.0/ http://creativecommons.org/publicdomain/zero/1.0/ http://creativecommons.org/publicdomain/mark/1.0/ http://www.apache.org/licenses/LICENSE-2.0 http://www.gnu.org/licenses/gpl.html http://opensource.org/licenses/MIT   If the license URI is not one of the allowed values, we will ignore it | http://creativecommons.org/publicdomain/mark/1.0/            | http://opensource.org/licenses/MIT                           |
+| **date**                    | Zero or more    | Dates should be entered in the format: `YYYY-MM-DD <DATE-TYPE>`. For example, 2024-05-29 Created   Multiple dates should be separated with a semi-colon.   Each date must have a date type which must be one of the following: *Accepted* *Available* *Copyrighted* *Collected* *Created* *Deposited* *Published* *Recorded* *Registered* *Submitted* *Updated* *Archived*   *If the date type is not one of the allowed values, we will ignore the date and the type*   *The dates entered here are all metadata dates.*   *The system dates are saved in create_date, date_modified, modified_date, date_uploaded*   *The Published date, if entered above, will be overwritten when you go through the submission and review workflow.* | 2024-05-29 Created; 2024-06-10 Published                     | 2024-05-29 Created; 2024-06-10 Published                     |
+| **subject**                 | Zero or more    | String    Multiple subjects should be separated with a semi-colon | drumming                                                     | Drumming; music                                              |
+| **language**                | Zero or more    | String    Multiple languages should be separated with a semi-colon | English                                                      | English                                                      |
+| **location**                | Zero or more    | String    Multiple locations should be separated with a semi-colon | London                                                       |                                                              |
+| **software_version**        | Zero or more    | String    Multiple software versions should be separated with a semi-colon |                                                              |                                                              |
+| **funder_identifier**       | Zero or more    | Identifiers should be entered as full URIs  Multiple funder identifiers should be separated with a semi-colon  The order of identifiers is significant in relating them to: funder_name award_number award_uri award_title | http://dx.doi.org/10.13039/501100001659                      | http://dx.doi.org/10.13039/501100001659;http://dx.doi.org/10.13039/50110000165999 |
+| **funder_name**             | Zero or more    | Multiple funder names should be separated with a semi-colon. It should ideally have the same number of semi-colons as funder_identifier.     The order of funder names is significant in relating them to: funder_identifier award_number award_uri award_title | DFG                                                          | DFG;RUB                                                      |
+| **award_number**            | Zero or more    | Multiple award numbers should be separated with a semi-colon. It should ideally have the same number of semi-colons as funder_identifier.  The order of award numbers is significant in relating them to: funder_identifier funder_name award_uri award_title | A0001                                                        | A0001;W3asxa3                                                |
+| **award_uri**               | Zero or more    | Multiple award URIs should be separated with a semi-colon. It should ideally have the same number of semi-colons as funder_identifier.  The order of award URIs is significant in relating them to: funder_identifier funder_name award_number award_title |                                                              |                                                              |
+| **award_title**             | Zero or more    | Multiple award titles should be separated with a semi-colon. It should ideally have the same number of semi-colons as funder_identifier.  The order of award titles is significant in relating them to: funder_identifier funder_name award_number award_uri |                                                              |                                                              |
\ No newline at end of file
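
The layout and column rules added in this change could be checked locally before the data is uploaded to the bucket. The following is a minimal sketch under stated assumptions, not part of the RUB importer or Bulkrax: the names `PAIRED_COLUMNS`, `split_values` and `check_metadata` are hypothetical, the grouping of order-sensitive columns is read off the *Metadata CSV Format* table, and the data is assumed to sit in a local directory that mirrors the bucket prefix.

```python
import csv
from pathlib import Path

# Columns whose values are matched by position, as described in the
# Metadata CSV Format table. This grouping is an assumption of the sketch,
# not something taken from the importer's code.
PAIRED_COLUMNS = {
    "contributor": ["contributor_orcid", "contributor_affiliation"],
    "creator": ["creator_orcid", "creator_affiliation"],
    "funder_identifier": ["funder_name", "award_number", "award_uri", "award_title"],
}


def split_values(cell):
    """Split a multi-valued cell on semi-colons, keeping empty slots."""
    if cell is None or cell.strip() == "":
        return []
    return [part.strip() for part in cell.split(";")]


def check_metadata(import_dir):
    """Return a list of human-readable problems found under import_dir."""
    root = Path(import_dir)
    csv_path = root / "metadata.csv"
    if not csv_path.is_file():
        return [f"missing {csv_path}"]

    problems = []
    with csv_path.open(newline="", encoding="utf-8") as handle:
        # Row numbers start at 2 because line 1 is the header
        # (assumes no cell contains a line break).
        for number, row in enumerate(csv.DictReader(handle), start=2):
            dataset_path = (row.get("dataset_path") or "").strip()
            if not dataset_path:
                problems.append(f"row {number}: dataset_path is empty")
            elif not (root / dataset_path / "data").is_dir():
                problems.append(f"row {number}: {dataset_path}/data is not a directory")

            for lead, followers in PAIRED_COLUMNS.items():
                expected = len(split_values(row.get(lead)))
                for name in followers:
                    found = len(split_values(row.get(name)))
                    # An empty follower column is accepted; a partial one is flagged.
                    if found and found != expected:
                        problems.append(
                            f"row {number}: {name} has {found} values, {lead} has {expected}"
                        )
    return problems
```

Calling `check_metadata("set1")` on a local copy of the example layout would return an empty list when every `dataset_path` has a `data` directory and the order-sensitive columns line up. Empty slots are kept when splitting, so a value such as `;;https://orcid.org/0000-0001-5109-3700;` counts as four positions, matching four contributors.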