Changes

Paul · 6381bfc3
--- a/system/development-notes/Using-Bulkrax-CRC1280-Folder-Parser-to-do-Imports.md
+++ b/system/development-notes/Using-Bulkrax-CRC1280-Folder-Parser-to-do-Imports.md
+## Using Bulkrax *CRC1280 folder parser* to do imports
+
+### Prepare the data
+
+The Bulkrax *CRC1280 folder parser* expects the CRC1280 data to follow a specific folder hierarchy: the folder path you import must contain the data nested as group/experiment/subject/session/modality.
+
+The text file [import_data.txt](/system/development-notes/import_data.txt) shows an example `import_data` tree.
+
+Copy the directory to be imported into `CRC_FOLDER_IMPORT_PATH` (the import data directory that is mounted into the web and workers containers, as explained below) so that it is available to the importer.
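+
+Before starting an import, it can help to confirm that the copied folder has the expected depth. The function below is a minimal sketch and not part of RDMS or Bulkrax: `list_modality_dirs` is a hypothetical helper name, and it assumes that the folder you import directly contains the group folders, so that modality folders sit exactly five levels below it.
+
+```shell
+# Hypothetical helper, not part of RDMS or Bulkrax.
+# Assumption: the import folder directly contains the group folders, so
+# group/experiment/subject/session/modality puts modality folders at depth 5.
+list_modality_dirs() {
+  # Print every directory exactly five levels below the folder given as $1.
+  find "$1" -mindepth 5 -maxdepth 5 -type d | sort
+}
+```
+
+For example, `list_modality_dirs "$CRC_FOLDER_IMPORT_PATH/test1"` run on the host lists the modality folders of a folder named `test1`. An empty result suggests that the folder does not follow the expected nesting.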
+
+*Note for developers / system administrators*
+
+* The location of the test data folder on the host machine (from where you will be running docker) needs to be added to the [.env file](https://gitlab.ruhr-uni-bochum.de/researchdata/rdms/-/blob/develop/.env.template#L90).
+
+* The environment variable is `CRC_FOLDER_IMPORT_PATH`.
+
+* This is mapped to the path `/rub-test-data` inside the docker containers (as defined in the [docker-compose.yml](https://gitlab.ruhr-uni-bochum.de/researchdata/rdms/-/blob/develop/docker-compose.yml#L99)).
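+
+As a minimal sketch, the entry in the `.env` file might look like the fragment below. The host path `/srv/rdms/import-data` is a hypothetical placeholder, not a value taken from the project; replace it with the directory on your host that holds the data to import.
+
+```shell
+# .env (fragment). The host path below is a hypothetical example.
+CRC_FOLDER_IMPORT_PATH=/srv/rdms/import-data
+```
+
+Given the mapping described above, a folder `test1` placed in that host directory would then be visible as `/rub-test-data/test1` inside the containers.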
+
+ℹ After changing the contents of the `.env` file, stop the containers and start them again (updated values only apply after restarting the containers):
+```shell
+$ docker compose -f docker-compose.yml down
+$ docker compose -f docker-compose.yml up -d
+```
+
+#### Import RUB S3 test data of CRC1280
+
+1. Mount the S3 bucket (Ceph) containing the test data with `rclone mount`:
+   `rclone mount --daemon s3-rdms-test:fowi-rdms-testbucket/20220425_Test_GroupData /root/rdms.develop/hyrax/rub-test-data/20220425_Test_GroupData/`
+   
+   ⚠️ For reasons not yet fully understood, the `rclone mount` process used to mount the S3 bucket into the bulk importer directory must be run as `root`, even if the RDMS containers are started by a non-root user. We suspect that this is a limitation of the kernel's `fuse` module, and that the `rclone mount` process must be run under the same UID as the importer processes inside the docker containers (which run as root, too).
+
+### Steps to run an import
+1. Log into RDMS as an administrator.
+
+2. On the dashboard you should see the options Importers and Exporters. Click on Importers. 
+
+   ![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/a921a9c3512521ab988282e306edb9d2/Dashboard_showing_import_and_export.png)
+
+3. On the Importers page, click `New` in the top left corner. This opens the importer form.
+
+   ![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/dd7547953559a3b4d3057848208bfd61/Importers.png)
+
+4. Fill in the Importer form
+
+   ![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/3a8646a72252b52bce19aabb700691f4/new_importer.png)
+
+   Name - Any name you would like to give the import job <br><br>
+   Administrative set - choose the default admin set <br><br>
+   Frequency - once <br><br>
+   Limit - leave empty <br><br>
+   Parser - CRC1280 folder parser <br><br>
+   Visibility - What you would like the visibility of the imported records to be.  <br><br>
+   Add folder path - Specify a path on the server <br><br>
+   Import file path - /rub-test-data/test1/ <br>
+   Note:
+   
+   * In this example, the folder you want to import is **test1**.
+   * **test1** should contain the CRC1280 data starting from the group, as shown in [import_data.txt](/system/development-notes/import_data.txt).
+   * The data must already have been copied to the shared mount, as described in the section *Prepare the data*.
+
+## Checking progress of the import
+
+### Importers page / dashboard
+
+   The Importers page gives an overview of the status of each importer.
+
+   ![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/dd7547953559a3b4d3057848208bfd61/Importers.png)
+
+### View page for each Importer
+
+   Clicking on an importer shows the details of the current status of the import. You can also re-run an import from this page.
+
+   **Import jobs are running**
+
+   ![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/30448982751ce8b2c38bd9e9dec8cb39/importer_running.png)
+
+   You should be able to monitor the status of the import of each job and view errors, if any.
+
+### Sidekiq interface to monitor background jobs
+
+   You can also monitor the background jobs running in Hyrax, including those created by the importer, in the Sidekiq interface.
+
+   Sidekiq is available at the endpoint `/sidekiq` (for example https://rdms.cottagelabs.com/sidekiq).
+
+   You need to be logged in as an administrator to view Sidekiq.
+
+   From the interface you can monitor and administer all of the jobs in the queues and their status.
+
+   ![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/d16c35ade066246e966d96560762f9ba/sidekiq.png)
+
+**Note**
+
+There is one job you cannot monitor from the importer dashboard: the job that creates the relationships between the collections, works and filesets. It is scheduled to run after all of the collections, works and filesets have been imported. If this job is not run, the uploaded files will not be associated with the filesets.
+
+
+![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/4192f5e1ac8e1ba09e7c3212e7fb34d3/scheduled_job.png)
+
+### CSV file created by the importer
+
+The importer first creates a CSV file for the folder to be imported. This file is then imported with a customised CSV importer.
+
+The CSV files are stored at `rdms/hyrax/tmp/imports/`, where `rdms` is the root of the checked-out source directory from which the `workers` docker container is run.
+
+### `Mark as completed` and `Rerun` buttons
+
+1. Mark as completed
+
+    This button is displayed when the numbers show that the import has completed (the total and the number processed are the same, and the number failed is 0), but Bulkrax has got its counting wrong and thinks the import is still pending (as in the screenshot below). In such a case, it is safe to click on `Mark as completed`. It changes the status of the import from pending to completed and does nothing else.
+
+    ![Screenshot_20240131_101755](/uploads/6fb05dee1140fc4f0075fa621f30d066/Screenshot_20240131_101755.png)
+
+    ![Screenshot_20240131_101755-highlighted](/uploads/0ddee51ed8b92c812ebda9bcde1e6d3b/Screenshot_20240131_101755-highlighted.png)
+
+2. Rerun
+
+    This button is displayed when an import has completed, but with errors. When this button is clicked, it cycles through all the entries in the importer and reruns all the failed jobs.
+
+    Before clicking it, it is worth checking in Sidekiq (for example https://rdms.cottagelabs.com/sidekiq/; you need to be logged in as an administrator to view this URL) whether the job is still being processed (in busy, enqueued or retries). In our experience, an error like `Failed to acquire lock` goes away on a retry.
+
+    If there are no jobs listed in sidekiq, you can click on rerun to rerun all the failed jobs.
+
+    ![Screenshot_20240322_104759](/uploads/3eebde480ec32b8d22df1746f4639d49/Screenshot_20240322_104759.png)
+
+### Points to note
+
+When a new CRCDataset is imported with the same collection name as an existing CRC1280 collection (either previously imported or created by a user), it will create a new collection, rather than reuse the one previously created.
+
+When a previously imported CRCDataset is imported again, it will create a new collection and a new experiment, rather than reuse the ones previously created.
\ No newline at end of file