Update Using Bulkrax CRC1280 Folder Parser to do Imports: Added section "Importing a subset of experiments from a file share" (authored by Pascal Ernster)
## Using Bulkrax *CRC1280 folder parser* to do imports
### Prepare the data
The Bulkrax *CRC1280 folder parser* expects the CRC1280 data to follow a specific directory hierarchy: below the folder path given to the importer, the data must be nested as group/experiment/subject/session/modality.
The text file [import_data.txt](/system/development-notes/import_data.txt) shows an example import data tree.
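
As a rough illustration of that nesting, the sketch below creates an empty skeleton with one directory per level. All directory names (`test1`, `group_A`, `experiment_01` and so on) and the `IMPORT_ROOT` variable are hypothetical placeholders; `import_data.txt` remains the authoritative description of what the parser expects.

```shell
#!/bin/sh
# Hypothetical names throughout; compare the result with import_data.txt.
import_root="${IMPORT_ROOT:-./test1}"   # folder that would later be given to the importer

# One directory level each for group / experiment / subject / session / modality.
mkdir -p "$import_root/group_A/experiment_01/subject_001/session_01/modality_eeg"

# List the resulting directories, sorted, to check the nesting by eye.
find "$import_root" -type d | sort
```

The data files themselves would then go into the modality folders.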
Copy the directory to be imported into `CRC_FOLDER_IMPORT_PATH` (the import data directory that is mounted into the web and workers containers, as explained below) so that it is available to the importer.
*Note for developers / system administrators*
* The location of the test data folder on the host machine (from where you will be running docker) must be added to the [.env file](https://gitlab.ruhr-uni-bochum.de/researchdata/rdms/-/blob/develop/.env.template#L90).
* The environment variable is `CRC_FOLDER_IMPORT_PATH`.
* This is mapped to the path `/rub-test-data` inside the containers (as defined in the [docker-compose.yml](https://gitlab.ruhr-uni-bochum.de/researchdata/rdms/-/blob/develop/docker-compose.yml#L99)).
ℹ After changing the contents of the `.env` file, stop the containers and start them again (updated values only apply after restarting the containers):
```shell
$ docker compose -f docker-compose.yml down
$ docker compose -f docker-compose.yml up -d
```
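
After the restart it can help to confirm which host path is configured and that the containers see it. The following is only a sketch: the helper name `env_value` and the `ENV_FILE` variable are made up for this example, it assumes plain unquoted `KEY=value` lines in `.env`, and the service name `web` in the commented command is an assumption based on the container names mentioned above.

```shell
#!/bin/sh
# Print the value of KEY from an env-style file (KEY=value lines; last match wins).
env_value() {
  sed -n "s/^$1=//p" "$2" | tail -n 1
}

env_file="${ENV_FILE:-.env}"
if [ -f "$env_file" ]; then
  echo "Host import path: $(env_value CRC_FOLDER_IMPORT_PATH "$env_file")"
fi

# Then compare with what the container sees (service name `web` is an assumption):
#   docker compose -f docker-compose.yml exec web ls /rub-test-data
```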
#### Import RUB S3 test data of CRC1280
1. Use `rclone mount` to mount the S3 bucket (Ceph) that holds the test data:
`rclone mount --daemon s3-rdms-test:fowi-rdms-testbucket/20220425_Test_GroupData /root/rdms.develop/hyrax/rub-test-data/20220425_Test_GroupData/`
⚠️ For reasons not yet fully understood, the `rclone mount` process used to mount the S3 bucket into the bulk importer directory must be run as `root`, even if the RDMS containers are started by a non-root user. We suspect that this is a limitation of the kernel's `fuse` module, and that the `rclone mount` process must be run under the same UID as the importer processes inside the docker containers (which run as root, too).
##### Importing a subset of experiments from a file share
The CRC1280 importer will always import *all* data from `/home/reseed/reseed/bulk-ingest` (which is mounted as `/rub-test-data` inside the `web` and the `worker` containers). If you want to import only a subset of the experiments on a given file share, you will have to manually construct, in `/home/reseed/reseed/bulk-ingest`, the directory structure that the CRC1280 importer expects for the experiments to be imported, and then manually mount the corresponding directory for each experiment individually.
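
The manual procedure described above could be scripted roughly as follows. This is a sketch under several assumptions: the file share is already reachable on the host under a hypothetical `SHARE_ROOT`, group folders sit directly below the ingest directory, and a bind mount is used as the mount mechanism (the source does not say which mechanism is used). The function `prepare_experiment` and all group and experiment names are illustrative. The mount command is printed rather than executed, so it can be reviewed first.

```shell
#!/bin/sh
# SHARE_ROOT is a hypothetical location of the already-mounted file share.
SHARE_ROOT="${SHARE_ROOT:-/mnt/fileshare}"
INGEST_ROOT="${INGEST_ROOT:-/home/reseed/reseed/bulk-ingest}"

# Create the mount point for one experiment and print a matching mount command.
prepare_experiment() {
  group="$1"
  experiment="$2"
  target="$INGEST_ROOT/$group/$experiment"
  mkdir -p "$target"
  # Printed, not executed; a bind mount is only one possible mechanism.
  echo "mount --bind '$SHARE_ROOT/$group/$experiment' '$target'"
}
```

For example, `prepare_experiment group_A experiment_01` creates the empty mount point below the ingest directory and prints the command to run as `root` for that experiment; repeat it for each experiment to be imported.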
### Steps to run an import
1. Log into RDMS as an administrator.
2. On the dashboard you should see the options Importers and Exporters. Click on Importers.
![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/a921a9c3512521ab988282e306edb9d2/Dashboard_showing_import_and_export.png)
3. On the Importers page, click on `New` in the top left corner. This opens the importer form.
![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/dd7547953559a3b4d3057848208bfd61/Importers.png)
4. Fill in the importer form.
![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/3a8646a72252b52bce19aabb700691f4/new_importer.png)
Name - Any name you would like to give the import job <br><br>
Administrative set - Choose the default admin set <br><br>
Frequency - Once <br><br>
Limit - Leave empty <br><br>
Parser - CRC1280 folder parser <br><br>
Visibility - The visibility you would like the imported records to have <br><br>
Add folder path - Specify a path on the server <br><br>
Import file path - `/rub-test-data/test1/` <br>
Note:
* The folder you want to import is **test1**.
* **test1** should contain the CRC1280 data starting from the group, as shown in [import_data.txt](/system/development-notes/import_data.txt).
* The data has been copied to the shared mount as described in the section *Prepare the data*.
## Checking progress of the import
### Importers page / dashboard
The Importers page gives an overview of the status of each importer.
![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/dd7547953559a3b4d3057848208bfd61/Importers.png)
### View page for each Importer
Clicking on an importer shows details of the current status of the import. You can also re-run an import from this page.
**Import jobs are running**
![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/30448982751ce8b2c38bd9e9dec8cb39/importer_running.png)
You should be able to monitor the status of the import of each job and view errors, if any.
### Sidekiq interface to monitor background jobs
You can also monitor background jobs running in Hyrax, including the background jobs created by the importer, in the Sidekiq interface.
Sidekiq is available at the endpoint `/sidekiq` (for example https://rdms.cottagelabs.com/sidekiq).
You need to be logged in as an administrator to view Sidekiq.
From the interface you can monitor and administer all of the jobs in the queues and their status.
![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/d16c35ade066246e966d96560762f9ba/sidekiq.png)
**Note**
There is one job you cannot monitor from the importer dashboard: the job scheduled to run after all of the collections, works and filesets have been imported. This job creates the relationships between the collections, works and filesets. If it is not run, the uploaded files will not be associated with the filesets.
![image](https://gitlab.ruhr-uni-bochum.de/FDM/rdm-system/antleaf-projectmanagement/uploads/4192f5e1ac8e1ba09e7c3212e7fb34d3/scheduled_job.png)
### CSV file created by the importer
The importer first creates a CSV file for the folder to be imported. This is then imported with a customised CSV importer.
The CSV files are stored at `rdms/hyrax/tmp/imports/`, where `rdms` is the root of the checked-out source directory from which the `workers` docker container is running.
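
To find the files written by a recent import, the directory can be listed by modification time. A small sketch, assuming it is run from the directory that contains the `rdms` checkout; the helper name `latest_imports` is made up for this example, and nothing is assumed about how the importer names its files.

```shell
#!/bin/sh
# List the most recently modified entries in the importer's output directory.
latest_imports() {
  dir="${1:-rdms/hyrax/tmp/imports}"
  count="${2:-5}"
  ls -t "$dir" | head -n "$count"
}
```

Calling `latest_imports` without arguments shows the five newest entries; a different directory and count can be passed as the first and second argument.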
### `Mark as completed` and `Rerun` buttons
1. Mark as completed
This button is displayed when the numbers show that the import has completed (total and number processed are the same, and the number failed is 0), but Bulkrax has miscounted and still considers the import pending (as in the screenshot below). In such a case, it is safe to click on `Mark as completed`. It will change the status of the import from pending to completed (and does nothing else).
![Screenshot_20240131_101755](/uploads/6fb05dee1140fc4f0075fa621f30d066/Screenshot_20240131_101755.png)
![Screenshot_20240131_101755-highlighted](/uploads/0ddee51ed8b92c812ebda9bcde1e6d3b/Screenshot_20240131_101755-highlighted.png)
2. Rerun
This button is displayed when an import has completed, but with errors. When this button is clicked, it will cycle through all the entries in the importer and rerun all the failed jobs.
Before clicking it, it is worth going into Sidekiq (for example https://rdms.cottagelabs.com/sidekiq/; you need to be logged in as an administrator to view this URL) and checking whether the job is still being processed there (in busy, enqueued or retries). If it's an error like `Failed to acquire lock`, we have noticed that this goes away on a retry.
If there are no jobs listed in Sidekiq, you can click on `Rerun` to rerun all the failed jobs.
![Screenshot_20240322_104759](/uploads/3eebde480ec32b8d22df1746f4639d49/Screenshot_20240322_104759.png)
### Points to note
When a new CRCDataset is imported with the same collection name as an existing CRC1280 collection (either previously imported or created by a user), it will create a new collection, rather than reuse the one previously created.

When a CRCDataset that was previously imported is imported again, it will create a new collection and a new experiment, rather than reuse the ones previously created.