When working with an external Cloud Storage connection (S3, GCS, Azure), there are a few things to bear in mind:
- Label Studio doesn’t import the data stored in the bucket, but instead creates references to the objects. Therefore, you keep full access control over the data to be synced and shown on the labeling screen.
- Sync with the bucket is one-way only: it either creates tasks from objects in the bucket (Source storage) or pushes annotations to the output bucket (Target storage). Changing something on the bucket side doesn’t guarantee consistent results.
- It is recommended to use a separate bucket folder for each Label Studio project.
When I click Sync, I don't see my data in the project
Go to the cloud storage settings page, click Edit on the cloud storage connection card, and check the following:
- File Filter Regex is set and correct. When no filters are specified, all found items are skipped. The filter should be a valid regular expression, not a wildcard (e.g. .* is valid, *. is not valid).
- Treat every bucket object as a source file should be ON if you work with images, audio, text files or any other binary content stored in the bucket. It instructs Label Studio to create URI endpoints, store them as the labeling task payload, and resolve them into presigned HTTPS URLs when the labeling screen is opened. If you store JSON tasks in the Label Studio format in your bucket, turn this toggle OFF.
- Sometimes the sync process doesn’t start immediately, because syncing relies on an internal job scheduler. Please wait. If nothing happens for a long period of time, contact us via the form and provide the time when you launched the “Sync” job.
- An easy way to check the rq workers is to run an export: go to the Data Manager, click Export, create a new snapshot, and download the JSON file. If you see an error, your rq workers most likely have problems. Another way to check the rq workers is to log in as a superuser and go to the /django-rq page. You should see a workers column. The workers values shouldn’t be 0, and the failed column should be empty (0).
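The difference between a regular expression and a wildcard in the File Filter Regex check above can be illustrated with a minimal sketch. It assumes the filter follows Python `re` semantics and is matched from the start of each object key. The helper names `is_valid_regex` and `filter_keys` and the sample keys are hypothetical and are not part of Label Studio.

```python
import re


def is_valid_regex(pattern: str) -> bool:
    """Return True if `pattern` compiles as a Python regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def filter_keys(pattern: str, keys: list[str]) -> list[str]:
    """Keep only the object keys that the regex matches from the start of the key."""
    compiled = re.compile(pattern)
    return [key for key in keys if compiled.match(key)]


# Hypothetical object keys, for illustration only.
keys = ["images/cat.jpg", "images/dog.png", "notes/readme.txt"]

print(is_valid_regex(".*"))  # True: "any character, repeated"
print(is_valid_regex("*."))  # False: "*" has nothing to repeat
print(filter_keys(r".*\.(jpg|png)$", keys))  # ['images/cat.jpg', 'images/dog.png']
```

A shell-style wildcard such as `*.jpg` would be written as `.*\.jpg` in this syntax.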
JSON files from cloud storage are not synced, and the Data Manager is empty
Diagnostic steps:
1. Try to enable “Treat every bucket object as a source file”. Do you see tasks in the Data Manager (DM)? If yes, go to (2).
2. Try to disable “Treat every bucket object as a source file”. If you don’t see tasks in the DM, your bucket most likely has no GET permission and grants LIST permission only.
Why does this happen? In (1), Label Studio only scans the bucket and doesn’t read the objects; it just needs to check that they exist. In (2), Label Studio reads the objects, because it has to load your JSON files into the Label Studio database.
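The same LIST versus GET distinction can be checked outside Label Studio. The following is a rough sketch, assuming an S3 bucket and a boto3-style client that exposes `list_objects_v2` and `get_object`. The function name `check_bucket_access` is hypothetical, and errors are caught broadly here only to keep the sketch free of extra imports (with boto3 they would normally be `botocore.exceptions.ClientError`).

```python
def check_bucket_access(client, bucket, prefix=""):
    """Report whether the credentials behind `client` can LIST and GET objects.

    `client` is assumed to behave like a boto3 S3 client.
    """
    result = {"list": False, "get": False}

    # Step 1: LIST. This mirrors scanning the bucket without reading objects.
    try:
        listing = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    except Exception:
        return result
    result["list"] = True

    contents = listing.get("Contents", [])
    if not contents:
        # Nothing under this prefix, so GET cannot be tested.
        return result

    # Step 2: GET. This mirrors reading JSON task files from the bucket.
    try:
        obj = client.get_object(Bucket=bucket, Key=contents[0]["Key"])
        obj["Body"].read(1)
        result["get"] = True
    except Exception:
        pass
    return result
```

With boto3 installed, this could be called as `check_bucket_access(boto3.client("s3"), "my-bucket", "my-project/")`, where the bucket name and prefix are placeholders. A result of `{"list": True, "get": False}` corresponds to the LIST-only situation described above.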
When I click Sync, I see my tasks in the Data Manager, but there is a CORS error inside the tasks
It’s a problem with permissions in your bucket. Check this section carefully: https://labelstud.io/guide/storage.html#Source-storage-permissions
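For an S3 bucket, a CORS rule set might look roughly like the sketch below. This is an illustration only: the helper name `build_cors_configuration`, the example origin, and the specific values are assumptions, and the linked section remains the authoritative reference for the required settings. The dictionary follows the structure that boto3's `put_bucket_cors` accepts as `CORSConfiguration`.

```python
def build_cors_configuration(allowed_origins):
    """Build an S3 CORS configuration that allows browser GET requests.

    `allowed_origins` should list the origins that serve your Label Studio UI.
    """
    return {
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET"],  # the browser only needs to read objects
                "AllowedOrigins": list(allowed_origins),
                "MaxAgeSeconds": 3600,  # assumed cache time for preflight responses
            }
        ]
    }


# Hypothetical Label Studio host, for illustration only.
config = build_cors_configuration(["https://label-studio.example.com"])
print(config["CORSRules"][0]["AllowedOrigins"])
```

With boto3 installed, the result could be applied with `boto3.client("s3").put_bucket_cors(Bucket="my-bucket", CORSConfiguration=config)`, where the bucket name is a placeholder.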