If the offsets for exported HTML labels don’t match your expected output, such as with HTML named entity recognition (NER) tasks, the most common cause is HTML minification.
When you upload HTML files to Label Studio for labeling, the HTML is minified to remove whitespace. When you annotate those tasks, the offsets for the labels apply to the minified version of the HTML, rather than the original unmodified HTML files.
To prevent the HTML files from being minified, you can use a different import method. See Import HTML data.
If you want to correct existing annotations, you can minify your source HTML files in the same way that Label Studio does. The minification is performed with the following script:
import htmlmin

with open("sample.html", "r") as f:
    html_doc = f.read()
# Minify the HTML, removing whitespace as Label Studio does on upload
minified_html_doc = htmlmin.minify(html_doc, remove_all_empty_space=True)
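To see why the same offsets point at different text before and after minification, the following is a minimal sketch rather than Label Studio's own logic. It assumes plain character offsets into the document string, which may not match how your export encodes positions. The helper `span_matches` and both sample strings are hypothetical, and the `collapsed` string is hand-written for illustration, not actual `htmlmin` output.

```python
def span_matches(doc: str, start: int, end: int, expected: str) -> bool:
    """Return True if doc[start:end] is exactly the labeled text."""
    return doc[start:end] == expected


# Hand-written strings for illustration only; `collapsed` is NOT real htmlmin output.
original = "<p>\n    Visit   <b>Paris</b>\n</p>"
collapsed = "<p>Visit <b>Paris</b></p>"

label_text = "Paris"

# Offsets as they would be recorded against the whitespace-collapsed text.
start = collapsed.index(label_text)
end = start + len(label_text)

# The offsets select the label in the collapsed text but not in the original.
matches_collapsed = span_matches(collapsed, start, end, label_text)
matches_original = span_matches(original, start, end, label_text)

# Number of characters by which the label moved once whitespace was removed.
shift = original.index(label_text) - start

print(matches_collapsed, matches_original, shift)
```

Running the same check against your own source file and its minified version, with offsets taken from an exported annotation, can help confirm whether minification accounts for a mismatch.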
If minification does not appear to explain the offset mismatch, complex CSS or other factors could be the cause.