OCR Index & Processing : EBundle Support

Summary

All uploaded documents should be text searchable so that the platform's full functionality can be used. To make a document text searchable, it must undergo an Optical Character Recognition ('OCR') process, which converts the image of text on each page into searchable OCR data.

Examples of functionality on the platform that requires OCR data to be applied (non-exhaustive):
- searching through the text of documents
- affixing notes and highlighting to text
- copying text from a document

As this is a formatting requirement, Opus will assume that all documents provided have been OCR'd. The Solution Operations team will carry out a spot check of a few documents during the initial upload to try to detect any widespread issues. Because this is only a spot check, it will not pick up every OCR issue; for a fuller picture, Opus can produce an OCR index once documents have been uploaded.

We've set out below how the OCR index can be produced and how Opus can run an OCR process.

Producing the Index

Upon request, we are able to generate an index of the documents uploaded to the platform, which confirms the status of the OCR data for each document.

It is important to understand that a True or Partial result only means that some OCR data is present. We are unable to confirm the accuracy of that data or that the full contents of each page are searchable. The OCR data could cover anything from a single stamp to the entire page, and it could be completely inaccurate or perfectly searchable.

True - There is OCR data on all pages.
Partial - There is OCR data on some pages and no OCR data on others.
False - There is no OCR data across any of the pages.
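
The three statuses above can be sketched in code. This is a minimal illustration only, not Opus's implementation: it assumes a hypothetical list of per-page flags, where each flag records whether any OCR data at all was found on that page, and the function name `ocr_status` is invented for the example.

```python
from typing import Sequence


def ocr_status(pages_have_ocr: Sequence[bool]) -> str:
    """Classify a document as "True", "Partial" or "False".

    pages_have_ocr is a hypothetical input: one flag per page, set when
    *some* OCR data exists on that page. A flag says nothing about how
    accurate or complete that data is.
    """
    if not pages_have_ocr:
        raise ValueError("document has no pages")
    if all(pages_have_ocr):
        return "True"      # OCR data on all pages
    if any(pages_have_ocr):
        return "Partial"   # OCR data on some pages, none on others
    return "False"         # no OCR data on any page


print(ocr_status([True, True, True]))
print(ocr_status([True, False, True]))
print(ocr_status([False, False]))
```

Note that a page carrying only a stamp's worth of OCR data would still count as a page with OCR data here, which is why a True result does not guarantee that the document is fully searchable.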

The creation of the index will be charged at our content management rates. The index also does not account for documents that have been replaced at any stage: the status reflects the original document uploaded to the platform, not the replacement file that now appears, and we are unable to confirm whether any files have been replaced.

Opus OCR Process

Parties are able to OCR the documents themselves and provide updated versions to Opus for upload or replacement.

Alternatively, Opus are able to assist with running an OCR process (subject to capacity) if the parties are unable to OCR documents themselves, on the following basis:

OCR Issue Detection - We are unable to automatically detect which documents have not been OCR'd before upload. We would therefore need either to be instructed which documents require the OCR process, e.g. from an OCR index (keeping in mind the limitations of the index results), or to proceed with OCRing all of the documents in any relevant bundles, at the price of £0.05 per page.
Accuracy of OCR Data - When the OCR software is run across the document set, it will apply OCR data where possible. The accuracy of this data will ultimately depend on the format of the PDF file itself and the quality of the image on each page, so the software may be unable to apply OCR data to certain sections of a document. Similarly, the software may attempt to rotate pages to the correct orientation based on the text, but the accuracy of this process depends on the pages themselves and will not be checked by Opus 2.
OCR Language - We are only able to OCR in English. If any documents contain text in a different language, the software will try to OCR the text as English, which may result in inaccurate letters / numbers being assigned, reducing the functionality of this OCR data. Additionally, any pre-existing non-English OCR data may be removed and replaced with the English data as part of the OCR process.
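
As a rough illustration of the two options under OCR Issue Detection, the sketch below picks documents from an OCR index and estimates the per-page charge. The index layout (`document`, `status`, `pages`) and the function names are hypothetical. It also assumes that the £0.05 per-page price applies to every page of each document processed, and it excludes index creation, export and replacement charges, which are billed separately.

```python
from decimal import Decimal

# Per-page OCR price quoted in this article (GBP).
PRICE_PER_PAGE_GBP = Decimal("0.05")


def select_for_ocr(index_rows):
    """Return index rows reported as False or Partial.

    True rows are skipped, although a True result does not guarantee
    that the existing OCR data is accurate or complete.
    """
    return [row for row in index_rows if row["status"] in ("False", "Partial")]


def estimate_ocr_cost(rows):
    """Estimate the per-page charge, counting every page of each row."""
    total_pages = sum(row["pages"] for row in rows)
    return total_pages * PRICE_PER_PAGE_GBP


# Hypothetical index extract, for illustration only.
index_rows = [
    {"document": "A.pdf", "status": "True", "pages": 120},
    {"document": "B.pdf", "status": "Partial", "pages": 40},
    {"document": "C.pdf", "status": "False", "pages": 10},
]

selected = select_for_ocr(index_rows)
print("Targeted OCR:", [row["document"] for row in selected], estimate_ocr_cost(selected))
print("OCR everything:", estimate_ocr_cost(index_rows))
```

Comparing the two figures gives a sense of whether requesting an index first is worthwhile, bearing in mind that the index itself is chargeable.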

If the documents already exist on the platform, we would need to export them from the platform, OCR them and replace them, with each action being chargeable at standard rates.