repos: 508461227

id	node_id	name	full_name	private	owner	html_url	description	fork	created_at	updated_at	pushed_at	homepage	size	stargazers_count	watchers_count	language	has_issues	has_projects	has_downloads	has_wiki	has_pages	forks_count	archived	disabled	open_issues_count	license	topics	forks	open_issues	watchers	default_branch	permissions	temp_clone_token	organization	network_count	subscribers_count	readme	readme_html	allow_forking	visibility	is_template	template_repository	web_commit_signoff_required	has_discussions
508461227	R_kgDOHk6Aqw	s3-ocr	simonw/s3-ocr	0	9599	https://github.com/simonw/s3-ocr	Tools for running OCR against files stored in S3	0	2022-06-28T21:33:09Z	2022-08-10T21:24:45Z	2022-08-10T04:43:17Z		41	63	63	Python	1	1	1	1	0	3	0	0	7	apache-2.0	["ocr", "s3", "textract"]	3	7	63	main	{"admin": false, "maintain": false, "push": false, "triage": false, "pull": false}			3	2	# s3-ocr [![PyPI](https://img.shields.io/pypi/v/s3-ocr.svg)](https://pypi.org/project/s3-ocr/) [![Changelog](https://img.shields.io/github/v/release/simonw/s3-ocr?include_prereleases&label=changelog)](https://github.com/simonw/s3-ocr/releases) [![Tests](https://github.com/simonw/s3-ocr/workflows/Test/badge.svg)](https://github.com/simonw/s3-ocr/actions?query=workflow%3ATest) [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/s3-ocr/blob/master/LICENSE) Tools for running OCR against files stored in S3 Background on this project: [s3-ocr: Extract text from PDF files stored in an S3 bucket](https://simonwillison.net/2022/Jun/30/s3-ocr/) ## Installation Install this tool using `pip`: pip install s3-ocr ## Demo You can see the results of running this tool against three PDFs from the Internet Archive ([one](https://archive.org/details/unmaskingrobert00houdgoog), [two](https://archive.org/details/practicalmagicia00harr), [three](https://archive.org/details/latestmagicbeing00hoff)) in [this example table](https://s3-ocr-demo.datasette.io/pages/pages?_facet=path#facet-path) hosted using [Datasette](https://datasette.io/). ## Starting OCR against PDFs in a bucket The `start` command takes a list of keys and submits them to [Textract](https://aws.amazon.com/textract/) for OCR processing. You need to have AWS configured using environment variables, credentials file in your home directory or a JSON or INI file generated using [s3-credentials](https://datasette.io/tools/s3-credentials). 
You can start the process running like this: s3-ocr start name-of-your-bucket my-pdf-file.pdf The paths you specify should be paths within the bucket. If you stored your PDF files in folders inside the bucket it should look like this: s3-ocr start name-of-your-bucket path/to/one.pdf path/to/two.pdf OCR can take some time. The results of the OCR will be stored in `textract-output` in your bucket. To process every file in the bucket with a `.pdf` extension use `--all`: s3-ocr start name-of-bucket --all To process every file with a `.pdf` extension within a specific folder, use `--prefix`: s3-ocr start name-of-bucket --prefix path/to/folder ### s3-ocr start --help <!-- [[[cog import cog from s3_ocr import cli from click.testing import CliRunner runner = CliRunner() result = runner.invoke(cli.cli, ["start", "--help"]) help = result.output.replace("Usage: cli", "Usage: s3-ocr") cog.out( "```\n{}\n```".format(help) ) ]]] --> ``` Usage: s3-ocr start [OPTIONS] BUCKET [KEYS]... Start OCR tasks for PDF files in an S3 bucket s3-ocr start name-of-bucket path/to/one.pdf path/to/two.pdf To process every file with a .pdf extension: s3-ocr start name-of-bucket --all To process every .pdf in the PUBLIC/ folder: s3-ocr start name-of-bucket --prefix PUBLIC/ Options: --all Process all PDF files in the bucket --prefix TEXT Process all PDF files within this prefix --dry-run Show what this would do, but don't actually do it --no-retry Don't retry failed requests --access-key TEXT AWS access key ID --secret-key TEXT AWS secret access key --session-token TEXT AWS session token --endpoint-url TEXT Custom endpoint URL -a, --auth FILENAME Path to JSON/INI file containing credentials --help Show this message and exit. 
``` <!-- [[[end]]] --> ## Checking status The `s3-ocr status <bucket-name>` command shows a rough indication of progress through the tasks: ``` % s3-ocr status sfms-history 153 complete out of 532 jobs ``` It compares the jobs that have been submitted, based on `.s3-ocr.json` files, to the jobs that have their results written to the `textract-output/` folder. ### s3-ocr status --help <!-- [[[cog result = runner.invoke(cli.cli, ["status", "--help"]) help = result.output.replace("Usage: cli", "Usage: s3-ocr") cog.out( "```\n{}\n```".format(help.split("--access-key")[0] + "--access-key ...") ) ]]] --> ``` Usage: s3-ocr status [OPTIONS] BUCKET Show status of OCR jobs for a bucket Options: --access-key ... ``` <!-- [[[end]]] --> ## Inspecting a job The `s3-ocr inspect-job <job_id>` command can be used to check the status of a specific job ID: ``` % s3-ocr inspect-job b267282745685226339b7e0d4366c4ff6887b7e293ed4b304dc8bb8b991c7864 { "DocumentMetadata": { "Pages": 583 }, "JobStatus": "SUCCEEDED", "DetectDocumentTextModelVersion": "1.0" } ``` ### s3-ocr inspect-job --help <!-- [[[cog result = runner.invoke(cli.cli, ["inspect-job", "--help"]) help = result.output.replace("Usage: cli", "Usage: s3-ocr") cog.out( "```\n{}\n```".format(help.split("--access-key")[0] + "--access-key ...") ) ]]] --> ``` Usage: s3-ocr inspect-job [OPTIONS] JOB_ID Show the current status of an OCR job s3-ocr inspect-job <job_id> Options: --access-key ... ``` <!-- [[[end]]] --> ## Fetching the results Once an OCR job has completed you can download the resulting JSON using the `fetch` command: s3-ocr fetch name-of-bucket path/to/file.pdf This will save files in the current directory with names like this: - `4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-1.json` - `4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-2.json` The number of files will vary depending on the length of the document. 
If you don't want separate files you can combine them together using the `-c/--combine` option: s3-ocr fetch name-of-bucket path/to/file.pdf --combine output.json The `output.json` file will then contain data that looks something like this: ``` { "Blocks": [ { "BlockType": "PAGE", "Geometry": {...} "Page": 1, ... }, { "BlockType": "LINE", "Page": 1, ... "Text": "Barry", }, ``` ### s3-ocr fetch --help <!-- [[[cog result = runner.invoke(cli.cli, ["fetch", "--help"]) help = result.output.replace("Usage: cli", "Usage: s3-ocr") cog.out( "```\n{}\n```".format(help.split("--access-key")[0] + "--access-key ...") ) ]]] --> ``` Usage: s3-ocr fetch [OPTIONS] BUCKET KEY Fetch the OCR results for a specified file s3-ocr fetch name-of-bucket path/to/key.pdf This will save files in the current directory called things like a806e67e504fc15f...48314e-1.json a806e67e504fc15f...48314e-2.json To combine these together into a single JSON file with a specified name, use: s3-ocr fetch name-of-bucket path/to/key.pdf --combine output.json Use "--output -" to print the combined JSON to standard output instead. Options: -c, --combine FILENAME Write combined JSON to file --access-key ... ``` <!-- [[[end]]] --> ## Fetching just the text of a page If you don't want to deal with the JSON directly, you can use the `text` command to retrieve just the text extracted from a PDF: s3-ocr text name-of-bucket path/to/file.pdf This will output plain text to standard output. To save that to a file, use this: s3-ocr text name-of-bucket path/to/file.pdf > text.txt Separate pages will be separated by three newlines. 
To separate them using a `----` horizontal divider instead add `--divider`: s3-ocr text name-of-bucket path/to/file.pdf --divider ### s3-ocr text --help <!-- [[[cog result = runner.invoke(cli.cli, ["text", "--help"]) help = result.output.replace("Usage: cli", "Usage: s3-ocr") cog.out( "```\n{}\n```".format(help.split("--access-key")[0] + "--access-key ...") ) ]]] --> ``` Usage: s3-ocr text [OPTIONS] BUCKET KEY Retrieve the text from an OCRd PDF file s3-ocr text name-of-bucket path/to/key.pdf Options: --divider Add ---- between pages --access-key ... ``` <!-- [[[end]]] --> ## Avoiding processing duplicates If you move files around within your S3 bucket `s3-ocr` can lose track of which files have already been processed. This can lead to additional Textract charges for processing should you run `s3-ocr start` against those new files. The `s3-ocr dedupe` command addresses this by scanning your bucket for files that have a new name but have previously been processed. It does this by looking at the `ETag` for each file, which represents the MD5 hash of the file contents. The command will write out new `.s3-ocr.json` files for each detected duplicate. This stops those duplicates from being run through OCR a second time should you run `s3-ocr start`. s3-ocr dedupe name-of-bucket Add `--dry-run` for a preview of the changes that will be made to your bucket. ### s3-ocr dedupe --help <!-- [[[cog result = runner.invoke(cli.cli, ["dedupe", "--help"]) help = result.output.replace("Usage: cli", "Usage: s3-ocr") cog.out( "```\n{}\n```".format(help.split("--access-key")[0] + "--access-key ...") ) ]]] --> ``` Usage: s3-ocr dedupe [OPTIONS] BUCKET Scan every file in the bucket checking for duplicates - files that have not yet been OCRd but that have the same contents (based on ETag) as a file that HAS been OCRd. s3-ocr dedupe name-of-bucket Options: --dry-run Show output without writing anything to S3 --access-key ... 
``` <!-- [[[end]]] --> ## Changes made to your bucket To keep track of which files have been submitted for processing, `s3-ocr` will create a JSON file for every file that it adds to the OCR queue. This file will be called: path-to-file/name-of-file.pdf.s3-ocr.json Each of these JSON files contains data that looks like this: ```json { "job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe", "etag": "\"b0c77472e15500347ebf46032a454e8e\"" } ``` The recorded `job_id` can be used later to associate the file with the results of the OCR task in `textract-output/`. The `etag` is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it. This design for the tool, with the `.s3-ocr.json` files tracking jobs that have been submitted, means that it is safe to run `s3-ocr start` against the same bucket multiple times without the risk of starting duplicate OCR jobs. ## Creating a SQLite index of your OCR results The `s3-ocr index <bucket> <database_file>` command creates a SQLite database containing the results of the OCR, and configures SQLite full-text search against the text: ``` % s3-ocr index sfms-history index.db Fetching job details [####################################] 100% Populating pages table [####################----------------] 55% 00:03:18 ``` The schema of the resulting database looks like this (excluding the FTS tables): ```sql CREATE TABLE [pages] ( [path] TEXT, [page] INTEGER, [folder] TEXT, [text] TEXT, PRIMARY KEY ([path], [page]) ); CREATE TABLE [ocr_jobs] ( [key] TEXT PRIMARY KEY, [job_id] TEXT, [etag] TEXT, [s3_ocr_etag] TEXT ); CREATE TABLE [fetched_jobs] ( [job_id] TEXT PRIMARY KEY ); ``` The database is designed to be used with [Datasette](https://datasette.io). 
### s3-ocr index --help <!-- [[[cog result = runner.invoke(cli.cli, ["index", "--help"]) help = result.output.replace("Usage: cli", "Usage: s3-ocr") cog.out( "```\n{}\n```".format(help.split("--access-key")[0] + "--access-key ...") ) ]]] --> ``` Usage: s3-ocr index [OPTIONS] BUCKET DATABASE Create a SQLite database with OCR results for files in a bucket Options: --access-key ... ``` <!-- [[[end]]] --> ## Development To contribute to this tool, first checkout the code. Then create a new virtual environment: cd s3-ocr python -m venv venv source venv/bin/activate Now install the dependencies and test dependencies: pip install -e '.[test]' To run the tests: pytest To regenerate the README file with the latest `--help`: cog -r README.md	<div id="readme" class="md" data-path="README.md"><article class="markdown-body entry-content container-lg" itemprop="text"><h1 dir="auto"><a id="user-content-s3-ocr" class="anchor" href="#user-content-s3-ocr" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>s3-ocr</h1> <p dir="auto"><a href="https://pypi.org/project/s3-ocr/" rel="nofollow"><img src="https://camo.githubusercontent.com/697cb78ed5f8b2955201be9925084bf6c5603a1cebc448d174f5684ae1453d68/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f73332d6f63722e737667" alt="PyPI" data-canonical-src="https://img.shields.io/pypi/v/s3-ocr.svg" style="max-width: 100%;"></a> <a href="https://github.com/simonw/s3-ocr/releases"><img 
src="https://camo.githubusercontent.com/a5fa08d6edb96e5b9f55b10efcb2820fdb2168be4a766b9a38a87da938251c66/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f73696d6f6e772f73332d6f63723f696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67" alt="Changelog" data-canonical-src="https://img.shields.io/github/v/release/simonw/s3-ocr?include_prereleases&label=changelog" style="max-width: 100%;"></a> <a href="https://github.com/simonw/s3-ocr/actions?query=workflow%3ATest"><img src="https://github.com/simonw/s3-ocr/workflows/Test/badge.svg" alt="Tests" style="max-width: 100%;"></a> <a href="https://github.com/simonw/s3-ocr/blob/master/LICENSE"><img src="https://camo.githubusercontent.com/1698104e976c681143eb0841f9675c6f802bb7aa832afc0c7a4e719b1f3cf955/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d417061636865253230322e302d626c75652e737667" alt="License" data-canonical-src="https://img.shields.io/badge/license-Apache%202.0-blue.svg" style="max-width: 100%;"></a></p> <p dir="auto">Tools for running OCR against files stored in S3</p> <p dir="auto">Background on this project: <a href="https://simonwillison.net/2022/Jun/30/s3-ocr/" rel="nofollow">s3-ocr: Extract text from PDF files stored in an S3 bucket</a></p> <h2 dir="auto"><a id="user-content-installation" class="anchor" href="#user-content-installation" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Installation</h2> <p dir="auto">Install this tool using <code>pip</code>:</p> 
<div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="pip install s3-ocr"><pre class="notranslate"><code>pip install s3-ocr </code></pre></div> <h2 dir="auto"><a id="user-content-demo" class="anchor" href="#user-content-demo" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Demo</h2> <p dir="auto">You can see the results of running this tool against three PDFs from the Internet Archive (<a href="https://archive.org/details/unmaskingrobert00houdgoog" rel="nofollow">one</a>, <a href="https://archive.org/details/practicalmagicia00harr" rel="nofollow">two</a>, <a href="https://archive.org/details/latestmagicbeing00hoff" rel="nofollow">three</a>) in <a href="https://s3-ocr-demo.datasette.io/pages/pages?_facet=path#facet-path" rel="nofollow">this example table</a> hosted using <a href="https://datasette.io/" rel="nofollow">Datasette</a>.</p> <h2 dir="auto"><a id="user-content-starting-ocr-against-pdfs-in-a-bucket" class="anchor" href="#user-content-starting-ocr-against-pdfs-in-a-bucket" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 
4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Starting OCR against PDFs in a bucket</h2> <p dir="auto">The <code>start</code> command takes a list of keys and submits them to <a href="https://aws.amazon.com/textract/" rel="nofollow">Textract</a> for OCR processing.</p> <p dir="auto">You need to have AWS configured using environment variables, credentials file in your home directory or a JSON or INI file generated using <a href="https://datasette.io/tools/s3-credentials" rel="nofollow">s3-credentials</a>.</p> <p dir="auto">You can start the process running like this:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr start name-of-your-bucket my-pdf-file.pdf"><pre class="notranslate"><code>s3-ocr start name-of-your-bucket my-pdf-file.pdf </code></pre></div> <p dir="auto">The paths you specify should be paths within the bucket. If you stored your PDF files in folders inside the bucket it should look like this:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr start name-of-your-bucket path/to/one.pdf path/to/two.pdf"><pre class="notranslate"><code>s3-ocr start name-of-your-bucket path/to/one.pdf path/to/two.pdf </code></pre></div> <p dir="auto">OCR can take some time. 
The results of the OCR will be stored in <code>textract-output</code> in your bucket.</p> <p dir="auto">To process every file in the bucket with a <code>.pdf</code> extension use <code>--all</code>:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr start name-of-bucket --all"><pre class="notranslate"><code>s3-ocr start name-of-bucket --all </code></pre></div> <p dir="auto">To process every file with a <code>.pdf</code> extension within a specific folder, use <code>--prefix</code>:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr start name-of-bucket --prefix path/to/folder"><pre class="notranslate"><code>s3-ocr start name-of-bucket --prefix path/to/folder </code></pre></div> <h3 dir="auto"><a id="user-content-s3-ocr-start---help" class="anchor" href="#user-content-s3-ocr-start---help" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>s3-ocr start --help</h3> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Usage: s3-ocr start [OPTIONS] BUCKET [KEYS]... 
Start OCR tasks for PDF files in an S3 bucket s3-ocr start name-of-bucket path/to/one.pdf path/to/two.pdf To process every file with a .pdf extension: s3-ocr start name-of-bucket --all To process every .pdf in the PUBLIC/ folder: s3-ocr start name-of-bucket --prefix PUBLIC/ Options: --all Process all PDF files in the bucket --prefix TEXT Process all PDF files within this prefix --dry-run Show what this would do, but don't actually do it --no-retry Don't retry failed requests --access-key TEXT AWS access key ID --secret-key TEXT AWS secret access key --session-token TEXT AWS session token --endpoint-url TEXT Custom endpoint URL -a, --auth FILENAME Path to JSON/INI file containing credentials --help Show this message and exit. "><pre class="notranslate"><code>Usage: s3-ocr start [OPTIONS] BUCKET [KEYS]... Start OCR tasks for PDF files in an S3 bucket s3-ocr start name-of-bucket path/to/one.pdf path/to/two.pdf To process every file with a .pdf extension: s3-ocr start name-of-bucket --all To process every .pdf in the PUBLIC/ folder: s3-ocr start name-of-bucket --prefix PUBLIC/ Options: --all Process all PDF files in the bucket --prefix TEXT Process all PDF files within this prefix --dry-run Show what this would do, but don't actually do it --no-retry Don't retry failed requests --access-key TEXT AWS access key ID --secret-key TEXT AWS secret access key --session-token TEXT AWS session token --endpoint-url TEXT Custom endpoint URL -a, --auth FILENAME Path to JSON/INI file containing credentials --help Show this message and exit. 
</code></pre></div> <h2 dir="auto"><a id="user-content-checking-status" class="anchor" href="#user-content-checking-status" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Checking status</h2> <p dir="auto">The <code>s3-ocr status <bucket-name></code> command shows a rough indication of progress through the tasks:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="% s3-ocr status sfms-history 153 complete out of 532 jobs"><pre class="notranslate"><code>% s3-ocr status sfms-history 153 complete out of 532 jobs </code></pre></div> <p dir="auto">It compares the jobs that have been submitted, based on <code>.s3-ocr.json</code> files, to the jobs that have their results written to the <code>textract-output/</code> folder.</p> <h3 dir="auto"><a id="user-content-s3-ocr-status---help" class="anchor" href="#user-content-s3-ocr-status---help" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>s3-ocr status --help</h3> <div class="snippet-clipboard-content 
notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Usage: s3-ocr status [OPTIONS] BUCKET Show status of OCR jobs for a bucket Options: --access-key ..."><pre class="notranslate"><code>Usage: s3-ocr status [OPTIONS] BUCKET Show status of OCR jobs for a bucket Options: --access-key ... </code></pre></div> <h2 dir="auto"><a id="user-content-inspecting-a-job" class="anchor" href="#user-content-inspecting-a-job" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Inspecting a job</h2> <p dir="auto">The <code>s3-ocr inspect-job <job_id></code> command can be used to check the status of a specific job ID:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="% s3-ocr inspect-job b267282745685226339b7e0d4366c4ff6887b7e293ed4b304dc8bb8b991c7864 { "DocumentMetadata": { "Pages": 583 }, "JobStatus": "SUCCEEDED", "DetectDocumentTextModelVersion": "1.0" }"><pre class="notranslate"><code>% s3-ocr inspect-job b267282745685226339b7e0d4366c4ff6887b7e293ed4b304dc8bb8b991c7864 { "DocumentMetadata": { "Pages": 583 }, "JobStatus": "SUCCEEDED", "DetectDocumentTextModelVersion": "1.0" } </code></pre></div> <h3 dir="auto"><a id="user-content-s3-ocr-inspect-job---help" class="anchor" href="#user-content-s3-ocr-inspect-job---help" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 
1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>s3-ocr inspect-job --help</h3> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Usage: s3-ocr inspect-job [OPTIONS] JOB_ID Show the current status of an OCR job s3-ocr inspect-job <job_id> Options: --access-key ..."><pre class="notranslate"><code>Usage: s3-ocr inspect-job [OPTIONS] JOB_ID Show the current status of an OCR job s3-ocr inspect-job <job_id> Options: --access-key ... </code></pre></div> <h2 dir="auto"><a id="user-content-fetching-the-results" class="anchor" href="#user-content-fetching-the-results" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Fetching the results</h2> <p dir="auto">Once an OCR job has completed you can download the resulting JSON using the <code>fetch</code> command:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr fetch name-of-bucket path/to/file.pdf"><pre class="notranslate"><code>s3-ocr fetch name-of-bucket path/to/file.pdf </code></pre></div> <p dir="auto">This will save files in the current directory with names like this:</p> <ul dir="auto"> 
<li><code>4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-1.json</code></li> <li><code>4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-2.json</code></li> </ul> <p dir="auto">The number of files will vary depending on the length of the document.</p> <p dir="auto">If you don't want separate files you can combine them together using the <code>-c/--combine</code> option:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr fetch name-of-bucket path/to/file.pdf --combine output.json"><pre class="notranslate"><code>s3-ocr fetch name-of-bucket path/to/file.pdf --combine output.json </code></pre></div> <p dir="auto">The <code>output.json</code> file will then contain data that looks something like this:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="{ "Blocks": [ { "BlockType": "PAGE", "Geometry": {...} "Page": 1, ... }, { "BlockType": "LINE", "Page": 1, ... "Text": "Barry", },"><pre class="notranslate"><code>{ "Blocks": [ { "BlockType": "PAGE", "Geometry": {...} "Page": 1, ... }, { "BlockType": "LINE", "Page": 1, ... 
"Text": "Barry", }, </code></pre></div> <h3 dir="auto"><a id="user-content-s3-ocr-fetch---help" class="anchor" href="#user-content-s3-ocr-fetch---help" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>s3-ocr fetch --help</h3> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Usage: s3-ocr fetch [OPTIONS] BUCKET KEY Fetch the OCR results for a specified file s3-ocr fetch name-of-bucket path/to/key.pdf This will save files in the current directory called things like a806e67e504fc15f...48314e-1.json a806e67e504fc15f...48314e-2.json To combine these together into a single JSON file with a specified name, use: s3-ocr fetch name-of-bucket path/to/key.pdf --combine output.json Use "--output -" to print the combined JSON to standard output instead. Options: -c, --combine FILENAME Write combined JSON to file --access-key ..."><pre class="notranslate"><code>Usage: s3-ocr fetch [OPTIONS] BUCKET KEY Fetch the OCR results for a specified file s3-ocr fetch name-of-bucket path/to/key.pdf This will save files in the current directory called things like a806e67e504fc15f...48314e-1.json a806e67e504fc15f...48314e-2.json To combine these together into a single JSON file with a specified name, use: s3-ocr fetch name-of-bucket path/to/key.pdf --combine output.json Use "--output -" to print the combined JSON to standard output instead. Options: -c, --combine FILENAME Write combined JSON to file --access-key ... 
</code></pre></div> <h2 dir="auto"><a id="user-content-fetching-just-the-text-of-a-page" class="anchor" href="#user-content-fetching-just-the-text-of-a-page" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Fetching just the text of a page</h2> <p dir="auto">If you don't want to deal with the JSON directly, you can use the <code>text</code> command to retrieve just the text extracted from a PDF:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr text name-of-bucket path/to/file.pdf"><pre class="notranslate"><code>s3-ocr text name-of-bucket path/to/file.pdf </code></pre></div> <p dir="auto">This will output plain text to standard output.</p> <p dir="auto">To save that to a file, use this:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr text name-of-bucket path/to/file.pdf > text.txt"><pre class="notranslate"><code>s3-ocr text name-of-bucket path/to/file.pdf > text.txt </code></pre></div> <p dir="auto">Separate pages will be separated by three newlines. 
To separate them using a <code>----</code> horizontal divider instead add <code>--divider</code>:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr text name-of-bucket path/to/file.pdf --divider"><pre class="notranslate"><code>s3-ocr text name-of-bucket path/to/file.pdf --divider </code></pre></div> <h3 dir="auto"><a id="user-content-s3-ocr-text---help" class="anchor" href="#user-content-s3-ocr-text---help" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>s3-ocr text --help</h3> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Usage: s3-ocr text [OPTIONS] BUCKET KEY Retrieve the text from an OCRd PDF file s3-ocr text name-of-bucket path/to/key.pdf Options: --divider Add ---- between pages --access-key ..."><pre class="notranslate"><code>Usage: s3-ocr text [OPTIONS] BUCKET KEY Retrieve the text from an OCRd PDF file s3-ocr text name-of-bucket path/to/key.pdf Options: --divider Add ---- between pages --access-key ... 
</code></pre></div> <h2 dir="auto"><a id="user-content-avoiding-processing-duplicates" class="anchor" href="#user-content-avoiding-processing-duplicates" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Avoiding processing duplicates</h2> <p dir="auto">If you move files around within your S3 bucket, <code>s3-ocr</code> can lose track of which files have already been processed. This can lead to additional Textract charges for processing should you run <code>s3-ocr start</code> against those new files.</p> <p dir="auto">The <code>s3-ocr dedupe</code> command addresses this by scanning your bucket for files that have a new name but have previously been processed. It does this by looking at the <code>ETag</code> for each file, which usually represents the MD5 hash of the file contents (objects created by multipart upload are an exception).</p> <p dir="auto">The command will write out new <code>.s3-ocr.json</code> files for each detected duplicate. 
This will prevent those duplicates from being run through OCR a second time should you run <code>s3-ocr start</code>.</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="s3-ocr dedupe name-of-bucket"><pre class="notranslate"><code>s3-ocr dedupe name-of-bucket </code></pre></div> <p dir="auto">Add <code>--dry-run</code> for a preview of the changes that will be made to your bucket.</p> <h3 dir="auto"><a id="user-content-s3-ocr-dedupe---help" class="anchor" href="#user-content-s3-ocr-dedupe---help" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>s3-ocr dedupe --help</h3> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Usage: s3-ocr dedupe [OPTIONS] BUCKET Scan every file in the bucket checking for duplicates - files that have not yet been OCRd but that have the same contents (based on ETag) as a file that HAS been OCRd. s3-ocr dedupe name-of-bucket Options: --dry-run Show output without writing anything to S3 --access-key ..."><pre class="notranslate"><code>Usage: s3-ocr dedupe [OPTIONS] BUCKET Scan every file in the bucket checking for duplicates - files that have not yet been OCRd but that have the same contents (based on ETag) as a file that HAS been OCRd. s3-ocr dedupe name-of-bucket Options: --dry-run Show output without writing anything to S3 --access-key ... 
</code></pre></div> <h2 dir="auto"><a id="user-content-changes-made-to-your-bucket" class="anchor" href="#user-content-changes-made-to-your-bucket" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Changes made to your bucket</h2> <p dir="auto">To keep track of which files have been submitted for processing, <code>s3-ocr</code> will create a JSON file for every file that it adds to the OCR queue.</p> <p dir="auto">This file will be called:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="path-to-file/name-of-file.pdf.s3-ocr.json"><pre class="notranslate"><code>path-to-file/name-of-file.pdf.s3-ocr.json </code></pre></div> <p dir="auto">Each of these JSON files contains data that looks like this:</p> <div class="highlight highlight-source-json notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="{ "job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe", "etag": "\"b0c77472e15500347ebf46032a454e8e\"" }"><pre>{ <span class="pl-ent">"job_id"</span>: <span class="pl-s"><span class="pl-pds">"</span>a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe<span class="pl-pds">"</span></span>, <span class="pl-ent">"etag"</span>: <span class="pl-s"><span class="pl-pds">"</span><span class="pl-cce">\"</span>b0c77472e15500347ebf46032a454e8e<span class="pl-cce">\"</span><span class="pl-pds">"</span></span> }</pre></div> <p dir="auto">The recorded <code>job_id</code> can be used 
later to associate the file with the results of the OCR task in <code>textract-output/</code>.</p> <p dir="auto">The <code>etag</code> is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it.</p> <p dir="auto">This design for the tool, with the <code>.s3-ocr.json</code> files tracking jobs that have been submitted, means that it is safe to run <code>s3-ocr start</code> against the same bucket multiple times without the risk of starting duplicate OCR jobs.</p> <h2 dir="auto"><a id="user-content-creating-a-sqlite-index-of-your-ocr-results" class="anchor" href="#user-content-creating-a-sqlite-index-of-your-ocr-results" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Creating a SQLite index of your OCR results</h2> <p dir="auto">The <code>s3-ocr index <bucket> <database_file></code> command creates a SQLite database containing the results of the OCR, and configures SQLite full-text search against the text:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="% s3-ocr index sfms-history index.db Fetching job details [####################################] 100% Populating pages table [####################----------------] 55% 00:03:18"><pre class="notranslate"><code>% s3-ocr index sfms-history index.db Fetching job details [####################################] 100% Populating pages table [####################----------------] 55% 
00:03:18 </code></pre></div> <p dir="auto">The schema of the resulting database looks like this (excluding the FTS tables):</p> <div class="highlight highlight-source-sql notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="CREATE TABLE [pages] ( [path] TEXT, [page] INTEGER, [folder] TEXT, [text] TEXT, PRIMARY KEY ([path], [page]) ); CREATE TABLE [ocr_jobs] ( [key] TEXT PRIMARY KEY, [job_id] TEXT, [etag] TEXT, [s3_ocr_etag] TEXT ); CREATE TABLE [fetched_jobs] ( [job_id] TEXT PRIMARY KEY );"><pre>CREATE TABLE [pages] ( [<span class="pl-k">path</span>] <span class="pl-k">TEXT</span>, [page] <span class="pl-k">INTEGER</span>, [folder] <span class="pl-k">TEXT</span>, [<span class="pl-k">text</span>] <span class="pl-k">TEXT</span>, <span class="pl-k">PRIMARY KEY</span> ([<span class="pl-k">path</span>], [page]) ); CREATE TABLE [ocr_jobs] ( [key] <span class="pl-k">TEXT</span> <span class="pl-k">PRIMARY KEY</span>, [job_id] <span class="pl-k">TEXT</span>, [etag] <span class="pl-k">TEXT</span>, [s3_ocr_etag] <span class="pl-k">TEXT</span> ); CREATE TABLE [fetched_jobs] ( [job_id] <span class="pl-k">TEXT</span> <span class="pl-k">PRIMARY KEY</span> );</pre></div> <p dir="auto">The database is designed to be used with <a href="https://datasette.io" rel="nofollow">Datasette</a>.</p> <h3 dir="auto"><a id="user-content-s3-ocr-index---help" class="anchor" href="#user-content-s3-ocr-index---help" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>s3-ocr index --help</h3> <div 
class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Usage: s3-ocr index [OPTIONS] BUCKET DATABASE Create a SQLite database with OCR results for files in a bucket Options: --access-key ..."><pre class="notranslate"><code>Usage: s3-ocr index [OPTIONS] BUCKET DATABASE Create a SQLite database with OCR results for files in a bucket Options: --access-key ... </code></pre></div> <h2 dir="auto"><a id="user-content-development" class="anchor" href="#user-content-development" aria-hidden="true"><svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M7.775 3.275a.75.75 0 001.06 1.06l1.25-1.25a2 2 0 112.83 2.83l-2.5 2.5a2 2 0 01-2.83 0 .75.75 0 00-1.06 1.06 3.5 3.5 0 004.95 0l2.5-2.5a3.5 3.5 0 00-4.95-4.95l-1.25 1.25zm-4.69 9.64a2 2 0 010-2.83l2.5-2.5a2 2 0 012.83 0 .75.75 0 001.06-1.06 3.5 3.5 0 00-4.95 0l-2.5 2.5a3.5 3.5 0 004.95 4.95l1.25-1.25a.75.75 0 00-1.06-1.06l-1.25 1.25a2 2 0 01-2.83 0z"></path></svg></a>Development</h2> <p dir="auto">To contribute to this tool, first check out the code. 
Then create a new virtual environment:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="cd s3-ocr python -m venv venv source venv/bin/activate"><pre class="notranslate"><code>cd s3-ocr python -m venv venv source venv/bin/activate </code></pre></div> <p dir="auto">Now install the dependencies and test dependencies:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="pip install -e '.[test]'"><pre class="notranslate"><code>pip install -e '.[test]' </code></pre></div> <p dir="auto">To run the tests:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="pytest"><pre class="notranslate"><code>pytest </code></pre></div> <p dir="auto">To regenerate the README file with the latest <code>--help</code>:</p> <div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="cog -r README.md"><pre class="notranslate"><code>cog -r README.md </code></pre></div> </article></div>	1	public	0		0
