Seven Minute Server

Oct 17, 2019 - 3 minute read

EOAT's got a Docker image!

EOAT now has a Docker image! It’s huge, but for good reason: I’ve installed Tesseract and its dependencies, pandoc, TeX Live, the components required for third-party translation (gcloud, translate-shell, boto3), all the EOAT tools, and more. Because it can take hours to get everything configured and installed with dependencies in the correct order, this should speed everything up.

If you give this a whirl and have comments/bugs/issues/requests for an even more ridiculously giant docker image, please drop me a line @ jen@sevenminuteserver.com.

So here’s how you use it:

  1. Install docker (sudo [yum|apt-get] install docker on Linux, and however y’all do it on Windows). If you hit permissions problems running docker on Linux, add yourself to the docker group with sudo usermod -aG docker $(whoami), log out and back in again, then run sudo service docker start.

  2. Pull my docker image:

    docker pull jenh/eoat:v2
    
  3. List the images to make sure you’ve got it:

    you@host:~$ docker images
      
    REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
    jenh/eoat           v2                  af26663f6b4f        About an hour ago   6.04GB
    
  4. Run an instance, and be sure to use -d, as the instance will shut down immediately if nothing’s running on it, and nothing will be running on it at first…

    docker run -t -d --name EOAT af26663f6b4f
    
  5. Copy the PDF you want to OCR and translate into the container (note: you can also log into the container and wget it: wget https://wherever.wherever/myfile.pdf):

    docker cp myfile.pdf EOAT:/root/
    
  6. Log into the running container:

    docker exec -it EOAT bash
    
  7. Copy the Tesseract language files you need. If you are using English, Russian, French, and/or Spanish, you can skip this step. For a full listing of the files available see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files. For example, the following downloads Portuguese model files:

    cd /usr/local/share/tessdata
    wget https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata
    
  8. Navigate to where you saved the PDF and get started, where eng is the three-letter language code:

    cd ~/ && eoat-ocr myfile.pdf eng
    
  9. Wait a while and let Tesseract work. When it finishes, eoat-ocr writes the data to a text file. You may want to open it and clean it up: you’ll see things like page numbers that may need to be removed, images that produce gobbledygook, etc.

  10. Translate. The following uses free Google Translate, via translate-shell, to translate English into Spanish with a wait time of 4 seconds between lines (the default is 2). You can switch to a paid engine, which doesn’t typically cut you off, with -e gcloud for Google Cloud or -e amazon for AWS:

    eoat-trans -i myfile.txt -s en -t es -w 4
    

    Even a paid engine can still cut you off, so it’s worth setting -w to 1 rather than 0. For Google Cloud, copy your credentials file (JSON format) into the container, then export GOOGLE_APPLICATION_CREDENTIALS=my-translate-creds.json and use -e gcloud. For AWS, edit /root/.aws/credentials and customize it with your region and access information, then use -e amazon:

    [default]
    region = us-east-1
    aws_access_key_id = your_aws_key
    aws_secret_access_key = your_aws_secret_key
    
  11. Split the output into separate files per chapter, if applicable. Use -d to specify the delimiter to use to break the document into sections.

    eoat-split -i myfile.txt-en-es.txt -d "Chapter"
    
  12. Run eoat-make to create a makefile.

  13. Run eoat-build en to build English deliverables, eoat-build es to build Spanish deliverables, etc. If you skipped the translation step, you can just run eoat-build.

    Output will be saved in book_en.epub (for an English epub) and book_en.pdf (for an English PDF).
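
If you run through this often, the host-side steps (pull, run, copy, OCR) can be wrapped in a small shell script. What follows is only a sketch, not something that ships with EOAT: the function names and the DRY_RUN switch are made up for this example, it assumes eoat-ocr is on the container’s default PATH when called through docker exec, and it assumes the build output lands in /root. By default it only prints the docker commands it would run; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the host-side docker commands from the steps above.
# IMAGE and CONTAINER come from the post; DRY_RUN and the function names
# are hypothetical helpers introduced for this example.
IMAGE="jenh/eoat:v2"
CONTAINER="EOAT"

# Print the command when DRY_RUN=1 (the default here); otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Steps 2 to 6 and 8: pull the image, start a container, copy the PDF in, OCR it.
# Assumes eoat-ocr is on the container's default PATH (not confirmed by the post).
eoat_ocr_in_container() {
  pdf="$1"    # path to the PDF on the host
  lang="$2"   # three-letter Tesseract language code, e.g. eng
  if [ -z "$pdf" ] || [ -z "$lang" ]; then
    echo "usage: eoat_ocr_in_container FILE.pdf LANG" >&2
    return 2
  fi
  name="$(basename "$pdf")"
  run docker pull "$IMAGE" &&
  run docker run -t -d --name "$CONTAINER" "$IMAGE" &&
  run docker cp "$pdf" "$CONTAINER:/root/" &&
  run docker exec -w /root "$CONTAINER" eoat-ocr "$name" "$lang"
}

# Copy the built epub and PDF back to the host.
# Assumes the build was run in /root, so book_<lang>.* ends up there.
eoat_fetch_outputs() {
  lang="${1:-en}"   # language suffix used by eoat-build, e.g. en or es
  dest="${2:-.}"    # host directory to copy into
  for ext in epub pdf; do
    run docker cp "$CONTAINER:/root/book_${lang}.${ext}" "$dest" || return 1
  done
}
```

For example, DRY_RUN=0 eoat_ocr_in_container myfile.pdf eng runs the pull, run, copy, and OCR steps in one go (skipping the optional language download in step 7), and DRY_RUN=0 eoat_fetch_outputs es copies book_es.epub and book_es.pdf into the current directory once eoat-build has finished. Note that docker run will complain if a container named EOAT already exists.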