Wednesday, May 17, 2017

Recognizing text in a scanned PDF with Python and Tesseract


How do you convert a scanned PDF file into a PDF whose text can be searched?

These instructions are for Ubuntu 16.04.

1) Install pypdfocr

sudo pip install pypdfocr

2) Install Tesseract and the Portuguese language pack

sudo apt install tesseract-ocr tesseract-ocr-por
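
Before converting anything, it can help to confirm that Tesseract actually sees the Portuguese language data. The sketch below wraps the `tesseract --list-langs` option in a small helper; the function name `has_tess_lang` is made up for this post, and the exact output format of `--list-langs` is assumed to be one language code per line.

```shell
# Hypothetical helper: succeed if Tesseract lists the given language code.
# Some Tesseract versions print the list to stderr, hence 2>&1.
has_tess_lang() {
    # -x: match the whole line, -F: fixed string, -q: no output
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}
```

For example, `has_tess_lang por && echo "Portuguese is available"`.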

3) Convert a file

pypdfocr -l por scaneado.pdf

This generates a file named scaneado_ocr.pdf.
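
To convert a whole directory instead of a single file, something like the following could work. It is only a sketch: both function names are invented here, and it assumes pypdfocr writes the `_ocr.pdf` file next to the input, following the scaneado.pdf to scaneado_ocr.pdf pattern above.

```shell
# Hypothetical helper: derive the output name from the input name,
# following the scaneado.pdf -> scaneado_ocr.pdf pattern.
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: run pypdfocr on every PDF in a directory
# that does not have a matching *_ocr.pdf yet.
# Assumes pypdfocr is on PATH; the language defaults to Portuguese.
ocr_all_pdfs() {
    dir="${1:-.}"
    lang="${2:-por}"
    for pdf in "$dir"/*.pdf; do
        # Skip the literal pattern when no PDF matched the glob
        if [ ! -e "$pdf" ]; then continue; fi
        # Skip files that are themselves earlier outputs
        case "$pdf" in *_ocr.pdf) continue ;; esac
        # Skip inputs that were already converted
        if [ -e "$(ocr_output_name "$pdf")" ]; then continue; fi
        pypdfocr -l "$lang" "$pdf"
    done
}
```

Calling `ocr_all_pdfs ~/scans por` would then process each unconverted PDF in ~/scans.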


For more options, see the documentation at

http://virantha.github.io/pypdfocr/html/ 

---------------------------------------------------------------------------------------
To install on CentOS 7, follow the instructions at

http://www.keienberg.com/install-tesseract-3-04-centos-7/

If step 3 complains 'leptonica not found', use:

env LIBLEPT_HEADERSDIR=/usr/local/include/leptonica LDFLAGS=-L/usr/lib ./configure --prefix=/usr

make install and ldconfig must be run as root:

sudo make install
sudo ldconfig

In step 5, run:

export TESSDATA_PREFIX=/usr/share/tessdata
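
Note that `export` only affects the current shell session. One way to keep the variable across logins is to append the line to a startup file, but only once. The helper below is a sketch; its name, and the choice of ~/.bashrc as the default file, are assumptions rather than part of the original instructions.

```shell
# Hypothetical helper: append "export VAR=VALUE" to a startup file
# unless that exact line is already there.
persist_export() {
    line="export $1=$2"
    file="${3:-$HOME/.bashrc}"   # assumed default startup file
    # -x: whole-line match, -F: fixed string, -q: no output
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example: `persist_export TESSDATA_PREFIX /usr/share/tessdata`.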

----------------------------------------------------------------------------------

Corrected version

Tesseract installs easily on Ubuntu, but on CentOS it has to be built from source. Below is a description of how to install Tesseract 3.04 on CentOS 7.
Versions used:
Tesseract: 3.04.01 (tesseract-3.04.01.tar.gz)
Leptonica: 1.73 (leptonica-1.73.tar.gz)
Tesseract language data: 3.02 (tesseract-ocr-3.02.deu.tar.gz, tesseract-ocr-3.02.eng.tar.gz, tesseract-ocr-3.02.nld.tar.gz, tesseract-ocr-3.02.por.tar.gz)
Ghostscript: 9.20 (ghostscript-9.20.tar.gz)

I executed all commands as root, but if you prefer, you can use another account and run the commands with sudo.

1) First, update your system:
yum update
Because Tesseract is not available through yum, we need to download the source and build both Tesseract and Leptonica.
This requires the development tools to be installed:
yum groupinstall "Development Tools"

If that fails (for example, on a system with a Portuguese locale), use the localized group name:

yum groupinstall "Ferramentas de desenvolvimento"

yum -y install automake autoconf libtool zlib-devel libjpeg-devel giflib libtiff-devel libwebp libwebp-devel libicu-devel openjpeg-devel cairo-devel
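
Since a missing -devel package tends to surface only later as a confusing configure error, it may be worth checking the list up front. The following is a minimal sketch built on `rpm -q`, which exits with a non-zero status for packages that are not installed; the function name `missing_rpms` is invented for this example.

```shell
# Hypothetical helper: print every named package that rpm does not
# report as installed.
missing_rpms() {
    for pkg in "$@"; do
        rpm -q "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}
```

For example, `missing_rpms libtool zlib-devel libjpeg-devel libtiff-devel` prints nothing when all four are present.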

2) Now download and install Leptonica:
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar xzvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make 
make install 

3) Download and install Tesseract:
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
mv 3.04.01.tar.gz tesseract-3.04.01.tar.gz
tar xzvf tesseract-3.04.01.tar.gz
cd tesseract-3.04.01/
./autogen.sh
env LIBLEPT_HEADERSDIR=/usr/local/include/leptonica LDFLAGS=-L/usr/lib ./configure --prefix=/usr
make
make install
ldconfig

make training
make training-install
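
Once the build is installed, a quick sanity check is to ask the binary for its version. The sketch below assumes the first line of `tesseract --version` has the form "tesseract 3.04.01" and that it may be printed to stderr; the function name `tess_version` is made up here.

```shell
# Hypothetical helper: print the version number reported by the
# installed tesseract binary (second word of the first output line).
tess_version() {
    tesseract --version 2>&1 | awk 'NR == 1 { print $2 }'
}
```

If `tess_version` prints something other than the version you just built, an older binary is probably earlier on your PATH.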
 


4) Download and install the Tesseract language data files:
wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.eng.tar.gz
wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.nld.tar.gz
wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.deu.tar.gz
wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.por.tar.gz
tar xzvf tesseract-ocr-3.02.eng.tar.gz
tar xzvf tesseract-ocr-3.02.nld.tar.gz
tar xzvf tesseract-ocr-3.02.deu.tar.gz
tar xzvf tesseract-ocr-3.02.por.tar.gz

cp -r tesseract-ocr/tessdata/ /usr/share/
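
To confirm the copy worked, you can check that each language has its data file in place. This sketch assumes every archive contributes a file named after its language code with a .traineddata extension (eng.traineddata, por.traineddata and so on); the function name `missing_traineddata` is hypothetical.

```shell
# Hypothetical helper: print the language codes whose
# <code>.traineddata file is missing from the given directory.
missing_traineddata() {
    dir="$1"
    shift
    for lang in "$@"; do
        [ -f "$dir/$lang.traineddata" ] || printf '%s\n' "$lang"
    done
}
```

For example, `missing_traineddata /usr/share/tessdata eng nld deu por` should print nothing after step 4.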
 

 
5) Export TESSDATA_PREFIX:
export TESSDATA_PREFIX=/usr/share/tessdata

6) Last, install Ghostscript for processing PNG files:

wget https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs920/ghostscript-9.20.tar.gz

tar xzvf ghostscript-9.20.tar.gz

cd ghostscript-9.20/
./autogen.sh
./configure
make
make install
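
With everything built, a final check that the tools are actually reachable from your shell can save some head-scratching. This is a small sketch using the shell's own `command -v`; the function name `missing_cmds` is invented for the example.

```shell
# Hypothetical helper: print each named command that cannot be
# found on PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

For example, `missing_cmds tesseract gs pypdfocr` prints nothing when all three are installed.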

If tesseract complains that it cannot find liblept.so.5, run:

 ln -s /usr/local/lib/liblept.so.5 /usr/lib64/
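
To see whether any shared library is still unresolved after creating the link, `ldd` can be used on the tesseract binary. The helper below is a sketch that assumes the usual "libname => not found" output of ldd; the name `unresolved_libs` is made up for this post.

```shell
# Hypothetical helper: print the shared libraries that ldd reports
# as "not found" for the given binary.
unresolved_libs() {
    ldd "$1" 2>/dev/null | awk '/not found/ { print $1 }'
}
```

For example, `unresolved_libs "$(command -v tesseract)"` should print nothing once liblept.so.5 can be found.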





