Wednesday, May 17, 2017
Recognizing text in a scanned PDF with Python and Tesseract
How do you convert a scanned PDF file into a PDF file with searchable text?
These instructions are for Ubuntu 16.04.
1) Install pypdfocr
sudo pip install pypdfocr
2) Install Tesseract and the Portuguese language pack
sudo apt install tesseract-ocr tesseract-ocr-por
3) Convert a file
pypdfocr -l por scaneado.pdf
This generates a file named scaneado_ocr.pdf.
For more options, see the documentation at
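To convert several files at once, the command from step 3 can be wrapped in a small shell loop. This is only a sketch: the function names and the OCR_CMD variable are made up for this example (they are not part of pypdfocr), and the output name simply follows the scaneado_ocr.pdf pattern described above.

```shell
#!/bin/sh
# Derive the expected output name: scaneado.pdf -> scaneado_ocr.pdf
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Run OCR on every PDF passed as an argument.
# OCR_CMD is a hypothetical override (e.g. OCR_CMD=echo for a dry run);
# by default the real pypdfocr command is used.
ocr_all() {
    for pdf in "$@"; do
        # Do not re-process files that are already OCR output.
        case "$pdf" in
            *_ocr.pdf) continue ;;
        esac
        out=$(ocr_output_name "$pdf")
        if [ -e "$out" ]; then
            echo "skip: $out already exists"
            continue
        fi
        ${OCR_CMD:-pypdfocr} -l por "$pdf"
    done
}
```

After sourcing the file, `ocr_all *.pdf` processes every PDF in the current directory, and `OCR_CMD=echo` only prints the arguments that would be passed.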
http://virantha.github.io/pypdfocr/html/
---------------------------------------------------------------------------------------
To install on CentOS 7, follow the instructions at
http://www.keienberg.com/install-tesseract-3-04-centos-7/
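If step 3 complains 'leptonica not found', use (csh/tcsh syntax):
$ setenv LIBLEPT_HEADERSDIR /usr/local/include/leptonica ; setenv LDFLAGS -L/usr/lib ; ./configure --prefix=/usr
In bash or sh, the equivalent is:
$ export LIBLEPT_HEADERSDIR=/usr/local/include/leptonica LDFLAGS=-L/usr/lib ; ./configure --prefix=/usr
make install and ldconfig must be run with sudo: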
sudo make install
sudo ldconfig
In step 5, run
export TESSDATA_PREFIX=/usr/share/tessdata
----------------------------------------------------------------------------------
Corrected version
Tesseract installs easily from packages on Ubuntu, but on CentOS it has to be built from source. Below is a description of how to install Tesseract on CentOS.
Versions used:
Tesseract: 3.04.01 tesseract-3.04.01.tar.gz
Leptonica: 1.73 leptonica-1.73.tar.gz
Tesseract language data: 3.02 tesseract-ocr-3.02.deu.tar.gz, tesseract-ocr-3.02.eng.tar.gz, tesseract-ocr-3.02.nld.tar.gz, tesseract-ocr-3.02.por.tar.gz
Ghostscript: 9.20 ghostscript-9.20.tar.gz
Install Tesseract 3.04 on CentOS 7
I executed all commands as root, but if you prefer, you can use another account and run the commands with 'sudo'.
1) First update your system:
yum update
Because Tesseract is not available through yum, we need to download the source and build both Tesseract and Leptonica.
This requires development tools to be installed.
yum groupinstall "Development Tools"
If that gives an error, use:
yum groupinstall "Ferramentas de desenvolvimento"
yum -y install automake autoconf libtool zlib-devel libjpeg-devel giflib libtiff-devel libwebp libwebp-devel libicu-devel openjpeg-devel cairo-devel
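Before starting the builds, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch; the function name missing_tools is made up for this example, and the list of tools to check is an assumption based on the packages installed above.

```shell
#!/bin/sh
# Print the name of every tool that cannot be found on the PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools gcc g++ make automake autoconf libtool wget` should print nothing once step 1 has completed; any name it prints still needs to be installed.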
2) Now download and install Leptonica:
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar xzvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make
make install
3) Download and install Tesseract:
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
mv 3.04.01.tar.gz tesseract-3.04.01.tar.gz
tar xzvf tesseract-3.04.01.tar.gz
cd tesseract-3.04.01/
./autogen.sh
export LIBLEPT_HEADERSDIR=/usr/local/include/leptonica LDFLAGS=-L/usr/lib ; ./configure --prefix=/usr
make
make install
ldconfig
make training
make training-install
4) Download and install the Tesseract language data files:
wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.eng.tar.gz
wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.nld.tar.gz
wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.deu.tar.gz
wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.por.tar.gz
tar xzvf tesseract-ocr-3.02.eng.tar.gz
tar xzvf tesseract-ocr-3.02.nld.tar.gz
tar xzvf tesseract-ocr-3.02.deu.tar.gz
tar xzvf tesseract-ocr-3.02.por.tar.gz
cp -r tesseract-ocr/tessdata/ /usr/share/
5) Export TESSDATA_PREFIX:
export TESSDATA_PREFIX=/usr/share/tessdata
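If Tesseract later reports that it cannot load a language, a quick check is to look for the .traineddata file directly. The sketch below is an assumption-laden helper, not part of Tesseract: the function name find_traineddata is invented here, and it looks in both PREFIX and PREFIX/tessdata because the error messages of Tesseract 3.x describe TESSDATA_PREFIX as the parent directory of "tessdata".

```shell
#!/bin/sh
# find_traineddata LANG [PREFIX]
# Print the path of LANG.traineddata, looking in PREFIX and PREFIX/tessdata.
# PREFIX defaults to $TESSDATA_PREFIX, then to /usr/share/tessdata.
find_traineddata() {
    lang=$1
    prefix=${2:-${TESSDATA_PREFIX:-/usr/share/tessdata}}
    for candidate in "$prefix/$lang.traineddata" \
                     "$prefix/tessdata/$lang.traineddata"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no $lang.traineddata under $prefix" >&2
    return 1
}
```

Running `find_traineddata por` after step 5 shows where the Portuguese data was found. If the file only turns up directly under the prefix and Tesseract still fails, setting TESSDATA_PREFIX to the parent directory (/usr/share in this setup) may be worth trying.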
6) Finally, install Ghostscript, which is used to render PDF pages to PNG images:
wget https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs920/ghostscript-9.20.tar.gz
tar xzvf ghostscript-9.20.tar.gz
cd ghostscript-9.20/
./autogen.sh
./configure
make
make install
If tesseract complains that it cannot find liblept.so.5, run:
ln -s /usr/local/lib/liblept.so.5 /usr/lib64/
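The symlink above fails if the link already exists, and it silently creates a dangling link if the source file is missing. A slightly more careful variant could look like the sketch below; the function name link_if_missing is made up for this example.

```shell
#!/bin/sh
# link_if_missing SRC DEST_DIR
# Create DEST_DIR/<basename of SRC> as a symlink to SRC, unless something
# with that name is already there or SRC does not exist.
link_if_missing() {
    src=$1
    dest_dir=$2
    dest="$dest_dir/$(basename "$src")"
    if [ -e "$dest" ] || [ -L "$dest" ]; then
        echo "already present: $dest"
        return 0
    fi
    if [ ! -e "$src" ]; then
        echo "source missing: $src" >&2
        return 1
    fi
    ln -s "$src" "$dest"
}
```

Here that would be `link_if_missing /usr/local/lib/liblept.so.5 /usr/lib64`, followed by `ldconfig`. To see which liblept the binary actually resolves, `ldd "$(command -v tesseract)" | grep liblept` is a useful check.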