nixpkgs/pkgs/applications/graphics/tesseract/4.x.nix

62 lines
2 KiB
Nix
Raw Normal View History

tesseract: Package version 4.x from Git master Tesseract 4 has got a new long short-term memory neural networking based OCR engine which really helps a lot in terms of accuracy and our VM tests. I ran the new version across a bunch of different screenshots and comparing the results to the 3.x branch and it really makes a big difference, especially with various font rendering settings. The only downside of this is that version 4 hasn't been released yet and is in alpha state right now, but it will eventually get there and the only solutions that came into my mind sticking to version 3 were really sub-par: * Use several passes with different color negation on the screenshots. * Train Tesseract 3 specifically for screenshots. This is sub-par because we'd need to do it for Tesseract 4 from scratch again. * Change the test systems so that it specifically uses *only* OCR an font when displaying. I've actually tried this but this also isn't accurate enough with our default font rendering setup. * Turn off special font rendering settings for our tests. In conjunction with changing to an OCR font this might work but it won't catch all the cases, because applications might use their own font rendering. Given that version 4 is faster[1] when it comes to OCR detection and also the points just mentioned I think even using the alpha version just for tests isn't going to hurt anybody. [1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance Signed-off-by: aszlig <aszlig@redmoonstudios.org>
2017-04-11 02:30:45 +02:00
{ stdenv, fetchFromGitHub, autoreconfHook, autoconf-archive, pkgconfig
, leptonica, libpng, libtiff, icu, pango, opencl-headers
# Supported list of languages or `null' for all available languages
, enableLanguages ? null
}:
stdenv.mkDerivation rec {
name = "tesseract-${version}";
version = "4.0.0";
tesseract: Package version 4.x from Git master Tesseract 4 has got a new long short-term memory neural networking based OCR engine which really helps a lot in terms of accuracy and our VM tests. I ran the new version across a bunch of different screenshots and comparing the results to the 3.x branch and it really makes a big difference, especially with various font rendering settings. The only downside of this is that version 4 hasn't been released yet and is in alpha state right now, but it will eventually get there and the only solutions that came into my mind sticking to version 3 were really sub-par: * Use several passes with different color negation on the screenshots. * Train Tesseract 3 specifically for screenshots. This is sub-par because we'd need to do it for Tesseract 4 from scratch again. * Change the test systems so that it specifically uses *only* OCR an font when displaying. I've actually tried this but this also isn't accurate enough with our default font rendering setup. * Turn off special font rendering settings for our tests. In conjunction with changing to an OCR font this might work but it won't catch all the cases, because applications might use their own font rendering. Given that version 4 is faster[1] when it comes to OCR detection and also the points just mentioned I think even using the alpha version just for tests isn't going to hurt anybody. [1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance Signed-off-by: aszlig <aszlig@redmoonstudios.org>
2017-04-11 02:30:45 +02:00
src = fetchFromGitHub {
owner = "tesseract-ocr";
repo = "tesseract";
rev = version;
sha256 = "1b5fi2vibc4kk9b30kkk4ais4bw8fbbv24bzr5709194hb81cav8";
tesseract: Package version 4.x from Git master Tesseract 4 has got a new long short-term memory neural networking based OCR engine which really helps a lot in terms of accuracy and our VM tests. I ran the new version across a bunch of different screenshots and comparing the results to the 3.x branch and it really makes a big difference, especially with various font rendering settings. The only downside of this is that version 4 hasn't been released yet and is in alpha state right now, but it will eventually get there and the only solutions that came into my mind sticking to version 3 were really sub-par: * Use several passes with different color negation on the screenshots. * Train Tesseract 3 specifically for screenshots. This is sub-par because we'd need to do it for Tesseract 4 from scratch again. * Change the test systems so that it specifically uses *only* OCR an font when displaying. I've actually tried this but this also isn't accurate enough with our default font rendering setup. * Turn off special font rendering settings for our tests. In conjunction with changing to an OCR font this might work but it won't catch all the cases, because applications might use their own font rendering. Given that version 4 is faster[1] when it comes to OCR detection and also the points just mentioned I think even using the alpha version just for tests isn't going to hurt anybody. [1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance Signed-off-by: aszlig <aszlig@redmoonstudios.org>
2017-04-11 02:30:45 +02:00
};
tessdata = fetchFromGitHub {
owner = "tesseract-ocr";
repo = "tessdata";
rev = version;
sha256 = "1chw1ya5zf8aaj2ixr9x013x7vwwwjjmx6f2ag0d6i14lypygy28";
tesseract: Package version 4.x from Git master Tesseract 4 has got a new long short-term memory neural networking based OCR engine which really helps a lot in terms of accuracy and our VM tests. I ran the new version across a bunch of different screenshots and comparing the results to the 3.x branch and it really makes a big difference, especially with various font rendering settings. The only downside of this is that version 4 hasn't been released yet and is in alpha state right now, but it will eventually get there and the only solutions that came into my mind sticking to version 3 were really sub-par: * Use several passes with different color negation on the screenshots. * Train Tesseract 3 specifically for screenshots. This is sub-par because we'd need to do it for Tesseract 4 from scratch again. * Change the test systems so that it specifically uses *only* OCR an font when displaying. I've actually tried this but this also isn't accurate enough with our default font rendering setup. * Turn off special font rendering settings for our tests. In conjunction with changing to an OCR font this might work but it won't catch all the cases, because applications might use their own font rendering. Given that version 4 is faster[1] when it comes to OCR detection and also the points just mentioned I think even using the alpha version just for tests isn't going to hurt anybody. [1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance Signed-off-by: aszlig <aszlig@redmoonstudios.org>
2017-04-11 02:30:45 +02:00
};
nativeBuildInputs = [ pkgconfig autoreconfHook autoconf-archive ];
buildInputs = [ leptonica libpng libtiff icu pango opencl-headers ];
# Copy the .traineddata files of the languages specified in enableLanguages
# into `$out/share/tessdata' and check afterwards if copying was successful.
postInstall = let
mkArg = lang: "-iname ${stdenv.lib.escapeShellArg "${lang}.traineddata"}";
mkFindArgs = stdenv.lib.concatMapStringsSep " -o " mkArg;
findLangArgs = if enableLanguages != null
then "\\( ${mkFindArgs enableLanguages} \\)"
else "-iname '*.traineddata'";
in ''
numLangs="$(find "$tessdata" -mindepth 1 -maxdepth 1 -type f \
${findLangArgs} -exec cp -t "$out/share/tessdata" {} + -print | wc -l)"
${if enableLanguages != null then ''
expected=${toString (builtins.length enableLanguages)}
'' else ''
expected="$(ls -1 "$tessdata/"*.traineddata | wc -l)"
''}
if [ "$numLangs" -ne "$expected" ]; then
echo "Expected $expected languages, but $numLangs" \
"were copied to \`$out/share/tessdata'" >&2
exit 1
fi
'';
meta = {
description = "OCR engine";
homepage = https://github.com/tesseract-ocr/tesseract;
tesseract: Package version 4.x from Git master Tesseract 4 has got a new long short-term memory neural networking based OCR engine which really helps a lot in terms of accuracy and our VM tests. I ran the new version across a bunch of different screenshots and comparing the results to the 3.x branch and it really makes a big difference, especially with various font rendering settings. The only downside of this is that version 4 hasn't been released yet and is in alpha state right now, but it will eventually get there and the only solutions that came into my mind sticking to version 3 were really sub-par: * Use several passes with different color negation on the screenshots. * Train Tesseract 3 specifically for screenshots. This is sub-par because we'd need to do it for Tesseract 4 from scratch again. * Change the test systems so that it specifically uses *only* OCR an font when displaying. I've actually tried this but this also isn't accurate enough with our default font rendering setup. * Turn off special font rendering settings for our tests. In conjunction with changing to an OCR font this might work but it won't catch all the cases, because applications might use their own font rendering. Given that version 4 is faster[1] when it comes to OCR detection and also the points just mentioned I think even using the alpha version just for tests isn't going to hurt anybody. [1]: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance Signed-off-by: aszlig <aszlig@redmoonstudios.org>
2017-04-11 02:30:45 +02:00
license = stdenv.lib.licenses.asl20;
maintainers = with stdenv.lib.maintainers; [viric];
platforms = with stdenv.lib.platforms; linux;
};
}