
Data Scraping from Image using Tesseract

Post by: admin   Category: Web Development   Fields: Other

Demo link: https://sourceforge.net/projects/jati/?source=navbar

Scrape data from image using Tesseract OCR engine.

Introduction

Data science is a growing field. According to the CRISP-DM model and other data mining process models, data must be collected before knowledge can be mined from it and predictive analysis conducted. Data collection can involve data scraping, which includes web scraping (HTML to text) as well as image-to-text and video-to-text conversion. Once the data is in text format, text mining techniques can be applied to it.

In this article, I introduce Optical Character Recognition (OCR) as a way to convert images to text. I developed Just Another Tesseract Interface (JATI) to convert images into text files and consolidate them into a set of text data for text mining and natural language processing.

JATI interfaces with the Tesseract OCR engine to convert images into text. I have included the source code. In this article, I explain how to interface with the popular open source Tesseract OCR engine from C#.

Selecting the Image Portion to Convert

Running OCR on a whole image is easy, but I want to be able to select only a portion of the image. Restricting OCR to the relevant region can also improve the accuracy of the result. Hence, in JATI, the user can click on the PictureBox image and drag to draw a rectangle around the portion of interest. The selected area is then cropped. The following steps accomplish this.

Reference:

1. http://www.c-sharpcorner.com/UploadFile/hirendra_singh/how-to-make-image-editor-tool-in-C-Sharp-cropping-image/

2. https://stackoverflow.com/questions/34551800/get-the-exact-size-of-the-zoomed-image-inside-the-picturebox

Include the System.Drawing library:

using System.Drawing;

Mouse Down event for PictureBox1:

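void PictureBox1MouseDown(object sender, MouseEventArgs e)
{
    if (e.Button == System.Windows.Forms.MouseButtons.Left)
    {
        Cursor = Cursors.Cross;

        // Remember where the drag started. The current point is reset too,
        // so a click without a drag produces an empty selection.
        startX = curX = e.X;
        startY = curY = e.Y;

        selPen = new Pen(Color.Red, 1);
    }

    // Repaint to clear any rectangle left over from a previous selection.
    pictureBox1.Refresh();
}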

Mouse Move event for PictureBox1:

void PictureBox1MouseMove(object sender, MouseEventArgs e)
{
    if (e.Button == System.Windows.Forms.MouseButtons.Left && selPen != null)
    {
        // Repaint first so that only the latest rectangle is visible.
        pictureBox1.Refresh();
        curX = e.X;
        curY = e.Y;

        Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);

        // Dispose the temporary Graphics object as soon as the outline is drawn.
        using (Graphics g = pictureBox1.CreateGraphics())
        {
            g.DrawRectangle(selPen, rect);
        }
    }
}

Mouse Up event for PictureBox1:

void PictureBox1MouseUp(object sender, MouseEventArgs e)
{
    Cursor = Cursors.Arrow;

    // Nothing to crop without an image, or with an empty or reversed
    // selection (a Bitmap cannot have a zero or negative size).
    if (pictureBox1.Image == null || curX <= startX || curY <= startY)
        return;

    Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);

    // Scale the source image to the PictureBox size so that mouse
    // coordinates correspond to its pixels.
    Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);
    Bitmap _img = new Bitmap(curX - startX, curY - startY);

    using (Graphics g = Graphics.FromImage(_img))
    {
        g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
        g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
        g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

        // Copy only the selected region into the new bitmap.
        g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
    }
    OriginalImage.Dispose();

    pictureBox2.Image = _img;
    pictureBox2.SizeMode = PictureBoxSizeMode.Zoom;
    pictureBox2.Width = _img.Width;
    pictureBox2.Height = _img.Height;
}

The code above crops the selected portion of the image and places it into pictureBox2. The steps are explained in detail below.

Create a new rectangle object for the selection:

Rectangle rect = new Rectangle(startX, startY, curX-startX, curY-startY);

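Copy the original image into a Bitmap object scaled to the size of pictureBox1, so that mouse coordinates correspond to its pixels (this assumes the image is displayed stretched to fill the control):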

Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);

Create a new, empty Bitmap object with the size of the selection:

Bitmap _img = new Bitmap(curX-startX, curY-startY);

Create a Graphics Object based on the new Bitmap Object:

Graphics g = Graphics.FromImage(_img);

Configure the Graphics object for high-quality rendering:

g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

Crop the image based on the selection and show the result in pictureBox2:

g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
pictureBox2.Image = _img;
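
The walkthrough above crops from a copy of the image that has been scaled to the PictureBox size. If the source image is larger than the control, cropping from the full-resolution image instead may give the OCR engine more pixels to work with. The following is a minimal sketch of the coordinate mapping this would need, assuming the image is stretched to fill the whole control; the SelectionScaler class and the ToImagePixels method are hypothetical names and are not part of JATI.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the JATI source.
public static class SelectionScaler
{
    // Maps a selection given in control coordinates to source-image pixels.
    // Assumption: the image is stretched to fill the entire control, so the
    // scale factors are simply the ratios of image size to control size.
    public static Rectangle ToImagePixels(Rectangle selection, Size controlSize, Size imageSize)
    {
        double scaleX = (double)imageSize.Width / controlSize.Width;
        double scaleY = (double)imageSize.Height / controlSize.Height;

        // Scale each edge separately so that rounding never shifts the size.
        int left = (int)Math.Round(selection.Left * scaleX);
        int top = (int)Math.Round(selection.Top * scaleY);
        int right = (int)Math.Round(selection.Right * scaleX);
        int bottom = (int)Math.Round(selection.Bottom * scaleY);

        return Rectangle.FromLTRB(left, top, right, bottom);
    }
}
```

The resulting rectangle could then be passed to Bitmap.Clone(Rectangle, PixelFormat) on the original bitmap to extract the region without resampling. A PictureBox in Zoom mode would additionally need the letterbox offsets described in the second reference above.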

To get the selected coordinates for the image, I use:

string selCoordinates = "(" + startX.ToString() + "," + startY.ToString() + "," + curX.ToString() + "," + curY.ToString() + ")";
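
The handlers above only produce a usable selection when the user drags from the top-left corner towards the bottom-right corner, because the width and height are computed as curX - startX and curY - startY. The following is one possible sketch of how the two corner points could be turned into a valid rectangle for any drag direction and clipped to the control; the SelectionHelper class and its methods are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the JATI source.
public static class SelectionHelper
{
    // Builds a rectangle with non-negative width and height from two corner
    // points, regardless of the direction in which the user dragged.
    public static Rectangle Normalize(int startX, int startY, int curX, int curY)
    {
        return Rectangle.FromLTRB(
            Math.Min(startX, curX), Math.Min(startY, curY),
            Math.Max(startX, curX), Math.Max(startY, curY));
    }

    // Clips the selection to the given bounds (for example the PictureBox
    // size) so that the crop never reads outside the displayed image.
    public static Rectangle ClampTo(Rectangle selection, Size bounds)
    {
        return Rectangle.Intersect(selection, new Rectangle(Point.Empty, bounds));
    }
}
```

With such a helper, the mouse-up handler could call SelectionHelper.Normalize(startX, startY, curX, curY) and use the resulting rectangle's Width and Height when creating the cropped bitmap.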

Image to Text Recognition using Tesseract

I use the Tesseract OCR engine to convert images into text. To interface with the Tesseract OCR engine, include the System.Diagnostics namespace (the Directory class used below additionally needs System.IO):

using System.Diagnostics;
using System.IO;

Save the cropped image selection from pictureBox2 into a temporary directory:

pictureBox2.Image.Save(Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png");

Set the input file and output file for Tesseract OCR engine:

string input = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png";
string output = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".txt";

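Create the process and pass in the arguments. The paths are wrapped in quotes so that a working directory containing spaces does not split them into separate arguments. Tesseract appends the .txt extension to the output base name itself, which is why it is stripped from the output path here. Note that -psm is the Tesseract 3.x spelling; Tesseract 4 and later expect --psm.

Process myProcess = Process.Start(Directory.GetCurrentDirectory() + "/JATI/tesseract.exe", "--tessdata-dir ./JATI/ \"" + input + "\" \"" + output.Replace(".txt", "") + "\" -l " + languageTextBox.Text + " -psm " + psmTextBox.Text);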

Wait for the process to exit:

myProcess.WaitForExit();
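
Once the process has exited, the recognized text is available in the output file. The steps above could be wrapped up roughly as follows. This is only a sketch under a few assumptions: the TesseractRunner class and its method names are hypothetical, the -psm flag mirrors the call shown above (newer Tesseract builds use --psm), and a non-zero exit code is treated as a failure.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical wrapper, not part of the JATI source.
public static class TesseractRunner
{
    // Wraps a path in double quotes so that spaces do not split the argument.
    // The path should not end with a backslash, which would escape the quote.
    private static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Builds the command line in the same order as the article:
    // tessdata directory, input image, output base name, language, page mode.
    public static string BuildArguments(string tessdataDir, string inputImage,
                                        string outputBase, string language, string psm)
    {
        return "--tessdata-dir " + Quote(tessdataDir) + " " + Quote(inputImage) + " " +
               Quote(outputBase) + " -l " + language + " -psm " + psm;
    }

    // Runs Tesseract without opening a console window, waits for it to
    // finish, and returns the text it wrote to "<outputBase>.txt".
    public static string Run(string exePath, string arguments, string outputBase)
    {
        var startInfo = new ProcessStartInfo(exePath, arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("Tesseract exited with code " + process.ExitCode);
        }

        return File.ReadAllText(outputBase + ".txt");
    }
}
```

Building the argument string in a separate method keeps the quoting in one place and makes it easy to inspect the exact command line before the process is started.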

Posted on 02-04-2018 
