Captcha Breaker
2021-03-25
I recently heard about OCR (optical character recognition) from a friend. He said his company was getting an OCR machine that would save time processing all of their paperwork. I was curious and, sure enough, found pytesseract, a Python wrapper for the Tesseract OCR engine. The first use case I thought of was breaking captchas.
It turns out Tesseract doesn’t do too well with captchas right out of the box, but with a little help from ImageMagick it could do OK. It is still not 100% accurate, but even humans sometimes can’t read captchas properly. I think with the right ImageMagick, uh… magic, you could break most text captchas found on the internet (not Google reCAPTCHA, though).
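One thing that can help on the Tesseract side, before any image processing, is telling it what to expect. Here is a minimal sketch, assuming the captcha is a single line of ASCII letters and digits. `--psm 7` is Tesseract's page segmentation mode for a single text line, and `tessedit_char_whitelist` restricts the characters it may output (how strictly the whitelist is honored depends on the Tesseract version). The helper names `build_config`, `clean_text`, and `read_captcha` are made up for illustration and are not part of pytesseract.

```python
import string

# Assumption: the captcha only uses ASCII letters and digits.
ALLOWED = string.ascii_letters + string.digits


def build_config(allowed=ALLOWED):
    """Build a Tesseract config string for one line of known characters."""
    return "--psm 7 -c tessedit_char_whitelist=" + allowed


def clean_text(raw, allowed=ALLOWED):
    """Drop whitespace and any character outside the allowed set."""
    return "".join(ch for ch in raw if ch in allowed)


def read_captcha(path):
    # Imported here so the helpers above work without the OCR stack installed.
    import pytesseract
    from PIL import Image

    raw = pytesseract.image_to_string(Image.open(path), config=build_config())
    return clean_text(raw)
```

Calling `read_captcha('captcha.png')` (a hypothetical file name) needs both Tesseract and pytesseract installed. The cleanup step is there because OCR output often comes back with stray whitespace around the text.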
Here is some sample code:
import argparse
import os
from subprocess import check_output

import pytesseract
try:
    # Very old PIL installs expose Image as a top-level module
    import Image
except ImportError:
    from PIL import Image


def resolve(path):
    print("Resampling the image", path)
    # Save the processed copy next to the original, with "new" prefixed to the file name
    directory, filename = os.path.split(path)
    new_path = os.path.join(directory, "new" + filename)
    # Image processing with ImageMagick
    check_output(['convert', path, '-resample', '600', new_path])
    # The line above is where you need to get creative and try to process the image so it is easy to perform OCR on
    return pytesseract.image_to_string(Image.open(new_path))


if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('path', help='Captcha file path')
    args = argparser.parse_args()
    path = args.path
    print('Resolving Captcha')
    captcha_text = resolve(path)
    print('Extracted Text', captcha_text)
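As an idea of what "getting creative" with the `convert` call might look like, here is a rough sketch that converts the image to grayscale and thresholds it to pure black and white before resampling. This is not what the script above does; the 50% threshold is a guess that would need tuning per captcha style, and `build_convert_command` and `preprocess` are hypothetical helper names. The flags `-colorspace`, `-threshold`, and `-resample` are standard ImageMagick options.

```python
from subprocess import check_output


def build_convert_command(src, dst, threshold=50, dpi=600):
    """Return an ImageMagick command line that cleans up an image for OCR."""
    return [
        'convert', src,
        '-colorspace', 'Gray',          # drop color information
        '-threshold', f'{threshold}%',  # force every pixel to black or white
        '-resample', str(dpi),          # same resampling step as the script above
        dst,
    ]


def preprocess(src, dst, **options):
    # Requires ImageMagick's `convert` to be on the PATH.
    check_output(build_convert_command(src, dst, **options))
    return dst
```

Building the command as a list in a separate function makes it easy to try different options, for example `preprocess('captcha.png', 'clean.png', threshold=40)`. On ImageMagick 7 the `convert` command is kept as a legacy name and `magick` is the preferred entry point, so the first element may need changing there.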
Here is the GitHub repo where you can try it out with a sample captcha: https://github.com/karangejo/captcha-breaker