how to discover all images within an ebook and determine what they are

EPUB testing discussion for the SDPP-D Grant project 2018-19

Post by Heidi » Wed Oct 31, 2018 8:22 pm
Happy Halloween everyone! I hope all of you are getting a chance to enjoy this great holiday.
I would just like to pass along some tips on how to discover and work with all images present in an EPUB file, especially if no text descriptions are available. I have come across a couple of books whose images have a completely blank alt attribute. By default, most screen readers won't detect these pictures unless you're specifically searching for them. The text of the book might not give you any indication that an unlabeled image is nearby. Worse, the image may contain meaningful text content that blind readers would miss out on. On the other hand, an undescribed image may simply be a decorative icon that makes the book more visually appealing.
As part of my testing workflow, I make extensive efforts to find out about all images used in the books I read. There are a few ways of going about this.
If you use JAWS for Windows, you can easily configure how graphical information is displayed within the Quick Settings dialog by typing the word "graphic" into the search field. You can choose to have JAWS show all graphics, only those that have been tagged (with alt attributes), or none. JAWS can give you descriptions of an image using its alt attribute, longest available text info, title, OnMouseOver or a custom search criterion. For testing purposes, I have JAWS set to show all graphics. This option detects most images. If I find an image that is not well described, I will set JAWS to recognize graphics by the longest string of textual information available.
I have recently begun using a feature in NVDA called Windows 10 OCR. If you are running Windows 10, or if you've installed an OCR add-on, you can use this feature to obtain the text inside an image. There may be spelling errors, but the text you receive back is usually high quality. You can use Windows 10 OCR while reading a book in browse mode. Simply place your cursor on an image, then press NVDA+R to run OCR on it. Alternatively, unzip the EPUB file, find the location where all the book's images are stored, then open a picture in the Photos app. Next, just press NVDA+R to OCR the image and extract all the text from it. The content you receive back from the OCR scan will be in a separate document that can be copied to your clipboard.

To get a thorough overview of how and where undescribed images are being used, I unzip an EPUB file and look at its underlying code and structure. If an undescribed image is always present at the beginning or ending of a chapter, and if its filename implies that the image is an icon of some sort, I can determine that the image is probably visually decorative.
I usually search each HTML file within an unzipped EPUB document for image tags. For those of you who may not know HTML, images are inserted into a document in the following way: <img src="ImageName.jpg" alt="Sample image description">. This is a simplified example, but all image tags should contain these essential elements. The src attribute contains the location of the image within the EPUB's structure and the alt attribute contains a description of the image. If you see something like this in the image tag, alt="", you might not even locate the presence of this image with your screen reader. Some images may not even contain the alt attribute at all. When seeking out images in an HTML file, I search for the keyword "<img".
Please note: Most unzipped EPUB books I have seen have an images directory that contains all pictures used in the book, although the folder name and location can vary from publisher to publisher. Thus far, I have found that the image filenames publishers use can sometimes be ambiguous and not very descriptive.
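
For anyone comfortable with a little scripting, the "<img" search described above could be automated along these lines. This is only a minimal Python sketch using the standard library, not part of the workflow described in this post: the names ImgAudit, audit_images and audit_folder are made up for illustration, and it assumes the book's pages are .xhtml, .html or .htm files encoded as UTF-8.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImgAudit(HTMLParser):
    """Collect (src, status) for every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attrs = dict(attrs)
        if "alt" not in attrs:
            status = "missing alt"
        elif not (attrs["alt"] or "").strip():
            status = "empty alt"
        else:
            status = "described"
        self.found.append((attrs.get("src", ""), status))


def audit_images(markup):
    """Return a list of (src, status) pairs for one HTML/XHTML string."""
    parser = ImgAudit()
    parser.feed(markup)
    parser.close()
    return parser.found


def audit_folder(root):
    """Yield (file, src, status) for every image in an unzipped EPUB folder."""
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() in {".xhtml", ".html", ".htm"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            for src, status in audit_images(text):
                yield str(path), src, status
```

Calling audit_folder on the folder you extracted the book into (for example a hypothetical folder named "unzipped_book") and printing each row gives a list of every image, the file it appears in, and whether its alt attribute is described, empty, or missing altogether.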

I hope these tips will assist you in your work. If you have any questions, please let me know. I wish all of you a great evening!

Heidi

Re: how to discover all images within an ebook and determine what they are

Post by ka.li » Thu Nov 01, 2018 8:13 am
Hi Heidi,
Thanks for the tips.

Back in March of this year, during the first phase of this project, I wrote a quick post on how to use NVDA OCR for ebooks and for dealing with inaccessible software. I'm not sure if those posts can still be accessed, so I'll just paste it here.
Hi all,
I thought I would share my method of extracting text from images on the PC. One of the features of NVDA that is neat is the ability to do OCR on objects such as images. It's called Windows 10 OCR. It only supports English for now but I'm sure more languages will be added in the near future. If you've used the text features of Seeing AI, then you'll know how good the OCR is. Well, NVDA uses Microsoft's OCR so when you do OCR on an object, you should get good results.

To use Windows 10 OCR, you'll need to be running Windows 10 and the latest copy of NVDA.

First, navigate to an object, such as a graphic that you might find on a web page, using standard keyboard commands. Then press Insert+R, and a virtual document window will pop up containing the extracted text, which you can review with your arrow keys. Sometimes, you might get only a tiny piece of text. What may help in that case is to maximize your window with Alt+Space, then X.

You can use this functionality for more than just graphics. You can use it to navigate inaccessible programs by moving the navigator object to something like the main window of a program and then doing OCR on it. In the virtual document window, you can route the mouse pointer to where you want to click and then left click on it. The NVDA manual has more information on object navigation, mouse pointers, and Windows 10 OCR if you're interested in exploring this further.

But that's basically what I used to extract text. In the back of Hit the Ground Running, you may recall there were images of other books. OCR did a good enough job to let me know what they were, but it wasn't perfect, so I hunted for those books on the publisher's website and found that the book descriptions were similar to the text in the images.
Hope this helps.

Re: how to discover all images within an ebook and determine what they are

Post by Heidi » Thu Nov 01, 2018 12:55 pm
Thank you for your very helpful post! The more I use Windows OCR, the more I love its accuracy and ease of use. I'll definitely check out a few inaccessible apps with OCR. It can make such a big difference.

Re: how to discover all images within an ebook and determine what they are

Post by Danny » Thu Nov 01, 2018 2:29 pm
Good afternoon all,

First of all, Heidi, thank you! I wasn't aware of NVDA's incorporation of Windows OCR! I can't wait to try it.

I want to give another shout-out for unpacking the book's code directly, instead of converting it with something like Codex. I'm comparing the results of Codex with Pandoc, and will post my results in the ePub Reports Dropbox folder a little later. But so far, I've found unzipping the ePub archive to be the most accurate way to test its coding.

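You can find lists by searching for <ul, <ol or <li followed by a space. Note that the <li> tag often has other markup (attributes) in it, so it's important to search for "<li " with the trailing space rather than "<li>"; the space also keeps the search from matching <link> tags. A plain <li> with no attributes won't match "<li ", so it's worth searching for "<li>" as well.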

I like to perform a multi-file search on the various files in the unpacked ePub archive, to ensure I'm finding all the tags I'm interested in across all the files. Then, I can open up any files of note and search for those tags to see them in context.
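
If your editor doesn't offer a multi-file search, something like the following Python sketch could stand in for it. It is an illustration only, built on the standard library; find_tag and search_folder are hypothetical names, and the pattern is an assumption about what counts as a match: "<li" counts only when followed by whitespace, ">" or "/", so it finds both <li> and <li class="..."> but never <link>.

```python
import re
from pathlib import Path


def tag_pattern(name):
    # Match "<name" only when followed by whitespace, ">" or "/",
    # so that searching for "li" does not also match "<link".
    return re.compile(r"<" + re.escape(name) + r"(?=[\s>/])", re.IGNORECASE)


def find_tag(text, name):
    """Return the 1-based line numbers on which the tag opens."""
    pattern = tag_pattern(name)
    return [n for n, line in enumerate(text.splitlines(), 1)
            if pattern.search(line)]


def search_folder(root, name):
    """Map each HTML/XHTML file under root to the lines where the tag opens."""
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() in {".xhtml", ".html", ".htm"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            lines = find_tag(text, name)
            if lines:
                hits[str(path)] = lines
    return hits
```

Running search_folder on the unpacked folder with a tag name such as "li" or "img" returns the files that contain it and the line numbers to jump to, which makes it easier to open just the files of note.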

I've found page number marking, like everything else, to vary greatly. Usually I just search for "page" followed by a space, which usually turns them up. Of course, they're supposed to be within a <span> with a class, but often publishers will put this data in paragraph or even anchor tags. (Groan)
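
To see which kind of tag a publisher used for page markers, a rough Python sketch like this one might help. It rests on an assumption of mine rather than any rule: that a page marker is an element whose class, id, epub:type or role value contains the word "page". PageMarkerFinder and page_marker_tags are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class PageMarkerFinder(HTMLParser):
    """Count, per tag name, the elements that look like page markers."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Heuristic: any of these attributes mentioning "page".
            if (name in ("class", "id", "epub:type", "role")
                    and value and "page" in value.lower()):
                self.tags[tag] += 1
                break  # count each element once


def page_marker_tags(markup):
    """Return a Counter of tag names that carry page-like attributes."""
    finder = PageMarkerFinder()
    finder.feed(markup)
    finder.close()
    return finder.tags
```

A result dominated by "span" suggests the markers are where they are expected to be; lots of "p" or "a" entries point to the paragraph and anchor habit mentioned above. Because the check is a simple substring match, it can also pick up unrelated names such as "titlepage", so treat the counts as a starting point.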

epub:type semantics are another great way to help with accessibility, but most of the books I review don't use them. If you find an epub:type attribute somewhere in the document outside of the table of contents, that's usually a good sign. Too often, chapters are simply wrapped in <div> tags instead of the semantically richer <section> tag that ePub 3 supports.

ARIA roles are really powerful. To Heidi's point about having to analyze an image with an empty alt attribute to determine whether or not it's decorative, a role of presentation would tell the reader right away that the image is meaningless to us. They seem to be pretty rare, though.
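
A quick way to gauge how much of this semantic markup a book contains is to tally the epub:type and role values in each file. The Python sketch below is a hedged illustration using only the standard library; SemanticsCounter and count_semantics are made-up names, and it simply counts whatever values it finds (for instance the standard ARIA presentation role) without judging whether they are used correctly.

```python
from collections import Counter
from html.parser import HTMLParser


class SemanticsCounter(HTMLParser):
    """Tally every epub:type and role value found on any element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("epub:type", "role") and value:
                # Both attributes may hold several space-separated values.
                for token in value.split():
                    self.counts[(name, token)] += 1


def count_semantics(markup):
    """Return a Counter keyed by (attribute name, value)."""
    counter = SemanticsCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts
```

An empty result for a chapter file means neither attribute is used there at all, which matches what I see in most of the books I review.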

I wish publishers would be more careful with their <title> tags. These are found in the head of each XHTML document, right near the top. Usually, they just repeat the book's title, though sometimes publishers get creative with a filename, gobbledygook, or even the book's ISBN. Of course, this tag is supposed to contain the title of the chapter or section in that particular document.
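
One way to spot the repeated-title problem without opening every file is to pull the <title> out of each document and group the files that share one. This is a minimal Python sketch under my own assumptions: get_title and shared_titles are hypothetical helpers, and the input is a dictionary mapping each file name to its markup, which you would fill in from the unpacked book.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Capture the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            # Collapse runs of whitespace and trim the ends.
            self.title = " ".join("".join(self._parts).split())


def get_title(markup):
    """Return the document's title text, or None if there is no <title>."""
    grabber = TitleGrabber()
    grabber.feed(markup)
    grabber.close()
    return grabber.title


def shared_titles(docs):
    """docs maps file name -> markup. Return titles used by more than one file."""
    by_title = {}
    for name, markup in docs.items():
        by_title.setdefault(get_title(markup), []).append(name)
    return {title: names for title, names in by_title.items() if len(names) > 1}
```

If shared_titles returns one entry listing nearly every file in the book, the publisher has most likely repeated the book's title everywhere instead of naming each chapter.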

So, don't be afraid to change the extension on your ePub from .epub to .zip, and extract all its contents. Windows won't like it - it complains bitterly when you change the extension on a file. I do it in DOS! :) But it's really worth doing.

Please give me a shout if you'd like clarification on any of this. I have certainly asked for plenty myself, and have already benefited greatly from your combined knowledge and patience!

Re: how to discover all images within an ebook and determine what they are

Post by Karoline » Sun Nov 04, 2018 2:30 pm
I finally have a computer that works! Well, mostly. Anyway, just to confirm what you do: you go into File Explorer and just change the extension, yes? Do you then have to run it through other software, or can you unzip it in Windows? This sounds really silly. I should know this. LOL. I finally managed to download Codex once again. I am not quite comfortable with HTML yet. Thank you for the clarifications. Have a great Sunday night!

Re: how to discover all images within an ebook and determine what they are

Post by ka.li » Mon Nov 05, 2018 9:49 am
Hi Karoline,
There are two ways to do this. You can change the extension of the ePub from ".epub" to ".zip" and extract it with the unzip feature built into Windows, or, if you use 7-Zip, you can just go through its regular extraction workflow, since it recognizes an ePub as a container without any renaming.
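
A possible third route, if you happen to have Python installed, is to unpack the book with a few lines of script. An ePub is a ZIP container, so the standard zipfile module can open it without renaming. Treat this as a sketch rather than a recommended tool: unpack_epub is a hypothetical helper, and the list of image extensions is my own guess at the common ones.

```python
import zipfile
from pathlib import Path

# Assumed set of common image extensions; extend it if a book uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".svg"}


def unpack_epub(epub_path, out_dir):
    """Extract an ePub into out_dir and return the image paths inside it."""
    with zipfile.ZipFile(epub_path) as book:
        book.extractall(out_dir)
        names = book.namelist()
    return sorted(n for n in names
                  if Path(n).suffix.lower() in IMAGE_SUFFIXES)
```

The function leaves the original .epub untouched, writes the extracted files into the folder you name, and hands back the paths of the images so you can see where the book keeps them.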
Hope this helps.

Re: how to discover all images within an ebook and determine what they are

Post by rmarion » Mon Nov 05, 2018 11:15 am
Hello. While I am not as well versed as some of you in this much technical detail, I was asked to use the Mac to review the ePub files. I did manage to find a program for the Mac, called eCanCrusher, that unpacks ePub files. It does work for the most part, but for some reason it doesn't accept VoiceOver's mark for drag and drop command. The program requires you to drop the ePub file on the program icon, and it then creates a folder with the unpacked files and folders. Once this is done, you can view the HTML and image files, for example.

Since the program is not accepting the VoiceOver command for dragging and dropping, I end up using my limited vision to drag and drop with my trackpad for now. If I find a workaround, I will let you know.
