Pandoc

Daniella.LP · Post by Daniella.LP » Thu Feb 08, 2018 4:05 pm

Using Pandoc to convert .epub files to other formats

Pandoc is a universal document converter that can be installed on either Windows or Mac computers. For the most part, the tool preserves the original formatting. See https://pandoc.org/

I run Pandoc from either the Command Prompt or Windows PowerShell; both work the same way.

To convert the .epub file of Hit the Ground Running into .docx and .html formats, I did the following:

1. Put the file in a separate folder and renamed it to something simpler, preserving the epub extension (a simpler name helps when entering the specific command to convert); I renamed the file to "Running.epub".

2. Ensured I was in the correct working directory, where I had put the file. In the Command Prompt or PowerShell I typed "cd documents" and pressed Enter; because my file was in a subdirectory, I used the same command again to change into it: "cd foldername".

3. To convert to .docx I typed the following line (from the Command Prompt or PowerShell).

"pandoc -s Running.epub -o Running.docx"

To convert to html the command is almost the same, except for the file extension in the last part:

"pandoc -s Running.epub -o Running.html"

The "-s" is to produce a standalone file; "-o" is to indicate the output, followed by the file name with the desired extension type.
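For anyone who would rather script the two conversions than type them, here is a minimal sketch in Python. It only wraps the same "pandoc -s ... -o ..." commands shown above; the function names build_command and convert are made up for this example and are not part of Pandoc. It assumes Python 3.9 or later and that pandoc is on the PATH.

```python
import subprocess
from pathlib import Path


def build_command(source: str, extension: str) -> list[str]:
    """Return the pandoc argument list for converting `source` to `extension`.

    Hypothetical helper for illustration: it mirrors the commands typed by
    hand in the post, nothing more.
    """
    # Swap the file extension, e.g. Running.epub -> Running.docx
    output = Path(source).with_suffix("." + extension.lstrip("."))
    # -s asks for a standalone file; -o names the output file.
    return ["pandoc", "-s", source, "-o", str(output)]


def convert(source: str, extension: str) -> None:
    """Run pandoc on `source`.

    Raises FileNotFoundError if pandoc is not installed or not on the PATH,
    and subprocess.CalledProcessError if pandoc reports a failure.
    """
    subprocess.run(build_command(source, extension), check=True)
```

Calling convert("Running.epub", "docx") and then convert("Running.epub", "html") from the folder that holds the file should be equivalent to typing the two commands by hand.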

When I opened the resulting .html file in IE, I noticed a random link named exactly the same as the folder where I had my files, but it did not open anything when I clicked on it. So I opened the same .html file in Firefox, Chrome and Safari and tried to find the link, but it was not there. When I bring up a list of links, I can find it in IE but not in any of the other browsers. I assume it is an issue with IE reading something else, but I don't know what.

Unfortunately, Pandoc does not convert files from .pdf. I usually run those first through Kurzweil 1000, or open them in a PDF viewer. Kurzweil seems to OCR the .pdf files, even if they are text-based.

steve.murgaski · Post by steve.murgaski » Thu Feb 15, 2018 2:25 pm

Hi, this is a little off-topic, but in Kurzweil version 14 there's an option to emphasize embedded text instead of recognized images when you open a .pdf file. By far my favorite program for reading .pdf in Windows is QRead, but it's not free and it can't do OCR. So maybe telling Kurzweil to emphasize text would be a good option.

I'm impressed you use Windows PowerShell. That's been on my list of things to check out for a while.

Daniella.LP · Post by Daniella.LP » Fri Feb 23, 2018 4:50 am

Thank you for the info about Kurzweil 14 and QRead. I will definitely check QRead out; even if it is not free, I need a better way of reading PDF files.

I am still learning to use Windows PowerShell. I mostly just type commands; I find it a bit challenging to edit them. The JAWS command to virtualize the window (Insert+Alt+W) is very useful for reviewing output. In the Command Prompt or PowerShell, it places all the contents into an environment similar to the way a website is presented in JAWS. I can use all the regular JAWS reading commands to read the content, and then press Escape to return to the command line proper. Sometimes, when I need to type a long command (for example, to use Pandoc with many different flags), I type and review it in a text editor, copy it to the clipboard, and then paste it into the Command Prompt or PowerShell.

Post by farrah » Fri Feb 23, 2018 12:04 pm

@Daniellapl I tried using pandoc to convert from epub to html, and it works like a charm. I'm really liking pandoc. It would be great now if we could have an automated way of mapping the resulting html to proper HTML5 (as required in the EPUB 3 spec). For example, publishers often use class attributes to identify content instead of the proper section elements with attached epub:type or ARIA roles. So you might see something like <div class="copyright">, and we'd need to convert this to a semantically meaningful structural element like <section epub:type="copyright-page">. We'd also want to map things like page number breaks and footnotes to their proper EPUB 3 equivalents, et cetera.
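To make the idea of an automated mapping a little more concrete, here is a rough sketch in Python using only the standard library. It is an illustration under assumptions, not a finished tool: it assumes the markup is well-formed XHTML (Pandoc's HTML output may need cleaning up before it parses as XML), the CLASS_TO_EPUB_TYPE table is a hypothetical mapping that would have to be built per publisher, and the function name add_semantics is invented for this example. Only the copyright case mentioned above is included.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
EPUB_NS = "http://www.idpf.org/2007/ops"

# Keep XHTML elements unprefixed and write the attribute as epub:type.
ET.register_namespace("", XHTML_NS)
ET.register_namespace("epub", EPUB_NS)

# Hypothetical mapping from publisher class names to epub:type values.
CLASS_TO_EPUB_TYPE = {"copyright": "copyright-page"}


def add_semantics(xhtml: str) -> str:
    """Turn <div class="..."> into <section epub:type="..."> where a mapping exists."""
    root = ET.fromstring(xhtml)
    for element in list(root.iter()):
        if not isinstance(element.tag, str):
            continue  # skip anything that is not an ordinary element
        # Tags look like "{namespace}div" when namespaced, or plain "div".
        namespace, _, local = element.tag.rpartition("}")
        if local != "div":
            continue
        for class_name in element.get("class", "").split():
            epub_type = CLASS_TO_EPUB_TYPE.get(class_name)
            if epub_type:
                element.tag = f"{namespace}}}section" if namespace else "section"
                element.set(f"{{{EPUB_NS}}}type", epub_type)
                break
    return ET.tostring(root, encoding="unicode")
```

A div whose class is not in the table is left untouched, so the table can grow one class at a time. Page breaks and footnotes would need their own rules and are not covered here.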
