diff --git a/www/contribute/producing-an-ebook-step-by-step.php b/www/contribute/producing-an-ebook-step-by-step.php index 49fd6058..77200f66 100644 --- a/www/contribute/producing-an-ebook-step-by-step.php +++ b/www/contribute/producing-an-ebook-step-by-step.php @@ -18,9 +18,38 @@ require_once('Core.php');
If you commingle editorial changes with other changes in your commits, we’ll be forced to ask you to rebase your repository to tease them out. This is very difficult and you’ll get frustrated—so please make sure to keep editorial commits separate!
If your working directory contains a mix of changes and you only want to commit some of them, git add --patch
is a useful way to only commit parts of a file.
Set up the Standard Ebooks toolset and make sure it’s up-to-date
+Locate page scans of your book online
+Create a Standard Ebooks epub skeleton
+Do a rough cleanup of the source text and perform the first commit
+Split the source text at logical divisions
+Clean up the source text and perform the second commit
+Typogrify the source text and perform the corresponding commit(s)
+Check for transcription errors
+Converting British quotation to American quotation
+Modernize spelling and hyphenation
+Check for consistent diacritics
+Build and proofread, proofread, proofread!
Standard Ebooks has a toolset that will help you produce an ebook. The toolset installs the se
command, which has various subcommands related to creating Standard Ebooks. You can read the complete installation instructions, or if you already have pipx
installed, run:
pipx install standardebooks
The toolset changes frequently, so if you’ve installed the toolset in the past, make sure to update the toolset before you start a new ebook:
@@ -29,7 +58,7 @@ require_once('Core.php');
se --version
The best place to look for public domain ebooks to produce is Project Gutenberg. If downloading from Project Gutenberg, be careful of the following:
For this guide, we’ll use The Strange Case of Dr. Jekyll and Mr. Hyde, by Robert Louis Stevenson. If you search for it on Gutenberg, you’ll find that there are two versions; the most popular one is a poor choice to produce, because the transcriber included the page numbers smack in the middle of the text! What a pain those’d be to remove. The less popular one is a better choice to produce, because it’s a cleaner transcription.
As you produce your book, you’ll want to check your work against the actual page scans. Often the scans contain formatting that is missing from the source transcription. For example, older transcriptions sometimes throw away italics entirely, and you’d never know unless you looked at the page scans. So finding page scans is essential.
Below are the three big resources for page scans. You should prefer them in this order:
You’ll enter a link to the page scans you used in the content.opf
metadata as a <dc:source>
element.
An epub file is just a bunch of files arranged in a particular folder structure, then all zipped up. That means editing an epub file is as easy as editing a bunch of text files within a certain folder structure, then creating a zip file out of that folder.
You can’t just arrange files willy-nilly, though—the epub standard expects certain files in certain places. So once you’ve picked a book to produce, create the basic epub skeleton in a working directory. se create-draft
will create a basic Standard Ebooks epub folder structure, initialize a Git repository within it, and prefill a few fields in content.opf
(the file that contains the ebook’s metadata).
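To make the "zipped folder" idea concrete, here is a minimal sketch of packing an epub folder by hand. The helper name `zip_epub` and its arguments are hypothetical, and it assumes the `zip` utility is installed. It is only an illustration of the container layout, not how the toolset builds files; later in this guide, se build does the real work.

```shell
# Hypothetical helper: zip_epub SRC_DIR OUT_FILE
# Assumes SRC_DIR holds a `mimetype` file plus the rest of the epub tree,
# and that OUT_FILE is an absolute path outside SRC_DIR that does not exist yet.
zip_epub() {
    (
        cd "$1" || exit 1
        # The epub container expects `mimetype` to be the first entry,
        # stored uncompressed (-0) and without extra file attributes (-X).
        zip -q -X -0 "$2" mimetype &&
        # Add everything else, compressed, skipping the file already stored.
        zip -q -X -r -9 "$2" . -x mimetype
    )
}
```

Calling `zip_epub "$PWD/src" "$HOME/dist/draft.epub"` (both paths are examples) would produce a zip archive whose first entry is `mimetype`.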
If you inspect the folder we just created, you’ll see it looks something like this:
For this first commit:
git add -A
git commit -m "Initial commit"
The file we downloaded contains the entire work. Jekyll is a short work, but for longer works it quickly becomes impractical to have the entire text in one file. Not only is it a pain to edit, but ereaders often have trouble with extremely large files.
The next step is to split the file at logical places; that usually means at each chapter break. For works that contain their chapters in larger “parts,” the part division should also be its own file. For example, see Treasure Island.
To split the work, we use se split-file
. se split-file
takes a single file and starts a new file every time it encounters the markup <!--se:split-->
. se split-file
automatically includes basic header and footer markup in each split file.
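The splitting idea can be sketched with plain awk. This is only an illustration of the technique, not the tool's implementation: the function name and the `part-N.txt` output names are hypothetical, the marker is assumed to sit on a line by itself, and unlike se split-file the sketch adds no header or footer markup.

```shell
# Hypothetical helper: split_at_markers INPUT_FILE OUTPUT_DIR
# Writes OUTPUT_DIR/part-1.txt, part-2.txt, ... (names are an assumption).
split_at_markers() {
    awk -v dir="$2" '
        BEGIN { n = 1 }
        # A marker line closes the current part and starts the next one;
        # the marker line itself is not copied to the output.
        /<!--se:split-->/ { close(out); n++; next }
        { out = dir "/part-" n ".txt"; print > out }
    ' "$1"
}
```

Running `split_at_markers body.xhtml parts/` on a file with two markers would leave three files in `parts/`.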
Once we’re happy that the source file has been split correctly, we can remove it.
rm src/epub/text/body.xhtml
If you open up any of the chapter files we now have in the src/epub/text/
folder, you’ll notice that the code isn’t very clean. Paragraphs are split over multiple lines, indentation is all wrong, and so on.
If you try opening a chapter in a web browser, you’ll also likely get an error if the chapter includes any HTML entities, like &mdash;
. This is because Gutenberg uses plain HTML, which allows entities, but epub uses XHTML, which doesn’t.
We can fix all of this pretty quickly using se clean
. se clean
accepts as its argument the root of a Standard Ebook directory. We’re already in the root, so we pass it .
.
se clean .
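If you’re curious which named HTML entities a chapter contains, before or after cleaning, a small grep pipeline can list them. This is a sketch, not part of the toolset: the function name is hypothetical, and it deliberately skips the five entities that XML itself predefines (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`), since those are valid in XHTML.

```shell
# Hypothetical helper: list_html_entities FILE...
# Prints each distinct named entity found, one per line.
list_html_entities() {
    cat "$@" |
        grep -o '&[A-Za-z][A-Za-z0-9]*;' |
        # Drop the five entities predefined by XML, which XHTML accepts.
        grep -v -E '^&(amp|lt|gt|quot|apos);$' |
        sort -u
}
```

An empty result means no named entities other than the XML ones were found in the files you passed in.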
@@ -161,7 +190,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
Once the file split and cleanup is complete, you can perform your second commit.
git add -A
git commit -m "Split files and clean"
Now that we have a clean starting point, we can start getting the real work done. se typogrify
can do a lot of the heavy lifting necessary to bring an ebook up to Standard Ebooks typography standards.
Like se clean
, se typogrify
accepts as its argument the root of a Standard Ebook directory.
se typogrify .
Among other things, se typogrify
does the following:
Once you’ve searched the work for the common issues above, if any manual changes were necessary, you should perform the fourth commit.
git add -A
git commit -m "Manual typography changes"
Transcriptions often have errors, because the O.C.R. software might confuse letters for other, more unusual characters, or because the ebook’s character set got mangled somewhere along the way from the source to your repository. You’ll find most transcription errors when you proofread the text, but right now you can use the se find-unusual-characters
tool to see a list of any unusual characters in the transcription. If the tool outputs any, check the source to make sure those characters aren’t errors.
se find-unusual-characters .
If any errors had to be corrected, a commit is needed as well.
git add -A
git commit -m "Correct transcription errors"
Works often include footnotes, either added by an annotator or as part of the work itself. Since ebooks don’t have a concept of a “page,” there’s no place for footnotes to go. Instead, we convert footnotes to a single endnotes file, which will provide popup references in the final epub.
The endnotes file and the format for endnote links are standardized in the SEMoS.
If you find that you accidentally mis-ordered an endnote, never fear! se shift-endnotes
will allow you to quickly rearrange endnotes in your ebook.
Jekyll doesn’t have any footnotes or endnotes, so we skip this step.
If a work has illustrations besides the cover and title pages, we include a “list of illustrations” at the end of the book, after the endnotes but before the colophon. The LoI file is also standardized.
If an LoI is created, do a corresponding commit.
git add -A
git commit -m "Add LOI"
Jekyll doesn’t have any illustrations, so we skip this step.
If the work you’re producing uses British quotation style (single quotes for dialog and other outer quotes versus double quotes in American), we have to convert it to American style. We use American style in part because it’s easier to programmatically convert from American to British than it is to convert the other way around. Skip this step if your work is already in American style.
se british2american
attempts to automate the conversion. Your work must already be typogrified (the previous step in this guide) for the script to work.
se british2american .
While se british2american
tries its best, thanks to the quirkiness of English punctuation rules it’ll invariably mess some stuff up. Proofreading is required after running the conversion.
After you’ve run the conversion, do another commit.
git commit -am "Convert from British-style quotation to American style"
Part of producing a book for Standard Ebooks is adding meaningful semantics wherever possible in the text. se semanticate
does a little of that for us—for example, for some common abbreviations—but much of it has to be done by hand.
Adding semantics means two things:
After you’ve added semantics according to the SEMoS, do another commit.
git commit -am "Manually add additional semantics"
Many older works use outdated spelling and hyphenation that would distract a modern reader. (For example, to-night
instead of tonight
). se modernize-spelling
automatically removes hyphens from words that used to be hyphenated but are written as a single word in modern English spelling.
Do run this tool on prose. Don’t run this tool on poetry.
se modernize-spelling .
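To show the kind of substitution involved, here is a toy sed filter. The function name and the three-word list are hypothetical samples chosen for illustration; se modernize-spelling works from its own, much larger list, so treat this only as a picture of the idea.

```shell
# Toy illustration only: the word list below is a hypothetical sample,
# not the list that `se modernize-spelling` actually uses.
# Reads text on stdin and writes the modernized text to stdout.
modernize_sample() {
    sed -e 's/to-night/tonight/g' \
        -e 's/To-night/Tonight/g' \
        -e 's/to-day/today/g' \
        -e 's/To-day/Today/g' \
        -e 's/to-morrow/tomorrow/g' \
        -e 's/To-morrow/Tomorrow/g'
}
```

The sketch matches plain substrings with no word-boundary checks, which is one reason a curated tool and a review of the resulting diff are still needed.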
@@ -464,7 +493,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
git commit -am "[Editorial] Modernize hyphenation and spelling"
Sometimes during transcription or even printing, instances of some words might have diacritics while others don’t. For example, a word in one chapter might be spelled châlet
, but in the next chapter it might be spelled chalet
.
se find-mismatched-diacritics
lists these instances for you to review. Spelling should be normalized across the work so that all instances of the same word are spelled in the same way. Keep the following in mind as you review these instances:
If any changes had to be made, a corresponding editorial commit should be done as well.
git commit -am "[Editorial] Correct mismatched diacritics"
Similar to se find-mismatched-diacritics
, se find-mismatched-dashes
lists instances where a compound word is spelled both with and without a dash. Dashes in words should be normalized to one or the other style.
se find-mismatched-dashes .
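The check can be pictured as a small pipeline: collect every word, then report each hyphenated word whose unhyphenated form also appears. The sketch below is a rough approximation under stated assumptions, not the tool’s implementation: the function name is hypothetical, it expects plain text rather than XHTML, and it only understands unaccented ASCII letters.

```shell
# Hypothetical helper: find_mismatched_dashes FILE...
# Prints "dashed / plain" for each word that appears both ways.
find_mismatched_dashes() {
    # Lowercase everything, then list each distinct word once.
    words=$(cat "$@" | tr '[:upper:]' '[:lower:]' |
        grep -o -E '[a-z]+(-[a-z]+)*' | sort -u)
    printf '%s\n' "$words" | grep -e '-' | while IFS= read -r dashed; do
        plain=$(printf '%s' "$dashed" | sed 's/-//g')
        # Report the pair only if the hyphen-free spelling also occurs.
        if printf '%s\n' "$words" | grep -q -x -F -e "$plain"; then
            printf '%s / %s\n' "$dashed" "$plain"
        fi
    done
}
```

Each reported pair still needs a human decision about which spelling to keep.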
If corrections were made, another commit is needed.
git commit -am "[Editorial] Correct mismatched dashes"
<title>
elements
After you’ve added semantics and correctly marked up section headers, it’s time to update the <title>
elements in each chapter to match their expected values.
The se build-title
tool takes a well-marked-up section header from a file, and updates the file’s <title>
element to match:
se build-title .
@@ -496,7 +525,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
git commit -am "Add titles"
In content.opf
, the manifest is a list of all of the files in the ebook. The spine is the reading order of the various XHTML files.
se build-manifest
and se build-spine
will create these for you. Run these on our source directory and they’ll update the <manifest>
and <spine>
elements in content.opf
.
With the spine in the right order, we can now build the table of contents.
The table of contents is a structured document that lets the reader easily navigate the book. In a Standard Ebook, it’s stored outside of the readable text directory with the assumption that the reading system will parse it and display a navigable representation for the user.
Use se build-toc
to generate a table of contents for this ebook.
git commit -am "Add ToC"
Before you build the ebook for proofreading, it’s a good idea to check the ebook for some common problems you might have run into during production.
First, run se clean
one more time to both clean up the source files, and to alert you if there are XHTML parsing errors. Even though we ran se clean
before, it’s likely that the markup formatting has become less than perfect in the course of production. Remember you can run se clean
as many times as you want—it should always produce the same output.
se clean .
@@ -532,7 +561,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
If there are no errors, se lint
will complete silently—but again, at this stage we’re expecting to see some errors because our ebook isn’t done yet.
At this point, our ebook is still missing some important things—a cover, the colophon, and some metadata—but the actual book is in a state where we can start proofreading. We complete a cover-to-cover proofread now, even though there’s still work to be done on the ebook, because once you’ve actually read the book, you’ll have a better idea of what kind of cover to select and what to write in the metadata description.
se build
will create a usable epub file for transfer to your ereader. We’ll run it with the --kindle
and --kobo
flags to build files for Kindles and Kobos too. If you won’t be using a Kindle or Kobo, you can omit those flags.
se build --output-dir=$HOME/dist/ --kindle --kobo .
@@ -577,7 +606,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
content.opf
is the file that contains the ebook metadata like author, title, description, and reading order. Most of the work will be filling in that basic information and including links to various resources related to the text. We already completed the manifest and spine in an earlier step.
content.opf
is standardized. See the Metadata section of the SEMoS for details on how to fill it out.
The last details to complete here are writing the short and long descriptions, verifying any Wikipedia links that se create-draft
automatically found, adding cover artist metadata, filling out any missing author or contributor metadata, and adding your own metadata as the ebook producer.
git commit -am "Complete content.opf"
se create-draft
put a skeleton imprint.xhtml
file in the ./src/epub/text/
folder. Fill out the links to the transcription and page scans.
There’s also a skeleton colophon.xhtml
file. Now that we have the cover image and artist, we can fill out the various fields there. Make sure to credit the original transcribers of the text (generally we assume them to be whoever’s name is on the file we download from Project Gutenberg) and to include a link back to the Gutenberg text we used, along with a link to any scans we used (from the Internet Archive or Hathi Trust, for example).
You can also include your own name as the producer of this Standard Ebooks edition. Besides that, the colophon is standardized; don’t get too creative with it.
@@ -645,7 +674,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
git commit -am "Complete the imprint and colophon"
It’s a good idea to run se typogrify
and se clean
one more time before running these final checks. Make sure to review the changes with git difftool
before accepting them—se typogrify
is usually right, but not always!
Now that our ebook is complete, let’s verify that there are no errors at the S.E. style level:
se lint .
@@ -654,7 +683,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
Once that completes without errors, we’re ready to move on to the final step!
You’re ready to publish!