From bdc627ce6510c7bc23821fafde4e6f099e42baab Mon Sep 17 00:00:00 2001
From: Alex Cabal
Date: Tue, 28 Apr 2020 12:44:13 -0500
Subject: [PATCH] Remove reference to now-obsolete 'se clean --single-lines' option, and reference to removing spaces around opening/closing tags

---
 www/contribute/producing-an-ebook-step-by-step.php | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/www/contribute/producing-an-ebook-step-by-step.php b/www/contribute/producing-an-ebook-step-by-step.php
index dad2c248..27e06180 100644
--- a/www/contribute/producing-an-ebook-step-by-step.php
+++ b/www/contribute/producing-an-ebook-step-by-step.php
@@ -124,8 +124,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll

Clean up the source text

If you open up any of the chapter files we now have in the src/epub/text/ folder, you’ll notice that the code isn’t very clean. Paragraphs are split over multiple lines, indentation is all wrong, and so on.

If you try opening a chapter in a web browser, you’ll also likely get an error if the chapter includes any HTML entities, like &mdash;. This is because Gutenberg uses plain HTML, which allows entities, but epub uses XHTML, which doesn’t.
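
To see why a named entity trips up an XML parser, here is a minimal sketch using Python's standard-library parser. It is only an illustration and not part of the Standard Ebooks toolset; the helper name is_well_formed and the sample strings are made up for this example. With no DTD loaded, an XML parser knows only the five predefined entities, so a named HTML entity is rejected while the literal character or a numeric reference is accepted.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Sample strings invented for illustration.
html_style = "<p>Jekyll&mdash;or Hyde</p>"   # named HTML entity
literal_char = "<p>Jekyll\u2014or Hyde</p>"  # literal em dash character
numeric_ref = "<p>Jekyll&#8212;or Hyde</p>"  # numeric character reference

for label, markup in [
    ("named entity", html_style),
    ("literal character", literal_char),
    ("numeric reference", numeric_ref),
]:
    print(label, "->", "parses" if is_well_formed(markup) else "rejected")
```

Only the first string is rejected, which matches the kind of error a browser reports when it opens such a chapter as XHTML.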

-We can fix all of this pretty quickly using se clean. se clean accepts as its argument the root of a Standard Ebook directory, and with the --single-lines option it’ll remove the hard line wrapping that Gutenberg is fond of. We’re already in the root, so we pass it ..

-se clean --single-lines .

-Things look much better now, but we’re not perfect yet. If you open a chapter you’ll notice that the <p> and <h2> tags have a space between the tag and the text. We can clean that up with a few perl commands.

-perl -pi -e "s|<(p|h2)>\s+|<\1>|g" src/epub/text/chapter*
-perl -pi -e "s|\s+</(p|h2)>|</\1>|g" src/epub/text/chapter*

+We can fix all of this pretty quickly using se clean. se clean accepts as its argument the root of a Standard Ebook directory. We’re already in the root, so we pass it ..

+se clean .

Finally, we have to do a quick runthrough of each file by hand to cut out any lingering Gutenberg markup that doesn’t belong. In Jekyll, notice that each chapter ends with some extra empty <div>s and <p>s. These were used by the original transcriber to put spaces between the chapters, and they’re not necessary anymore, so remove them before continuing.
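
The cleanup itself is done by hand, but a short script can help spot the leftovers first. The following is a rough sketch and not an se command: the helper find_empty_blocks, its regular expression, and the sample markup are all assumptions made for illustration. It only reports empty <div> and <p> elements along with their line numbers, so the decision to delete stays with you.

```python
import re

# Assumed pattern: a <div> or <p>, with optional attributes, that holds
# nothing but whitespace or <br/> elements.
EMPTY_BLOCK = re.compile(r"<(div|p)(\s[^>]*)?>(\s|<br\s*/?>)*</\1>")


def find_empty_blocks(xhtml: str) -> list[tuple[int, str]]:
    """Return (line number, matched markup) for each empty <div> or <p>."""
    hits = []
    for match in EMPTY_BLOCK.finditer(xhtml):
        # Count the newlines before the match to get a 1-based line number.
        line_no = xhtml.count("\n", 0, match.start()) + 1
        hits.append((line_no, match.group(0)))
    return hits


# Invented sample standing in for the end of a chapter file.
sample = """<p>Last real paragraph.</p>
<div></div>
<p> </p>
"""

for line_no, markup in find_empty_blocks(sample):
    print(f"line {line_no}: {markup}")
```

A regular expression is a blunt tool for markup, so treat the output as a list of places to look rather than a list of safe deletions.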

Now our chapter 1 source looks like this:

<?xml version="1.0" encoding="utf-8"?>