mirror of
https://github.com/standardebooks/web.git
synced 2025-07-21 06:45:14 -04:00
Remove reference to now-obsolete 'se clean --single-lines' option, and reference to removing spaces around opening/closing tags
parent a887b886c0
commit bdc627ce65
1 changed file with 1 addition and 2 deletions
@@ -124,8 +124,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
 <h2>Clean up the source text</h2>
 <p>If you open up any of the chapter files we now have in the <code class="path">src/epub/text/</code> folder, you’ll notice that the code isn’t very clean. Paragraphs are split over multiple lines, indentation is all wrong, and so on.</p>
 <p>If you try opening a chapter in a web browser, you’ll also likely get an error if the chapter includes any HTML entities, like <code class="html">&amp;mdash;</code>. This is because Gutenberg uses plain HTML, which allows entities, but epub uses XHTML, which doesn’t.</p>
-<p>We can fix all of this pretty quickly using <code class="bash"><b>se</b> clean</code>. <code class="bash"><b>se</b> clean</code> accepts as its argument the root of a Standard Ebook directory, and with the <code class="bash">--single-lines</code> option it’ll remove the hard line wrapping that Gutenberg is fond of. We’re already in the root, so we pass it <code class="path">.</code>.</p><code class="terminal"><span><b>se</b> clean --single-lines <u>.</u></span></code>
+<p>We can fix all of this pretty quickly using <code class="bash"><b>se</b> clean</code>. <code class="bash"><b>se</b> clean</code> accepts as its argument the root of a Standard Ebook directory. We’re already in the root, so we pass it <code class="path">.</code>.</p><code class="terminal"><span><b>se</b> clean <u>.</u></span></code>
-<p>Things look much better now, but we’re not perfect yet. If you open a chapter you’ll notice that the <code class="html"><span class="p">&lt;</span><span class="nt">p</span><span class="p">&gt;</span></code> and <code class="html"><span class="p">&lt;</span><span class="nt">h2</span><span class="p">&gt;</span></code> tags have a space between the tag and the text. We can clean that up with a few <code class="bash"><b>perl</b></code> commands.</p><code class="terminal"><span><b>perl</b> -pi -e <i>"s|&lt;(p|h2)&gt;\s+|&lt;\1&gt;|g"</i> src/epub/text/chapter<i class="glob">*</i></span> <span><b>perl</b> -pi -e <i>"s|\s+&lt;/(p|h2)&gt;|&lt;/\1&gt;|g"</i> src/epub/text/chapter<i class="glob">*</i></span></code>
 <p>Finally, we have to do a quick runthrough of each file by hand to cut out any lingering Gutenberg markup that doesn’t belong. In <i>Jekyll</i>, notice that each chapter ends with some extra empty <code class="html"><span class="p">&lt;</span><span class="nt">div</span><span class="p">&gt;</span></code>s and <code class="html"><span class="p">&lt;</span><span class="nt">p</span><span class="p">&gt;</span></code>s. These were used by the original transcriber to put spaces between the chapters, and they’re not necessary anymore, so remove them before continuing.</p>
 <p>Now our chapter 1 source looks like this:</p>
 <figure><code class="html full"><span class="cp">&lt;?xml version="1.0" encoding="utf-8"?&gt;</span>