mirror of
https://github.com/standardebooks/web.git
synced 2025-07-14 10:31:59 -04:00
Convert step-by-step instructions to use perl instead of sed
The sed commands rely on GNU extensions that aren’t POSIX-compatible, so they fail on BSD systems. We can use perl instead (enforcing UTF-8 where necessary). Also add a warning to check roman numerals.
This commit is contained in:
parent
334753103c
commit
2323a5a574
1 changed files with 6 additions and 6 deletions
|
@ -115,7 +115,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
|
|||
<p>The file we downloaded contains the entire work. <i>Jekyll</i> is a short work, but for longer works it quickly becomes impractical to have the entire text in one file. Not only is it a pain to edit, but ereaders often have trouble with extremely large files.</p>
|
||||
<p>The next step is to split the file at logical places; that usually means at each chapter break. For works that contain their chapters in larger “parts,” the part division should also be its own file. For example, see <i><a href="/ebooks/robert-louis-stevenson/treasure-island">Treasure Island</a></i>.</p>
|
||||
<p>To split the work, we use <code class="bash"><b>se</b> split-file</code>. <code class="bash"><b>se</b> split-file</code> takes a single file and breaks it into a new file every time it encounters the markup <code class="html"><span class="c"><!--se:split--></span></code>. <code class="bash"><b>se</b> split-file</code> automatically includes basic header and footer markup in each split file.</p>
|
||||
<p>Notice that in our source file, each chapter is marked with an <code class="html"><span class="p"><</span><span class="nt">h2</span><span class="p">></span></code> element. We can use that to our advantage and save ourselves the trouble of adding the <code class="html"><span class="c"><!--se:split--></span></code> markup by hand:</p><code class="terminal"><span><b>sed</b> --in-place <!--Single quote to prevent ! from becoming history expansion--><i>'s|<h2|<!--se:split--><h2|g'</i> <u>src/epub/text/body.xhtml</u></span></code>
|
||||
<p>Notice that in our source file, each chapter is marked with an <code class="html"><span class="p"><</span><span class="nt">h2</span><span class="p">></span></code> element. We can use that to our advantage and save ourselves the trouble of adding the <code class="html"><span class="c"><!--se:split--></span></code> markup by hand:</p><code class="terminal"><span><b>perl</b> -pi -e <!--Single quote to prevent ! from becoming history expansion--><i>'s|<h2|<\!--se:split--><h2|g'</i> <u>src/epub/text/body.xhtml</u></span></code>
|
||||
<p>Now that we’ve added our markers, we split the file. <code class="bash"><b>se</b> split-file</code> puts the results in our current directory and conveniently names them by chapter number.</p><code class="terminal"><span><b>se</b> split-file <u>src/epub/text/body.xhtml</u></span> <span><b>mv</b> chapter<i class="glob">*</i> <u>src/epub/text/</u></span></code>
|
||||
<p>Once we’re happy that the source file has been split correctly, we can remove it.</p><code class="terminal"><span><b>rm</b> <u>src/epub/text/body.xhtml</u></span></code>
|
||||
</li>
|
||||
|
@ -142,7 +142,7 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
|
|||
<span class="p"></</span><span class="nt">section</span><span class="p">></span>
|
||||
<span class="p"></</span><span class="nt">body</span><span class="p">></span>
|
||||
<span class="p"></</span><span class="nt">html</span><span class="p">></span></code></figure>
|
||||
<p>If you look carefully, you’ll notice that the <code class="html"><span class="p"><</span><span class="nt">html</span><span class="p">></span></code> element has the <code class="html"><span class="na">xml:lang</span><span class="o">=</span><span class="s">"en-US"</span></code> attribute, even though our source text uses British spelling! We have to change the <code class="html"><span class="na">xml:lang</span></code> attribute for the source files to match the actual language, which in this case is en-GB. Let’s do that now:</p><code class="terminal"><span><b>sed</b> --in-place <i>"s|en-US|en-GB|g"</i> src/epub/text/chapter<i class="glob">*</i></span></code>
|
||||
<p>If you look carefully, you’ll notice that the <code class="html"><span class="p"><</span><span class="nt">html</span><span class="p">></span></code> element has the <code class="html"><span class="na">xml:lang</span><span class="o">=</span><span class="s">"en-US"</span></code> attribute, even though our source text uses British spelling! We have to change the <code class="html"><span class="na">xml:lang</span></code> attribute for the source files to match the actual language, which in this case is en-GB. Let’s do that now:</p><code class="terminal"><span><b>perl</b> -pi -e <i>"s|en-US|en-GB|g"</i> src/epub/text/chapter<i class="glob">*</i></span></code>
|
||||
<p>Note that we <em>don’t</em> change the language for the metadata or front/back matter files, like <code class="path">content.opf</code>, <code class="path">titlepage.xhtml</code>, or <code class="path">colophon.xhtml</code>. Those must always be in American spelling, so they’ll always have the en-US language tag.</p>
|
||||
</li>
|
||||
<li>
|
||||
|
@ -259,11 +259,11 @@ proceed to seal up my confession, I bring the life of that unhappy Henry Jekyll
|
|||
<ul>
|
||||
<li>
|
||||
<p>Semantics for italics: <code class="html"><span class="p"><</span><span class="nt">em</span><span class="p">></span></code> should be used when a passage is emphasized, as in when dialog is shouted or whispered. <code class="html"><span class="p"><</span><span class="nt">i</span><span class="p">></span></code> is used for all other italics, <a href="/manual/latest/4-semantics#4.2">with the appropriate semantic inflection</a>. Older transcriptions usually use just <code class="html"><span class="p"><</span><span class="nt">i</span><span class="p">></span></code> for both, so you must change them manually if necessary.</p>
|
||||
<p>Sometimes, transcriptions from Project Gutenberg may use ALL CAPS instead of italics. To replace these, you can use <code class="bash"><b>sed</b></code>:</p>
|
||||
<code class="terminal"><span><b>sed</b> --regexp-extended --in-place <i>"s|([A-Z’]{2,})|<em>\L\1</em>|g"</i> src/epub/text/<i class="glob">*</i></span></code>
|
||||
<p>Sometimes, transcriptions from Project Gutenberg may use ALL CAPS instead of italics. To replace these, you can use:</p>
|
||||
<code class="terminal"><span><b>perl</b> -pi -e <i>"use utf8;s|([A-Z’]{2,})|<em>\L\1</em>|g"</i> src/epub/text/<i class="glob">*</i></span></code>
|
||||
<p>This will unfortunately replace language tags like <code>en-US</code>, so fix those up with this:</p>
|
||||
<code class="terminal"><span><b>sed</b> --regexp-extended --in-place <i>"s|en-<em>([a-z]+)</em>|en-\U\1|g"</i> src/epub/text/<i class="glob">*</i></span></code>
|
||||
<p>These replacements don’t take Title Caps into account, so use <code class="bash"><b>git</b> diff</code> to review the changes and fix errors before committing.</p>
|
||||
<code class="terminal"><span><b>perl</b> -pi -e <i>"use utf8;s|en-<em>([a-z]+)</em>|en-\U\1|g"</i> src/epub/text/<i class="glob">*</i></span></code>
|
||||
<p>These replacements don’t take Title Caps or roman numerals into account, so use <code class="bash"><b>git</b> diff</code> to review the changes and fix errors before committing.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p><a href="/manual/latest/8-typography#8.1">Semantics rules for chapter titles</a>.</p>
|
||||
|
|