<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>offkey &#187; text</title>
	<atom:link href="http://churchkey.org/category/text/feed/" rel="self" type="application/rss+xml" />
	<link>http://churchkey.org</link>
	<description>software, networks, language, data</description>
	<lastBuildDate>Mon, 02 Apr 2012 20:36:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Software For Everyone</title>
		<link>http://churchkey.org/2010/08/26/bklib-software-for-everyone/</link>
		<comments>http://churchkey.org/2010/08/26/bklib-software-for-everyone/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 14:55:03 +0000</pubDate>
		<dc:creator>ian</dc:creator>
				<category><![CDATA[book]]></category>
		<category><![CDATA[BookLiberator]]></category>
		<category><![CDATA[text]]></category>
		<category><![CDATA[bklib]]></category>
		<category><![CDATA[book digitization]]></category>
		<category><![CDATA[book liberator]]></category>
		<category><![CDATA[djvu]]></category>
		<category><![CDATA[djvubind]]></category>
		<category><![CDATA[image procesing]]></category>
		<category><![CDATA[OCR]]></category>
		<category><![CDATA[page processing]]></category>
		<category><![CDATA[personal digitization]]></category>
		<category><![CDATA[scantailor]]></category>

		<guid isPermaLink="false">http://churchkey.org/?p=352</guid>
		<description><![CDATA[We talk a lot about hardware here at BookLiberator, it is what we spend most of our time on after all, but it is time to shine a light on the software behind the scenes that turns our page images into beautifully produced &#8220;book&#8221; collections. That software comes in two parts, scantailor, written by Joseph [...]]]></description>
			<content:encoded><![CDATA[<p>We talk a lot about hardware here at BookLiberator (it is what we spend most of our time on, after all), but it is time to shine a light on the software behind the scenes that turns our page images into beautifully produced &#8220;book&#8221; collections. That software comes in two parts: <a href="http://scantailor.sourceforge.net/">scantailor</a>, written by <a href="http://sourceforge.net/users/jart/">Joseph Artsimovich</a>, and <a href="https://code.google.com/p/djvubind/">djvubind</a>, written by <a href="http://diybookscanner.org/forum/viewtopic.php?f=3&#038;t=521">strider1551</a> of DIYBookScanner.</p>
<p>Scantailor takes the page images from your camera&#8217;s memory card: </p>
<p><a href="http://bookliberator.org/blog/wp-content/uploads/2010/08/0004a.jpg"><img src="http://bookliberator.org/blog/wp-content/uploads/2010/08/0004a-300x225.jpg" alt="Page from Concerning Beards" title="Raw Page Image" width="300" height="225" class="alignnone size-medium wp-image-119" /></a> </p>
<p>and turns them into nicely cropped, rotated, and white-balanced images like this: </p>
<p><a href="http://bookliberator.org/blog/wp-content/uploads/2010/08/0004a1.jpg"><img src="http://bookliberator.org/blog/wp-content/uploads/2010/08/0004a1-214x300.jpg" alt="Processed image from Concerning Beards" title="Scantailor processed page image" width="214" height="300" class="alignnone size-medium wp-image-125" /></a></p>
<p>Djvubind takes all of those individual images, stitches them together, and compresses the result into a very tiny book in the <a href="https://secure.wikimedia.org/wikipedia/en/wiki/Djvu">djvu format</a>. I have 1400-page academic books that are now pleasantly readable 10 MB files thanks to this combination of Scantailor and Djvubind. </p>
<p>Nearly all of this happens automatically. For each of those 1400-page books, all I had to do was 1) rotate the first two pages, 2) hit &#8220;Go&#8221; for auto crop, 3) draw a box around the few pictures so that their full resolution would be preserved in the final output, and 4) run djvubind. </p>
<p>Very simple, very easy. When djvubind, which is less than <a href="http://diybookscanner.org/forum/viewtopic.php?f=3&#038;t=521">two weeks old</a>, gets the last <a href="https://code.google.com/p/djvubind/issues/detail?id=4">kinks</a> out, it will be possible to use the same four steps to get a tiny book full of beautiful page images which <i>also</i> has a layer of OCR embedded for text searching.</p>
<p>For anyone who has been waiting to get into personal book scanning until the software develops, wait no more.</p>
<p><i>Crossposted with <a href="http://bookliberator.org/blog/?p=118">BookLiberator</a></i></p>
]]></content:encoded>
			<wfw:commentRss>http://churchkey.org/2010/08/26/bklib-software-for-everyone/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bittorrent and Miro, a better Distributed Proofreading</title>
		<link>http://churchkey.org/2010/05/24/bittorrent-and-miro-a-better-distributed-proofreading/</link>
		<comments>http://churchkey.org/2010/05/24/bittorrent-and-miro-a-better-distributed-proofreading/#comments</comments>
		<pubDate>Mon, 24 May 2010 21:21:43 +0000</pubDate>
		<dc:creator>ian</dc:creator>
				<category><![CDATA[book]]></category>
		<category><![CDATA[planet]]></category>
		<category><![CDATA[text]]></category>
		<category><![CDATA[bittorrent]]></category>
		<category><![CDATA[bklib]]></category>
		<category><![CDATA[book digitization]]></category>
		<category><![CDATA[book liberator]]></category>
		<category><![CDATA[distributed proofreading]]></category>
		<category><![CDATA[miro]]></category>
		<category><![CDATA[OCR]]></category>
		<category><![CDATA[project gutenberg]]></category>

		<guid isPermaLink="false">http://churchkey.org/?p=261</guid>
		<description><![CDATA[If you spend some time in the ebook community you inevitably run into Distributed Proofreading, the collaborative proofreading group that supplies Project Gutenberg with high quality text versions of Public Domain books. They are a small community of dedicated editors doing good work. Unfortunately, they are also becoming irrelevant to most of the issues in [...]]]></description>
			<content:encoded><![CDATA[<p>If you spend some time in the ebook community you inevitably run into <a href="http://www.pgdp.net/">Distributed Proofreading</a>, the collaborative proofreading group that supplies <a href="http://www.gutenberg.org/wiki/Main_Page">Project Gutenberg</a> with high-quality text versions of <a href="http://en.wikipedia.org/wiki/Public_domain">Public Domain</a> books. They are a small community of dedicated editors doing good work. Unfortunately, they are also becoming irrelevant to most of the issues in the field because their <a href="http://www.pgdp.net/c/faq/DPflow.php">multi-layer workflow</a> is simply too slow. When organizations like Google are releasing <a href="http://www.readwriteweb.com/archives/google_opens_up_its_epub_archive_download_1_million_books_for_free.php">a million</a> books at once, it is hard to stay relevant when struggling to complete your project&#8217;s 20,000th book, even if those books, unlike Google&#8217;s, are meticulously verified and formatted. Scale and quality both matter and, if we structure it right, we can rework our communal digitization projects to get both.</p>
<p>Currently, Distributed Proofreaders only releases books after spending weeks or months verifying that the text version matches the original page images. The industrial scanning efforts like <a href="http://en.wikipedia.org/wiki/Google_books">Google Books</a> and the <a href="http://en.wikipedia.org/wiki/Million_books_project">Million Books Project</a> generally skip verification entirely and distribute raw text versions with the photographic page images. This is perhaps the greatest key to their large size. Yes, they also paid for large-scale scanning, but scanning is easy compared to proofreading, and <a href="http://bookliberator.org">getting easier</a> all the time. You can be sure that Google&#8217;s library would not be half so large if they had to pay for the kind of quality that Distributed Proofreaders provides. Unfortunately, if the price of this quality is only having thousands rather than millions of books, it is too high to continue paying.</p>
<p>I propose a middle road between the raw image release and the meticulous text one. What if we distributed raw image and unverified text files from day one, but built our distribution network to enable everyone downloading a copy to upload corrections and share those corrections automatically with everyone else who has a copy? If we did that we could gain speed and scale while also building our community of contributors. </p>
<p>Technologically, bittorrent and a rich client like <a href="http://getmiro.com/">miro</a> would get us most of the way there. We would make each book into a miro channel that people would subscribe to when downloading the book. Once a book is downloaded, we would need a book-reading view that we could optimize for whatever common reader actions relate to proofreading. Things like spell check and revealing the text around a section to verify academic citations spring immediately to mind. The key is that corrections should come primarily from people&#8217;s normal interactions with the books they are interested in, no altruism or active volunteering necessary. Once people have corrected their local copies, the client sends those corrections back to the central server, where they can be sent out via RSS to everyone subscribed to that book&#8217;s channel. </p>
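<p>To make that correction loop a little more concrete, here is a minimal Python sketch of what a client might do with the corrections it receives for a book. Everything in it is hypothetical: the <code>Correction</code> record, the <code>apply_corrections</code> function, and the idea of keying corrections by page number are assumptions made for illustration, not part of miro or any existing tool. A correction is applied only if the text it claims to fix is still present on that page, so duplicate or stale corrections are simply skipped.</p>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One reader-submitted fix (hypothetical record format)."""
    page: int   # page number the fix applies to
    old: str    # the OCR text as it currently reads
    new: str    # the reader's corrected text


def apply_corrections(pages, corrections):
    """Return (corrected copy of pages, list of corrections that were applied).

    pages maps page number to that page's text. The input is not modified.
    A correction is skipped when its page is unknown or its old text is
    no longer present, which covers duplicates and stale fixes.
    """
    fixed = dict(pages)
    applied = []
    for c in corrections:
        text = fixed.get(c.page)
        if text is not None and c.old in text:
            # Replace only the first occurrence so one fix changes one spot.
            fixed[c.page] = text.replace(c.old, c.new, 1)
            applied.append(c)
    return fixed, applied
```
</pre>
<p>In a real client the list of corrections would arrive over the book&#8217;s channel and would need more context than a bare string to locate each error reliably; the sketch only shows that applying them can be a small, repeatable step.</p>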
<p>As far as the user is concerned, she simply downloads the books she is interested in with her miro-based library manager and either fixes errors as they bother her, or leaves them alone and watches the text gradually correct itself as other people interested in the same books notice and correct errors. If the errors are really frustrating, she can always fall back to reading the page images and be no worse off than if reading on Google Books or any other large page image-based digital library. </p>
<p>As far as the community is concerned, we get a larger pool of potential contributors because now everyone with a copy can contribute back, and people are able to contribute by sharing spare hard drive space and unused bandwidth rather than having to donate funds to pay for central hosting and distribution. There are plenty of people in the community who have no time or inclination to proofread but would gladly download some book images and leave a torrent running in the background to help share the files more widely.</p>
<p>Making it easier to contribute increases the effectiveness of the project as a whole by helping make sure that all the people who care about a book have the opportunity to put their time into preserving that book. The more people care, the more work gets done. In two years of talking with people about my own book digitization projects, I have grown to have a healthy respect for how much people care about their own books and about preserving them, in whatever form.</p>
<p>In the end, there are only two scalable digitization strategies: teach computers to read, or harness the passion people have for their books for the benefit of us all. A handful of highly organized editors like the Distributed Proofreaders community will always have their place, but they cannot handle the scale of this project alone. We should make sure they have some help.</p>
<p><i>(Crossposted with <a href="http://bookliberator.org/blog/?p=89">bookliberator</a>)</i></p>
]]></content:encoded>
			<wfw:commentRss>http://churchkey.org/2010/05/24/bittorrent-and-miro-a-better-distributed-proofreading/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Time spent</title>
		<link>http://churchkey.org/2009/02/09/time-spent/</link>
		<comments>http://churchkey.org/2009/02/09/time-spent/#comments</comments>
		<pubDate>Tue, 10 Feb 2009 01:24:37 +0000</pubDate>
		<dc:creator>ian</dc:creator>
				<category><![CDATA[planet]]></category>
		<category><![CDATA[text]]></category>
		<category><![CDATA[bkrpr]]></category>

		<guid isPermaLink="false">http://churchkey.org/?p=26</guid>
		<description><![CDATA[If you are curious about how I spend my time, as I know handfuls of people on earth are, here is today&#8217;s answer: Bkrpr Blog &#8211; Paperback testing begins in earnest. The longer answer is that it is a device I&#8217;ve been working on since the summer to more easily convert my paper books into [...]]]></description>
			<content:encoded><![CDATA[<p>If you are curious about how I spend my time, as I know handfuls of people on earth are, here is today&#8217;s answer: <a href="http://bkrpr.org/blog/?p=25">Bkrpr Blog &#8211; Paperback testing begins in earnest</a>.</p>
<p>The longer answer is that it is a device I&#8217;ve been working on since the summer to more easily convert my paper books into a digital form. I&#8217;ve had test hardware working for a number of months, but things were going pretty slowly until <a href="http://hackervisions.org/?author=1">James</a> decided to build some image processing scripts to accompany the effort. Those scripts became a fully fledged Python application around the end of the year, and we&#8217;ve since begun documenting the project in earnest.</p>
<p>The <a href="http://bkrpr.org">bkrpr wiki</a> has all the relevant links, and a nice front page <a href="http://www.youtube.com/watch?v=yjRKeHPRa2k">YouTube</a> video of the device in use. Or a very poor YouTube video of the inside of my room, depending on how you look at it.</p>
<p>If you&#8217;re curious to check it out, take a look at the site, or grab the processed test pages <a href="http://www.flickr.com/photos/bkrpr/">online</a> or in <a href="http://bkrpr.org/download/testimages.pdf">pdf</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://churchkey.org/2009/02/09/time-spent/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Death of the Word Processor</title>
		<link>http://churchkey.org/2008/07/22/the-death-of-the-word-processor/</link>
		<comments>http://churchkey.org/2008/07/22/the-death-of-the-word-processor/#comments</comments>
		<pubDate>Tue, 22 Jul 2008 22:10:53 +0000</pubDate>
		<dc:creator>ian</dc:creator>
				<category><![CDATA[planet]]></category>
		<category><![CDATA[text]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[markup]]></category>
		<category><![CDATA[pandoc]]></category>
		<category><![CDATA[word processors]]></category>

		<guid isPermaLink="false">http://churchkey.org/?p=6</guid>
		<description><![CDATA[I&#8217;ve done a lot of writing in word processors over the years, but now that I work in an office full time, I no longer have much use for them. The passing of this once essential program into the category of &#8220;sometimes comes in handy&#8221; seems worth a moment of reflection, so I offer here [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve done a lot of writing in word processors over the years, but now that I work in an office full time, I no longer have much use for them. The passing of this once essential program into the category of &#8220;sometimes comes in handy&#8221; seems worth a moment of reflection, so I offer here a short eulogy and some words of explanation for the death of the word processor.</p>
<h4 id="looking-back">Looking back</h4>
<p>Before we mourn losing the word processor we should look back at what it is and what it has done for us. At its core the word processor is virtual paper and the promise of <a href="http://en.wikipedia.org/wiki/WYSIWYG">WYSIWYG</a>, the promise that, however you arrange things on the virtual paper, they will look the same on real paper when you print a copy. Other capabilities were built in later, things like change tracking, macro languages, and outlining modes, but these tools never take the spotlight; word processors remain word processors, not outlining tools or version control systems. It is all about the virtual paper.</p>
<p>And it was great. In Jr. High we spent weeks learning how to properly format documents. There were tests on where to place the opening line on a business letter, how many lines to skip between address blocks and the To: or From: lines, and other layout details. We were effectively learning typewriter office skills. Word processors made laying out documents so easy that simple formatting information like this could be stored for us. So, by the time any of us had to write a business letter, we no longer needed to remember how to format it. All we needed to do was pick &#8220;business letter&#8221; from the template menu of our word processor and remember to replace all the dummy &quot;YOUR NAME HERE&quot; text with our real information.</p>
<h4 id="problems-with-virtual-paper">Problems with virtual paper</h4>
<p>It was great, but there were problems. Competing word processors were often incapable of reading each other&#8217;s files, locking people into one camp or another. Time and competition between camps brought new versions of the word processor software; new tools were added to allow for more complicated layouts, to help correct common errors, and to make existing features easier to use. These changes increased what people could do but also brought further incompatibilities. New versions of the software had problems using old documents and documents in the new formats wouldn&#8217;t work at all with the old software.</p>
<p>Virtual paper began to age. People who had used these virtual sheets as a way to archive documents found that new word processors would corrupt the formatting in old documents and refuse entirely to open some of them. People who wanted to switch from one word processor camp to another had it worst, often having to rely on third party conversion utilities to use their old documents at all.</p>
<h4 id="digital-communication">Digital communication</h4>
<p>So virtual paper aged, people got increasingly tied to one format or another, and the internet happened. Now, rather than exchanging paper documents, people began exchanging the virtual paper versions, making it even harder to know what version of what program your document would be opened with. Once <a href="http://en.wikipedia.org/wiki/.doc">.doc</a> and <a href="http://en.wikipedia.org/wiki/.wpd">.wpd</a>, and later, <a href="http://en.wikipedia.org/wiki/.odt">.odt</a> documents were being sent around, the promise of WYSIWYG stopped meaning much. What you saw might be what you got, but you could no longer know just what the other person was getting.</p>
<p>To regain control of their formatting, people began using <a href="http://en.wikipedia.org/wiki/Pdf">PDF</a>, the virtual printer for our virtual documents. By making a PDF of your document you basically trade the ability to edit your document in the future for an assurance that your document will look and print the same on any other computer. While this is useful for documents where formatting is important (resumes are an often-cited example), it represents a step backwards for word processors.</p>
<p>Word processors are not the only, or even the best, way to generate PDFs. Once PDFs became a standard form of print-ready documents, people built web applications to generate them from any page on a website as well as OS-level PDF printers that let you create a PDF out of any file on your computer. As the tools capable of producing print-ready documents multiplied, the word processor began to lose its place as people&#8217;s primary tool for authoring documents.</p>
<h4 id="simplification">Simplification</h4>
<p>At the same time that PDF was ironing out incompatibilities and creating a nearly universal form of print-ready virtual paper, people started to notice that a lot of their print-ready documents were never getting printed. As people grew more comfortable with digital means of communication they began to rely on them to carry more of the content once invested in paper. People who had once felt the need to attach word processor or PDF documents to their emails began moving the material from those documents into the email itself. Personal correspondence moved not only to email but to chat rooms and instant messaging, places where printing to paper was not a concern.</p>
<p>Free of the complicated formatting necessary for paper, people fell back to the handful of basic formatting options they felt most necessary for communication in a digital context, things like:</p>
<p>
&gt;&gt; quoting and<br />
&gt; later<br />
/italics/<br />
*bold*<br />
[ links | to places ] and</p>
<p>paragraph breaks.</p>
<p>Those five, in addition to the rich complexity of our natural languages, turn out to cover most of what people need for communicating the sense of their messages.</p>
<p>This trend, of replacing elaborate formatting, like virtual paper, with lightly marked up plain text, is shrinking the domain of word processors each day. Wikis, the largest document creation projects in history, all use variations of the basic formatting options shown above. Current social networking and publishing tools are the same, as is email. The rise of syndication formats like <a href="http://en.wikipedia.org/wiki/Rss">RSS</a> and <a href="http://en.wikipedia.org/wiki/ATOM">ATOM</a> is perhaps the best example of people happily removing all page layout formatting to more easily access the plaintext or lightly marked up text underneath.</p>
<p>Which is not to say that people stopped caring about the visual appearance of their documents. Layout and design remain as important as ever, but all of that information simply moved to the side, into <a href="http://en.wikipedia.org/wiki/Cascading_Style_Sheets">CSS</a> sheets, and blog or social networking theme packages.</p>
<p>The prime example of both these trends is the web itself. While many PDF documents, and some few .doc files, remain available online, they are dwarfed by the HTML ones that make up the web around them. The combination of HTML and CSS has beaten out all previous electronic formatting standards and become the most universal way to format writing since paper. Word processors have never created either format well.</p>
<h4 id="today">Today</h4>
<p>Today I neither send nor receive business letters. Almost all of my professional communication happens over email, including things like negotiating for event space, arranging travel, and handling requests for legal services from our office. I write my documents in <a href="http://en.wikipedia.org/wiki/Markdown">markdown</a>, one of the popular lightweight markup systems, and use a program called <a href="http://johnmacfarlane.net/pandoc/">pandoc</a> to convert them to whatever format I need: PDF, HTML, <a href="http://en.wikipedia.org/wiki/LaTeX">LaTeX</a>, or odt. Separate style sheets let me easily format the same text for letterhead, publishing on our website, publishing on my website, or general printing. The things I read come to me as websites, RSS feeds, email messages, and, most recently, ebooks (which are really just simple HTML documents re-packaged).</p>
<p>And so the word processor has passed from my life. I realize that it has not passed from everyone&#8217;s and it may not have left yours yet, but if the trends towards simple markup and PDFs continue forward, odds are that it will.</p>
]]></content:encoded>
			<wfw:commentRss>http://churchkey.org/2008/07/22/the-death-of-the-word-processor/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

