Preview of documents

How to set up image previews of Word and OpenOffice documents.


If you upload a 100-page document, we don’t have any way of creating a preview of just the first page. We need to convert the entire document (either to OpenOffice format or PDF) before we can grab the preview. Maybe we should keep the converted document around?

abiword method

1> abiword --to=document.pdf document.doc
2> gm convert -geometry 512x512 document.pdf 'document%02d.jpg[0-2]'

This will create document01.jpg, document02.jpg, and document03.jpg.

  • line 1 runs abiword on the command line to convert the .doc file (or any file that abiword can read) into a PDF.
  • line 2 uses GraphicsMagick to convert the first three pages of the PDF into JPEG images with a height and width no greater than 512 (while respecting the document’s aspect ratio).

If you don’t want multiple pages previewed, then change the output document to simply be document.jpg[0].

You can also generate png files with the gm command, but they don’t look any better than the .jpg and are about three times the file size.
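The two commands above can be wrapped in a small script. The following is only a sketch in Python: the function names `preview_commands` and `make_previews` are made up for illustration, and the command lines are copied from the abiword and gm steps above rather than checked against every version of either tool.

```python
import subprocess
from pathlib import Path


def preview_commands(source, pages=3, size=512):
    """Build the abiword and gm command lines without running them."""
    source = Path(source)
    pdf = source.with_suffix(".pdf")
    stem = source.with_suffix("")
    if pages == 1:
        # Single preview image, as in document.jpg[0]
        target = f"{stem}.jpg[0]"
    else:
        # One image per page, as in document%02d.jpg[0-2] for three pages
        target = f"{stem}%02d.jpg[0-{pages - 1}]"
    return [
        ["abiword", f"--to={pdf}", str(source)],
        ["gm", "convert", "-geometry", f"{size}x{size}", str(pdf), target],
    ]


def make_previews(source, pages=3, size=512):
    """Run both commands in order; raises CalledProcessError if one fails."""
    for command in preview_commands(source, pages, size):
        subprocess.run(command, check=True)
```

Calling `make_previews("document.doc")` would run the same two steps as above, and passing `pages=1` switches to the single-image form. Passing the commands as lists avoids shell quoting problems with the `%02d` and `[0-2]` parts.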

This is also supposed to work, but kept crashing for me:

shell> abiword --plugin AbiCommand 
abiword> previewpng 'test.doc' test.png 512 512

gm convert -geometry 512x512 -density 60 t.pdf[0] t.jpg

This variant puts the [0] on the input file, so it might only load the first page of the PDF.

openoffice method

It might be nice to automatically convert any document uploaded in a closed format into an open format. You could still download the original, but other people would also have access to the open document file.

Some reasons why this might be a good idea:

  • Makes full text searching easier. Just parse the text from the XML file.
  • Makes generating a preview easier. Just extract the preview PNG from the OpenOffice file (which is just a zip file with a bunch of XML files in it).
  • Open file formats are better than proprietary ones :)
  • With a single method, we could support anything that OpenOffice can open (Excel, PowerPoint, etc.)
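As a rough sketch of the full text point, the snippet below pulls paragraph and heading text out of an OpenDocument file with Python's standard library. It assumes the text lives in `content.xml` under the standard OpenDocument text namespace; the function name `extract_text` is hypothetical, and nested paragraphs (for example inside text boxes) or special whitespace elements are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OpenDocument for text:p and text:h elements
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_text(odf_path):
    """Return the text of each paragraph and heading in an OpenDocument file."""
    with zipfile.ZipFile(odf_path) as package:
        root = ET.fromstring(package.read("content.xml"))
    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    lines = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also collects text inside spans and links
            lines.append("".join(element.itertext()))
    return lines
```

The returned list could be joined and handed to whatever search indexer is in use.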

Why it might be a bad idea:

  • Lots of extra storage
  • The preview PNG for OpenOffice is kind of small

example code to extract the preview png:
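A minimal sketch in Python, assuming the document is an OpenDocument file whose embedded preview sits at `Thumbnails/thumbnail.png` inside the zip package. The function name `extract_preview` is made up for illustration, and files saved without a thumbnail simply return False.

```python
import zipfile

# Location of the embedded preview inside an OpenDocument package
THUMBNAIL = "Thumbnails/thumbnail.png"


def extract_preview(odf_path, png_path):
    """Copy the embedded thumbnail out of an OpenDocument file.

    Returns True if a thumbnail was written, False if the package has none.
    """
    with zipfile.ZipFile(odf_path) as package:
        if THUMBNAIL not in package.namelist():
            return False
        data = package.read(THUMBNAIL)
    with open(png_path, "wb") as out:
        out.write(data)
    return True
```

For example, `extract_preview("document.odt", "document.png")` would write the preview next to the original, with no conversion step needed.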

batch converting to ooffice on the command line:
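One possible approach, sketched in Python, is to drive the office binary in headless mode. This assumes an `soffice` executable on the PATH that accepts `--headless`, `--convert-to`, and `--outdir` (LibreOffice does; older OpenOffice builds may not). The helper names `convert_command` and `convert_directory` are hypothetical.

```python
import subprocess
from pathlib import Path


def convert_command(sources, outdir, target_format="odt"):
    """Build a headless conversion command for a batch of files."""
    return [
        "soffice", "--headless",
        "--convert-to", target_format,
        "--outdir", str(outdir),
        *[str(s) for s in sources],
    ]


def convert_directory(indir, outdir, pattern="*.doc"):
    """Convert every file in indir matching pattern; returns the command run."""
    sources = sorted(Path(indir).glob(pattern))
    if not sources:
        # Nothing to convert, so do not launch the office binary at all
        return None
    command = convert_command(sources, outdir)
    subprocess.run(command, check=True)
    return command
```

Passing all files to a single invocation avoids starting the office process once per document, which matters for large batches.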

ooffice api:

parsing ooffice xml files in ruby: