pipe_ooo: a simple text extractor for OpenOffice.org documents



This is a simple shell script I use to get the text of OpenOffice.org documents. I probably should have used a real XML parser, but so far it works reasonably well.

The script

pipe_ooo: reads the document from standard input. It should be trivial to make it accept a file name as $1 as well.

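#!/bin/sh

# unzip needs a real file, so save standard input to a temporary one
tmpfile=`mktemp` || exit 1

cat >"$tmpfile"

# print content.xml, strip the XML tags, then decode &quot; entities
unzip -p "$tmpfile" content.xml \
|  sed -e 's|<[^>]*>||g' \
| sed -e 's|&quot;|"|g'

rm "$tmpfile"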
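
As noted above, handling $1 should be a small change. The following is only a sketch of one way it might look, not a tested replacement for the script: the function names strip_tags and pipe_ooo are made up for illustration, and it assumes mktemp, unzip and sed behave as in the original.

```shell
#!/bin/sh
# Sketch only: strip_tags and pipe_ooo are illustrative names,
# not part of the original script.

# Remove XML tags and decode &quot; in the XML read from standard input.
strip_tags() {
    sed -e 's|<[^>]*>||g' -e 's|&quot;|"|g'
}

# Extract the text of the file named in $1, or of the document on
# standard input when no argument is given.
pipe_ooo() {
    if [ $# -ge 1 ]; then
        unzip -p "$1" content.xml | strip_tags
    else
        # unzip cannot read an archive from a pipe, so buffer it first
        tmpfile=`mktemp` || return 1
        cat >"$tmpfile"
        unzip -p "$tmpfile" content.xml | strip_tags
        rm -f "$tmpfile"
    fi
}
```

To use this as a standalone script, one could add a final line reading pipe_ooo "$@" and then call it either with a file name or with the document on standard input. Like the original, it only decodes the &quot; entity; other entities such as &amp; are left as they are.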