The source of this document is available on GitLab.
Last version: 2020-11-25

Finding one's way with tags and desktop search applications

Leibniz quote

I found the introductory quote on the http://www.backwordsindexing.com/index.html website. Leibniz was a librarian for a fair part of his life, which partly explains his concern for classification, indexing, etc.

Searching with a text editor

The corresponding slide is a reminder of something viewers already know, and which people switching from paper to digital note-taking perceive as a huge improvement.

Unix/Linux users also know the grep command-line utility for searching plain-text data sets for lines that match a regular expression in one or several files; we will come back to it.

Search with a hand-made index in a notebook

Again something we already discussed (in sequence 2).

Search with a "materialized" index

A reminder.

Towards the "sophisticated" tools of computers

Desktop search engines

Desktop search engines such as DocFetcher, Tracker and Recoll (the three named later in this document) let us search the content of text files, emails, files generated by word processors (i.e. files that essentially contain text but are stored in a standard format such as doc, docx or odt, which are not text formats) and pdf files (when they are not images of text), as well as the metadata of pdf and other files.

Desktop search engines use indexing techniques that significantly reduce search times compared to the search functions built into operating systems by default. Unlike the latter, they also often support metadata and are able to parse the files.
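To make the idea of indexing concrete, here is a minimal sketch of an inverted index, i.e. a table mapping each word to the files that contain it. It is only an illustration: the file names and contents in DOCUMENTS are made up, and real desktop search engines use far more elaborate structures.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "files"; a real engine would read them from disk.
DOCUMENTS = {
    "sequence1.org": "Note cabinets from Placcius and Leibniz",
    "sequence2.org": "These problems are solved by Placcius' cabinet",
    "sequence3.org": "Desktop search engines use an index",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each word to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, word):
    """Look the word up in the index instead of rescanning every file."""
    return sorted(index.get(word.lower(), set()))


# The index is built once; each query is then a single dictionary lookup.
INDEX = build_index(DOCUMENTS)
print(search(INDEX, "Placcius"))
```

The cost of reading every file is paid once, when the index is built; a tool that rescans the files pays it again at every query.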

As an example of "integrated default search functions", Unix/Linux systems provide the program grep, with which we can search for occurrences of the word "Placcius" in the "module1/ressources" directory of our repository mooc-rr-ressources (after cloning it):

grep -r Placcius
sequence1.org:- [[#note-cabinets-from-placcius-and-leibniz][Note cabinets from Placcius and Leibniz]]
sequence1.org:* Note cabinets from Placcius and Leibniz
sequence2_fr.org:Nous revenons sur le « bout de papier » ou la fiche comme support de note. L'inconvénient est que le bout de papier ou la fiche se perdent facilement et ne servent à rien s'ils ne sont pas *classés* en plus d'être rangés. Problème résolu par l'armoire de Placcius. D'une certaine façon, sa conception fait qu'on accède à son contenu par l'index.
sequence2.org:We see (again) Placcius' and Leibniz's closet since it displays both the benefits and the shortcomings of media that hold *a single note*.
sequence2.org:These problems are solved by Placcius' cabinet, the content of which is fundamentally accessed through the index.
sequence5_fr.org:- les notes manuscrites sur fiches sont généralement stockées dans un meuble dont la structure matérialise un index — comme l'armoire de Placcius et Leibniz — ;
sequence5_fr.org:: PITCHME.md:Remarquez l'avantage des « bouts de papiers classés » de Placcius et Leibniz sur le _codex_ de Galilée : les premiers peuvent être facilement réordonnées.
sequence1_fr.org:- [[#armoires-à-notes-de-placcius-et-leibniz][Armoires à notes de Placcius et Leibniz]]
sequence1_fr.org:* Armoires à notes de Placcius et Leibniz
#sequence5_fr.org#:- les notes manuscrites sur fiches sont généralement stockées dans un meuble dont la structure matérialise un index — comme l'armoire de Placcius et Leibniz — ;
#sequence5_fr.org#:module1/ressources/sequence5_fr.org:: PITCHME.md:Remarquez l'avantage des « bouts de papiers classés » de Placcius et Leibniz sur le _codex_ de Galilée : les premiers peuvent être facilement réordonnées.
#sequence5_fr.org#:module1/slides/misc/Notes_module1.org:: PITCHME.md:Remarquez l'avantage des « bouts de papiers classés » de Placcius et Leibniz sur le _codex_ de Galilée : les premiers peuvent être facilement réordonnées.
#sequence5_fr.org#:module1/slides/misc/PITCHME.md:Remarquez l'avantage des « bouts de papiers classés » de Placcius et Leibniz sur le _codex_ de Galilée : les premiers peuvent être facilement réordonnées.

Why labels/tags?

A query based on a single word often returns a very large number of results, even though most desktop search engines allow you to filter them. An effective way to limit their number is to include labels in our documents, i.e. labelled anchor points that the desktop search engine indexes easily and whose text does not correspond to any word or phrase in the dictionary. This is a simplified version of the work of the indexer, the person responsible for building a book index. To keep the label meaningful, simply frame a word with a pair of punctuation marks such as ":", ";" or "?". A label such as ":code:" is easy to memorize and is a perfect equivalent of the keyword "code" used, to illustrate Locke's method, in the example notebook in the second sequence of this module.

We still have one technical detail to resolve for notes taken in a text format such as Markdown: we do not want our labels to appear in the html, pdf or docx outputs of our notes. For lightweight markup languages that do not have labels (Markdown does not have them, whereas org does), one way to do this is to put the labels in comments. In Markdown, everything framed by <!-- and --> is considered a comment and is not included in the html or pdf output of the notes. This allows us to use:

<!-- ;code; -->

in our notes at a location we would like to find when we are looking for material on programming.
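The following sketch shows why such a label narrows a search. It assumes the convention used above (a word framed by ";" inside a Markdown comment); the note text, the regular expression and the function names are invented for the illustration.

```python
import re

# Assumed convention: a label is a word framed by ';' inside an HTML comment.
LABEL_RE = re.compile(r"<!--\s*;(\w+);\s*-->")

# Hypothetical Markdown note containing one label.
NOTE = """# Notes on tools

I should learn to code my analyses properly.

<!-- ;code; -->
Here is the snippet I use to load the data.
"""


def find_labels(text):
    """Return (line number, label) pairs for each labelled comment."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in LABEL_RE.finditer(line):
            hits.append((number, match.group(1)))
    return hits


def find_word(text, word):
    """Return the numbers of the lines that contain the given string."""
    return [n for n, line in enumerate(text.splitlines(), start=1) if word in line]


print(find_labels(NOTE))        # labelled anchor points only
print(find_word(NOTE, "code"))  # every line mentioning the plain word
```

Searching for the plain word "code" also matches the sentence on line 3, whereas searching for the label ";code;" only matches the anchor point on line 5.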

Metadata

Image files

We now know how to add labels to a text file, but we often also have to work with files containing images or photos, such as JPEG files (digital cameras all use this format), GIF or PNG. The question then arises: can we add labels to our image files so that our desktop search engines index them? The answer is yes, thanks to the metadata that these files contain. Metadata, in this case, is data stored in the file but not shown by the rendering software (at least not by default). We all know that this metadata "exists"; it contains the date, GPS location, exposure time, etc. of our digital photos. In JPEG files, it is stored according to the exchangeable image file format (EXIF).

Most image and photo manipulation software allows access to and modification of metadata content. The example illustrated in the course uses a very simple "command line" solution, ExifTool, which allows you to view and modify metadata. Other software such as exiv2 or ImageMagick can do this too (to name only free software available on Linux, Windows and macOS).

Some of the elements of the EXIF format are strings, i.e. text, that we are free to use as we wish; we can therefore use them to add our labels. We illustrate in the course how to do it with ExifTool, but we could also have done it with ImageMagick's program mogrify. All the desktop search engines we mentioned will "look" at the metadata of the JPEG files during the indexing phase and thus allow us to use the labels we have inserted.
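As an illustration, ExifTool writes a metadata element with arguments of the form -TAG=VALUE followed by the file name. The sketch below only builds such a command and runs it if ExifTool is installed. The file name photo.jpg is hypothetical, and the choice of the EXIF UserComment element is an assumption made for this example: it is not necessarily the element used in the course.

```python
import shutil
import subprocess


def exiftool_tag_command(path, label, tag="UserComment"):
    """Build the argument list for `exiftool -TAG=VALUE FILE`."""
    return ["exiftool", f"-{tag}={label}", path]


def write_label(path, label, tag="UserComment"):
    """Run ExifTool if it is installed; return True only if the call succeeded."""
    if shutil.which("exiftool") is None:
        return False  # ExifTool is not available on this machine
    result = subprocess.run(exiftool_tag_command(path, label, tag), check=False)
    return result.returncode == 0


# Only show the command; call write_label() to actually modify a file.
print(exiftool_tag_command("photo.jpg", ";code;"))
```

Passing tag="Keywords" builds the corresponding command for another metadata element. By default ExifTool keeps a backup copy of the original file next to the modified one.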

EXIF is not the only existing metadata format; a more recent one is the Extensible Metadata Platform (XMP), available for a larger number of file formats. DocFetcher does not currently read XMP from JPEG files, which is why we have highlighted the EXIF format, although this should evolve quite quickly; other engines like Tracker and Recoll do read it.

PDF files

In addition to image files, we very frequently work with PDF files, which are "composite" files containing text, images and more. These files also contain metadata, and it was for them that Adobe initially introduced the XMP format that we just discussed. This metadata can be read and modified, in particular the Keywords element, which can contain arbitrarily long character strings and is perfect for hosting our labels. The program ExifTool allows you to modify the metadata of PDF files. The desktop search engines we mentioned above all read the metadata of PDF files during the indexing phase.

Audio files

Audio formats like mp3 and ogg also contain metadata in which the song titles, artist names, etc. are stored; we can set this metadata ourselves, and the desktop search engines read it during indexing.