Importing existing datasets

The project is bringing together data on books, borrowers and borrowings that have been transcribed by previous projects, with the aim of making these datasets cross-searchable.  Our project database consists of some 23 interrelated tables and we have a Content Management System through which the project team can add and edit records for libraries, ledgers, pages, borrowing records, borrowers and books.

We have datasets from several libraries, each of which has been compiled differently and therefore needs to be processed differently.  I have already written scripts to import two datasets: the student records from Glasgow University library and the records from Innerpeffray library.  The former gave us 8,188 borrowing records across three ledgers and 403 pages while the latter gave us 6,550 borrowing records over 340 pages of one ledger.

Over the past week I imported data from a further two libraries: St Andrews University and Selkirk, and in this post I’m going to discuss how this process was handled, focussing mainly on St Andrews.  The data from St Andrews are CSV and HTML representations of transcribed pages that come from an existing website with pages that look like this:  The links in the pages (e.g. Locks Works) lead through to further pages with information about books or borrowers.  Unfortunately the CSV version of the data doesn’t include the links or the linked to data, and as I wanted to try and pull in the data found on the linked pages I therefore needed to process the HTML instead.

I wrote a script that pulled in all of the HTML files and processed each in turn.  From the filenames my script could ascertain the ledger volume, its dates and the page number.  For example ‘Page_UYLY205_2_Receipt_Book_1748-1753.djvu_10.html’ is ledger 2 (1748-1753) page 10.  The script creates ledgers and pages, and adds in the ‘next’ and ‘previous’ page links to join all the pages in a ledger together.

The actual data in the file posed further problems.  As you can see from the linked page above, dates are just too messy to automatically extract into our strongly structured borrowed and returned date system.  Often a record is split over multiple rows as well (e.g. the borrowing record for ‘Rollins belles Lettres’ is actually split over 3 rows).  I could have just grabbed each row and inserted it as a separate borrowing record, which would then need to be manually merged, but I figured out a way to do this automatically.  The first row of a record always appears to have a code (the shelf number) in the second column (e.g. J.5.2 for ‘Rollins’) whereas subsequent rows that appear to belong to the same record don’t (e.g. ‘on profr Shaws order by’ and ‘James Key’).  I therefore set up my script to insert new borrowing records for rows that have codes, and to append any subsequent rows that don’t have codes to this record until a row with a code is reached again.

I also used this approach to set up books and borrowers.  If you look at the page linked to above again you’ll see that the links through to things are not categorised – some are links to books and others to borrowers, with no obvious way to ascertain which is which.  However, it’s pretty much always the case that it’s a book that appears in the row with the code and it’s people that are linked to in the other rows.  I could therefore create or link to existing book holding records for links in the row with a code and create or link to existing borrower records for links in rows without a code.  There are bound to be situations where this system doesn’t quite work correctly, but I think the majority of rows do fit this pattern.

The next thing I needed to do was to figure out which data from the St Andrews files should be stored as what in our system.  I created four new ‘Additional Fields’ for St Andrews as follows:


  • Original Borrowed date: This contains the full text of the first column (e.g. Decr 16)
  • Code: This contains the full text of the second column (e.g. J.5.2)
  • Original Returned date: This contains the full text of the fourth column (e.g. Jan. 5)
  • Original returned text: This contains the full date of the fifth column (e.g. ‘Rollins belles Lettres V. 2d’)

In the borrowing table the ‘transcription’ field is set to contain the full text of the ‘borrowed’ column, but without links.  Where subsequent rows contain data in this column but no code, this data is then appended to the transcription.  E.g. the complete transcription for the third item on the page linked to above is ‘Rollins belles Lettres Vol 2<sup>d</sup> on profr Shaws order by James Key’.

The contents of all pages linked to in the transcriptions are added to the ‘editors notes’ field for future use if required.  Both the page URL and the page content are included, separated by a bar (|) and if there are multiple links these are separated by five dashes.  E.g. for the above the notes field contains:

‘Rollins_belles_Lettres| <p>Possibly: De la maniere d’enseigner et d’etuder les belles-lettres, Par raport à l’esprit &amp; au coeur, by Charles Rollin. (A Amsterdam : Chez Pierre Mortier, M. DCC. XLV. [1745]) <a href=”″></a></p>

—– profr_Shaws| <p><a href=”″></a></p>

—– James_Key| <p>Possibly James Kay: <a href=”″></a></p>


As mentioned earlier, the script also generates book and borrower records based on the linked pages too.  I’ve chosen to set up book holding (specific to the library) rather than book edition (project-wide) records as the details are all very vague and specific to St Andrews.  In the holdings table I’ve set the ‘standardised title’ to be the page link with underscores replaced with dashes (e.g. ‘Rollins belles Lettres’) and the page content is stored in the ‘editors notes’ field.  One book item is created for each holding to be used to link to the corresponding borrowing records.

For borrowers a similar process is followed, with the link added to the surname column (e.g. Thos Duncan) and the page content added to the ‘editors notes’ field (e.g. <p>Possibly Thomas Duncan: <a href=”″></a></p>’).  All borrowers are linked to records as ‘Main’ borrowers.

During the processing I noticed that the first and fourth ledgers had a slightly different structure to the others, with entire pages devoted to a particular borrower, whose name then appeared in a heading row in the table (see for example  I therefore updated my script to check for the existence of this heading row, and if it exists my script then grabs the borrower name, creates the borrower record if it doesn’t already exist and then links this borrower to every borrowing item found on the page.  After my script had finished running we had 11,147 borrowing records, 996 borrowers and 6,395 book holding records for St Andrew in the system.

I then moved onto looking at the data for Selkirk library.  This data was more nicely structured than the St Andrews data, with separate spreadsheets for borrowings, borrowers and books and borrowers and books connected to borrowings via unique identifiers.  Unfortunately the dates were still transcribed as they were written rather than being normalised in any way, which meant it was not possible to straightforwardly generate structured dates for the records and these will need to be manually generated.  The script I wrote to import the data took about a day to write, and after running it we had a further 11,431 borrowing records across two registers and 415 pages entered into our database.

As with St Andrews, I created book records as Holding records only (i.e. associated specifically with the library rather than being project-wide ‘Edition’ records.  There are 612 Holding records for Selkirk.  I also processed the borrower records, resulting in 86 borrower records being added.  I added the dates as originally transcribed to an additional field named ‘Original Borrowed Date’ and the only other additional field is in the Holding records for ‘Subject’, that will eventually be merged with our ‘Genre’ when this feature becomes available.

We now have data from four libraries in our database, and I have access to a further two datasets (for Haddington and Leighton) which I will import next week.  Once all of the existing data is entered into the system the project team will then need to devote quite a bit of time to cleaning the data – fixing dates, removing any erroneous data, amalgamating any duplicate borrowers or books, creating project-wide book edition and book work records and other such tasks that will eventually enable the process of searching for books, borrowers and dates across the various libraries the project will cover.