Kimball Forum

Designing a data warehouse to store academic publications for natural language processing



Post by kmore on Thu Oct 13, 2011 3:41 pm

I'm embarking on a relatively complicated database design initiative focused on aggregating academic publications. These can include journal articles, books, etc., and the amount of data will (eventually) be quite large. From the end-user's perspective, the goal is to "find similar publications or sections therein." For example, a user might start with a few key terms and want to find up to 100 publications that contain them.

After reading 10 out of the corpus of 100 publications, the user might encounter 1 that she likes and 9 that she dislikes. Through a couple of machine learning techniques, I'm able to extrapolate from those classifications a search query that, given the 100-publication corpus, will yield the publication she likes and exclude the 9 she doesn't. Running this query against the database, excluding the documents she has already considered, gives the user another batch of documents more closely suited to her needs; this iterative refinement reveals the inherently genetic nature of the project.
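For concreteness, here is a minimal sketch of one way such an extrapolation could work; I'm not claiming this is the actual technique, and the function name, the smoothed log-odds scoring, and the sample documents are all illustrative assumptions:

```python
from collections import Counter
import math

def extrapolate_query(liked_docs, disliked_docs, top_n=10):
    """Rank terms by a smoothed log-odds ratio between liked and
    disliked documents; the top terms become the next search query."""
    liked = Counter(w for doc in liked_docs for w in doc.lower().split())
    disliked = Counter(w for doc in disliked_docs for w in doc.lower().split())
    liked_total = sum(liked.values())
    disliked_total = sum(disliked.values())

    def log_odds(word):
        # Add-one smoothing keeps unseen words from zeroing out the ratio.
        p_liked = (liked[word] + 1) / (liked_total + 1)
        p_disliked = (disliked[word] + 1) / (disliked_total + 1)
        return math.log(p_liked / p_disliked)

    vocabulary = set(liked) | set(disliked)
    return sorted(vocabulary, key=log_odds, reverse=True)[:top_n]

# One liked document and some disliked ones, as in the scenario above.
liked_docs = ["dimensional modeling for large text corpora"]
disliked_docs = ["weekend recipe ideas", "local sports roundup"]
print(extrapolate_query(liked_docs, disliked_docs, top_n=3))
```

Terms concentrated in the liked document score high and become the next query; terms concentrated in the disliked documents score low and drop out.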

In thinking about the initial database model, I've concluded that word-level granularity is most appropriate. The fact table could have foreign keys to the publication dimension, dictionary dimension, and entity dimension. The fact table could "measure" whether the word is a stopword (i.e., the word offers no semantic meaning in context) and the word's position in the document hierarchy. Because I don't believe there to be too many unique "hierarchy" combinations, it might be prudent to implement a one-to-many relationship between a bridge table and the fact table, so that many fact rows share one bridge row. The bridge table would have attributes sufficient to unambiguously identify a word's position within the document (e.g., its section number, paragraph number, and sentence number; the labels "SECTION," "PARAGRAPH," and "SENTENCE" themselves are irrelevant).
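To make that grain concrete, here is a minimal sketch of the fact and bridge tables as SQLite DDL executed from Python; every table and column name is my own illustrative assumption rather than a fixed design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE position_bridge (           -- one row per unique position combination
    position_key    INTEGER PRIMARY KEY,
    section_num     INTEGER NOT NULL,
    paragraph_num   INTEGER NOT NULL,
    sentence_num    INTEGER NOT NULL,
    UNIQUE (section_num, paragraph_num, sentence_num)
);

CREATE TABLE word_fact (                 -- one row per word occurrence
    publication_key INTEGER NOT NULL,    -- FK to the publication dimension
    dictionary_key  INTEGER NOT NULL,    -- FK to the dictionary dimension
    entity_key      INTEGER,             -- FK to the entity dimension (NULL if none)
    position_key    INTEGER NOT NULL REFERENCES position_bridge,
    word_offset     INTEGER NOT NULL,    -- ordinal position within the sentence
    is_stopword     INTEGER NOT NULL     -- the 'measure': 1 if a stopword, else 0
);
""")
```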

The dictionary dimension contains every word seen so far across all documents, probably on the order of 300,000 records. The entity dimension maintains a list of "named entities" within publications, such as "Lake Erie" or "President Obama." To avoid altering the publication's content in the data warehouse, "Lake" and "Erie" would exist as separate records, with each pointing to the same named entity. Likewise, "President Obama," "Mr. Obama," etc. would point to the same "President Barack Obama" named entity. I realize that this generates a lot of snowflaking, but named entities are sparse relative to the rest of the corpus, so any performance impairment is likely negligible. Moreover, entity resolution will prove extremely useful when dealing with pronouns. Additionally, the dictionary dimension includes attributes for all of the parts of speech seen so far. The combination of a word, its sense, and its part of speech is guaranteed to be unique within the table and may therefore act as the row's primary key.
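Continuing the sketch, the two dimensions might look like the following; again, the names and column choices are assumptions, with the UNIQUE constraint capturing the word/sense/part-of-speech natural key described above and a conventional surrogate key for fact-table joins:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity_dim (
    entity_key     INTEGER PRIMARY KEY,
    canonical_name TEXT NOT NULL          -- e.g., 'President Barack Obama'
);

CREATE TABLE dictionary_dim (
    dictionary_key INTEGER PRIMARY KEY,   -- surrogate key for fact-table joins
    word           TEXT NOT NULL,         -- surface form, e.g., 'Erie'
    sense          TEXT NOT NULL,         -- word sense
    part_of_speech TEXT NOT NULL,
    UNIQUE (word, sense, part_of_speech)  -- the natural key described above
);
""")
```

Under this layout, "Lake" and "Erie" remain distinct dictionary rows while their fact rows carry the same entity_key, so the original text is preserved and the named entity is still recoverable.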

By way of that introduction, I'm just looking for a little guidance concerning the hierarchy of the publications dimension, and have the following questions:

  1. What's the best way to organize the publications dimension? The problem I have is that there are many possible publication types, and incorporating attributes for each of them will result in many NULL attributes for things like "patent number" when dealing with journal articles. It might be better to create fact tables for each publication type, and then create a publications view that joins all of the publication-specific fact tables and publication-specific dimensions together into one massive view (see the sketch after this list).

  2. Is word-level granularity appropriate? The statistics I'd want to compute require knowing the location of words within sentences; from that metric fall the relationships between documents and concepts. Moreover, it allows me to perform initial searches on document titles, abstracts, or other document-specific sections. In other words, it gives me much more control over selecting the data that will eventually be considered. The downside is that literally every word in a document gets its own record. Considering that many journal articles contain thousands of words, millions of documents very quickly generate billions of records. As long as the average number of words per document is constant, the number of records remains linear in the number of documents, so I'm not sure this is a big deal.
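Regarding question 1, here is a minimal sketch of the view-over-type-specific-tables idea, assuming two publication types; the table, column, and view names are illustrative only. Type-specific attributes live only in their own tables, so a journal article never carries a NULL patent number:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE journal_article_dim (
    publication_key INTEGER PRIMARY KEY,
    title           TEXT NOT NULL,
    journal_name    TEXT,
    volume          INTEGER
);

CREATE TABLE patent_dim (
    publication_key INTEGER PRIMARY KEY,
    title           TEXT NOT NULL,
    patent_number   TEXT
);

-- A conformed view over the attributes every type shares; searches that
-- need only common attributes hit the view, while type-specific queries
-- go straight to the underlying table.
CREATE VIEW publication_dim AS
    SELECT publication_key, title, 'journal article' AS pub_type
      FROM journal_article_dim
    UNION ALL
    SELECT publication_key, title, 'patent' AS pub_type
      FROM patent_dim;
""")
```

One caveat with this pattern: the surrogate keys would need to be assigned from a single sequence so that publication_key stays unique across all publication types.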


Thanks for your time!

kmore

Posts : 1
Join date : 2011-10-13
