clickstream keyword dimension

crosan Mon Mar 18, 2013 5:07 pm

I'm currently trying to determine how to build out a keyword dimension table. We're tracking website visits to our website, and would like to be able to find the most used keywords used to search via a search engine for the site as well as any search terms used during the visit on the site (price > $100, review > 4 stars, etc). Since the keywords are completely dynamic and can be used in an infinite number of combinations, I'm having a hard time trying to determine how to store these keywords. I have a pageview fact table that includes a record every time a page is viewed. The source I'm pulling from includes all the search terms in a delimited list I am able to parse with a regular expression, I just don't know how to store it in the database since the number of keywords can vary so widely from pageview to pageview. I'm thinking this may be more suited for a NOSQL solution that trying to cram it into a MSSQL table, but I don't know. Any help is greatly appreciated!

ngalemmo Mon Mar 18, 2013 6:47 pm

I did one in the late 90's using Oracle for a media website. They were capturing search phases used to get to the site. There was only one phrase for the session, not each view.

Anyway, a phrase represents a collection of words. So, each unique phrase was captured (phrase dimension) and each word in the phrase was parsed and accumulated (keyword dimension, one row per unique word). We maintained a bridge table that contained the phrase key and the keyword key. There was a row for each word found in the phrase.

The keyword dimension contained the actual word found and a 'corrected' word. The corrected word would be the same for various variations and misspellings of a word. At the time, it was maintained manually as issues arose.

The fact would contain a FK to the phrase. The phrase/keyword bridge would be used to locate transactions using specific words. You can do some fairly elaborate selects using moderately complex SQL... "pages that contain 'A' and 'B' but not 'C'" for example.

Performance was very good <60 sec, for complex queries against a 4 billion row fact table. A decent MPP system today would rip through it in a second or two.

clickstream keyword dimension

clickstream keyword dimension

Re: clickstream keyword dimension