clickstream keyword dimension
2 posters
Page 1 of 1
clickstream keyword dimension
I'm currently trying to determine how to build out a keyword dimension table. We're tracking website visits to our website, and would like to be able to find the most used keywords used to search via a search engine for the site as well as any search terms used during the visit on the site (price > $100, review > 4 stars, etc). Since the keywords are completely dynamic and can be used in an infinite number of combinations, I'm having a hard time trying to determine how to store these keywords. I have a pageview fact table that includes a record every time a page is viewed. The source I'm pulling from includes all the search terms in a delimited list I am able to parse with a regular expression, I just don't know how to store it in the database since the number of keywords can vary so widely from pageview to pageview. I'm thinking this may be more suited for a NOSQL solution that trying to cram it into a MSSQL table, but I don't know. Any help is greatly appreciated!
crosan- Posts : 1
Join date : 2013-03-18
Re: clickstream keyword dimension
I did one in the late 90's using Oracle for a media website. They were capturing search phases used to get to the site. There was only one phrase for the session, not each view.
Anyway, a phrase represents a collection of words. So, each unique phrase was captured (phrase dimension) and each word in the phrase was parsed and accumulated (keyword dimension, one row per unique word). We maintained a bridge table that contained the phrase key and the keyword key. There was a row for each word found in the phrase.
The keyword dimension contained the actual word found and a 'corrected' word. The corrected word would be the same for various variations and misspellings of a word. At the time, it was maintained manually as issues arose.
The fact would contain a FK to the phrase. The phrase/keyword bridge would be used to locate transactions using specific words. You can do some fairly elaborate selects using moderately complex SQL... "pages that contain 'A' and 'B' but not 'C'" for example.
Performance was very good <60 sec, for complex queries against a 4 billion row fact table. A decent MPP system today would rip through it in a second or two.
Anyway, a phrase represents a collection of words. So, each unique phrase was captured (phrase dimension) and each word in the phrase was parsed and accumulated (keyword dimension, one row per unique word). We maintained a bridge table that contained the phrase key and the keyword key. There was a row for each word found in the phrase.
The keyword dimension contained the actual word found and a 'corrected' word. The corrected word would be the same for various variations and misspellings of a word. At the time, it was maintained manually as issues arose.
The fact would contain a FK to the phrase. The phrase/keyword bridge would be used to locate transactions using specific words. You can do some fairly elaborate selects using moderately complex SQL... "pages that contain 'A' and 'B' but not 'C'" for example.
Performance was very good <60 sec, for complex queries against a 4 billion row fact table. A decent MPP system today would rip through it in a second or two.
Similar topics
» Keyword dimension question
» Question about keyword outrigger
» clickstream fact data coming in with different levels of dimensional geography data
» Replacing Aggregate dimension with Conformed base dimension at lowest level.
» How to handle a Type I or II dimension with a snowflaked customer sub dimension (kimball book page 337, 338)
» Question about keyword outrigger
» clickstream fact data coming in with different levels of dimensional geography data
» Replacing Aggregate dimension with Conformed base dimension at lowest level.
» How to handle a Type I or II dimension with a snowflaked customer sub dimension (kimball book page 337, 338)
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum