Matching Big Query Data with Google Analytics - Audience Overview

Whether your Google analytics is sampling too much or you are trying to blend GA data with offline data sets there are a few reasons why you would want to recreate reports from Google Analytics with BigQuery. Below is a list on how to get each metric separately and skip to the bottom if you want on query with them all together. I won’t go through SQL syntax, there are plenty of places that do that.

Below we will be replicating:

Screen Shot 2020-07-22 at 3.31.10 pm.png

I’m using the free Google Analytics data set for the example, that is why it is set to 2017…ahh 2017 you seem so young and innocent now.

Users

Google BigQuery uses fullvisitorID to identify users. These are linked to cookies, so clear your cookies and become a new user. So to get the number of users we need to count the distinct fullvisitorids. BE CAREFUL, if you are using dates, if you count users per day, you can’t sum them up to find the total number of users over a week, a month, a year. In this example, we are just getting the total so we don’t need to worry about that.

SELECT
    COUNT(DISTINCT fullvisitorId) AS users
FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
  
  

New Users

New users are just distinguished by their visit number if visitNumber is 1 then this is their first visit, i.e. they are a new user. So we will only count the user that has their visitNumber = 1.


SELECT
    COUNT(DISTINCT
      CASE
        WHEN visitNumber = 1 THEN fullvisitorId
    END
      ) AS new_users
FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
    
    

Sessions

Given that the google analytics table is based on a session it is strange there is no one unique identifier to count the sessions that would match up with Google Analytics. There is one way to do this, which is to add up all the totals.visits but this causes problems if we start to explore page and event data, so it is recommended to use the following method.


 SELECT
    COUNT(DISTINCT CONCAT(fullvisitorId, CAST(visitId AS string), date)) AS sessions,
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
    UNNEST (hits) hits
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
    AND totals.visits = 1

By adding together the fullvisitorid (the user), visitID (unique id for that user on when they came to the site, though not unique across users) and the date (as Google Analytics likes to make new sessions at midnight) we can get a unique id for each session. Some of these sessions are non-interactive, e.g. sending a payment from an offline source. This is why you need totals.visits =1 to make sure it was an interactive session.

Pageviews

There are a few ways to get the number of page views as well, you could use totals.pagviews and sum that up but again this wouldn’t work when you start exploring pages and events in more details. The following will work even if you unnest other parts of the table.


SELECT
	COUNT(DISTINCT CASE WHEN hits.type = "PAGE" THEN CONCAT(fullvisitorID, cast(visitId as STRING), date, hits.hitNumber) END) as pageviews
FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
    UNNEST (hits) hits
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
    AND totals.visits = 1
    
    

Basically, counting each session and the number of pageviews in the session and adding it all together.


Bounces and Bounce Rate

To do this the first thing to do is get the total number of bounced sessions, and the total number of all sessions and then divide one by the other:


SELECT
  round(bounces/sessions*100,2) AS bounceRate
FROM (
  SELECT
    COUNT(DISTINCT CONCAT(fullvisitorId, CAST(visitId AS string), date)) AS sessions,
   COUNT(DISTINCT
      CASE
        WHEN totals.bounces = 1 THEN CONCAT(fullvisitorId, CAST(visitId AS string), date)
    END
      ) AS bounces
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
    UNNEST (hits) hits
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
    AND totals.visits = 1)
  
  

Average Session Duration

This one gets difficult, depending on how you have unnested your tables. If you don’t unnest anything you can use avg(totals.timeOnSite) nice and easy. If you do this when you have unnested hits though you will get a double up for all the users who had more than one hit, the table will look like this:

fullvisitorIdvisitnumberhitNumbertimeOnSite
000156346993487637211273
000156346993487637212273
000156346993487637213273
000156346993487637214273
000156346993487637215273
000156346993487637216273
00015634699348763722113
00015634699348763722213
0014201933474364573115
0014201933474364573125

So in this case where we are unnesting hits we need to divide the time on site by total hits so we don’t double count.


SELECT
  total_session_duration/sessions AS avg_session_duration
FROM (
  SELECT
    COUNT(DISTINCT CONCAT(fullvisitorId, CAST(visitId AS string), date)) AS sessions,
    SUM(totals.timeOnSite/ totals.hits ) AS total_session_duration
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
    UNNEST (hits) hits
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
    AND totals.visits = 1)
    
    

Number of Sessions per User

The rest are calculations of the other metrics, this is just sessions/users


SELECT
  round(sessions/users,2) AS sessions_per_users
FROM (
  SELECT
    COUNT(DISTINCT fullvisitorId) AS users,
    COUNT(DISTINCT CONCAT(fullvisitorId, CAST(visitId AS string), date)) AS sessions,
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
    UNNEST (hits) hits
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
    AND totals.visits = 1)
    
    

Pages per Session

As the name suggests this is pageviews/sessions.

SELECT
  round(pageviews/sessions,2) AS pageviews_per_session
FROM (
  SELECT
    COUNT(DISTINCT CONCAT(fullvisitorId, CAST(visitId AS string), date)) AS sessions,
    COUNT(DISTINCT CASE WHEN hits.type = "PAGE" THEN CONCAT(fullvisitorID, cast(visitId as STRING), date, hits.hitNumber) END) as pageviews
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
    UNNEST (hits) hits
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
    AND totals.visits = 1)
    
    

Putting it all together

So now if you want them all in one query, here you go:


  SELECT
  users,
  new_users,
  sessions,
  round(sessions/users,2) AS sessions_per_users,
  round(pageviews/sessions,2) AS pageviews_per_session,
  pageviews,
  round(total_session_duration/sessions,2) as avg_session_duration_secs,
  round(bounces/sessions*100,2) AS bounceRate
FROM (
  SELECT
    COUNT(DISTINCT fullvisitorId) AS users,
    COUNT(DISTINCT
      CASE
        WHEN visitNumber = 1 THEN fullvisitorId
    END
      ) AS new_users,
    COUNT(DISTINCT CONCAT(fullvisitorId, CAST(visitId AS string), date)) AS sessions,
    COUNT(DISTINCT CASE WHEN hits.type = "PAGE" THEN CONCAT(fullvisitorID, cast(visitId as STRING), date, hits.hitNumber) END) as pageviews,
    sum(totals.timeOnSite/ totals.hits ) AS total_session_duration,
    COUNT(DISTINCT
      CASE
        WHEN totals.bounces = 1 THEN CONCAT(fullvisitorId, CAST(visitId AS string), date)
    END
      ) AS bounces
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
    UNNEST (hits) hits
  WHERE
    _table_suffix BETWEEN "20170701"
    AND "20170731"
    AND totals.visits = 1)
    
    
Previous
Previous

Matching Big Query Data with Google Analytics - Acquisition Channel report

Next
Next

How to do CRO without being a wanker