The GA4 update: matching big query data with Google analytics

UPDATE: to get the right number of sessions you need to contact user psudeo id and ga session id

After my series of matching big query data with Universal Analytics being helpful. I thought with the forced update to GA4, and the ability for all to get raw data I would start the series again with GA4 data.

Google provides access to their sample GA4 data from their merchandise store: https://www.googlemerchandisestore.com/. The data set is bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*

This post will look at:

  • Users

  • New Users

  • Sessions

  • Number of Session per User

  • Page Views

  • Pages/Sessions

  • Avg. Session Duration

  • Bounce Rate

Users

Though users should be fairly easy to calculate because GA4 is all user-based, there could be a discrepancy between BQ and GA4 if you use Google Signals. See this article: https://support.google.com/analytics/answer/9445345?hl=en#zippy=%2Cin-this-article

As Google states:

Google Analytics 4 exports event data associated with anonymous cookies to BigQuery. Theoretically, this can result in the same user being counted multiple times across different devices. In contrast, Google signals measures users across devices. Therefore, data exported to BigQuery might show more users when compared with reports based on Google signals data.

But let’s keep things simple and assume that you didn’t activate Google Signals. Here we are getting the users for the month of January 2021

SELECT
  COUNT(DISTINCT user_pseudo_id) AS user_count,
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
where _table_suffix BETWEEN "20210101" and "20210131"

New Users

When people say that GA4 is event based, you better believe it. Everything is an event. Event if it is new users. So if you want to could how many new users you count the user ids that trigger that event:

SELECT
  count(distinct case when event_name = "first_visit" then user_pseudo_id end) as new_users
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
where _table_suffix BETWEEN "20210101" and "20210131"

Sessions


Also your sessions are events, but just make sure you count the timestamp and the user id or you would just get the number of users.

SELECT
  count(distinct case when event_name = "session_start" then 
      concat(user_pseudo_id,event_timestamp) end) as sessions
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
where _table_suffix BETWEEN "20210101" and "20210131"

Pageviews

Though I find this metric useless without context (like which pages) some people like it as it usually a big number you can shout about. Again your page views are events so you can do the same as sessions.

SELECT
  count(distinct case when event_name = "page_view" then 
      concat(user_pseudo_id,event_timestamp) end) as page_views
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
where _table_suffix BETWEEN "20210101" and "20210131"

Bounces and Bounce Rate

The bounce rate has been removed from GA4 in favour of the engagement rate (See below). Some people will still want bounce rate though. Here we are assuming that a bounce is if they only viewed 1 page, as there isn’t interactive/non-interactive events so we don’t want to count events.

SELECT
  count(distinct case when page_views = 1 then ga_session_id end)
  /count(distinct concat(user_pseudo_id, ga_session_id)) as session_bounce_rate
from(
  SELECT
    user_pseudo_id,
    (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id,
    count(distinct case when event_name = "page_view" then concat(user_pseudo_id,event_timestamp) end) as page_views
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  where _table_suffix BETWEEN "20210101" and "20210131"
  group by 1,2)

Engagement rate

So what we really should be using is engagement rate. Which is easier to get than bounce rate.

SELECT
  sum(cast(session_engaged as int))/
      count(distinct concat(user_pseudo_id, ga_session_id)) as session_engagement_rate
from(
  SELECT
    user_pseudo_id,
    (select value.int_value from unnest(event_params) 
           where key = 'ga_session_id') as ga_session_id,
    max((select value.string_value from unnest(event_params) 
          where key = 'session_engaged')) as session_engaged,
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  where _table_suffix BETWEEN "20210101" and "20210131"
  group by 1,2)

Average Session Duration

Again this has changed slightly to engaged session duration. This will be lower than your Universal Analytics session duration as it only counts when the tab is in focus. Below I show how to do both.

SELECT
  sum(engagement_time_msec)/1000 #in milliseconds
  /count(distinct concat(user_pseudo_id,ga_session_id)) as ga4_session_duration,
  sum(end_time-start_time)/1000000 #timestamp in microseconds
  /count(distinct concat(user_pseudo_id,ga_session_id)) as ua_session_duration,

from(
  SELECT
    user_pseudo_id,
    (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id,
   max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec,
   min(event_timestamp) as start_time,
   max(event_timestamp) as end_time
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  where _table_suffix BETWEEN "20210101" and "20210131"
  group by 1,2)

Number of Sessions per User

This one is pretty easy after a couple of hard ones before. Just dived session by user.

SELECT
  count(distinct case when event_name = "session_start" then 
      concat(user_pseudo_id,event_timestamp) end)/ 
      count(distinct user_pseudo_id) as sessions_per_user
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
where _table_suffix BETWEEN "20210101" and "20210131"

Pages per Session

Again fairly similar to the previous - divid page views by sessions.

SELECT
  count(distinct case when event_name = "page_view" then 
      concat(user_pseudo_id,event_timestamp) end)/
      count(distinct case when event_name = "session_start" then 
      concat(user_pseudo_id,event_timestamp) end) as page_per_session
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
where _table_suffix BETWEEN "20210101" and "20210131"

Putting it all together

So now if you want them all in one query, this time I’m using with clauses for the nested queries, here you go:

with session_details as (
  SELECT
  count(distinct case when page_views = 1 then ga_session_id end)
  /count(distinct concat(user_pseudo_id, ga_session_id)) as session_bounce_rate,
  sum(cast(session_engaged as int))/
      count(distinct concat(user_pseudo_id, ga_session_id)) as session_engagement_rate,
  sum(engagement_time_msec)/1000 #in milliseconds
    /count(distinct concat(user_pseudo_id, ga_session_id)) as ga4_session_duration,
  sum(end_time-start_time)/1000000 #timestamp in microseconds
    /count(distinct concat(user_pseudo_id, ga_session_id)) as ua_session_duration
from(
  SELECT
    user_pseudo_id,
    (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id,
    count(distinct case when event_name = "page_view" then concat(user_pseudo_id,event_timestamp) end) as page_views,
     max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged,
      max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec,
   min(event_timestamp) as start_time,
   max(event_timestamp) as end_time
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  where _table_suffix BETWEEN "20210101" and "20210131"
  group by 1,2)
)


SELECT
  COUNT(DISTINCT user_pseudo_id) AS user_count,
  count(distinct case when event_name = "first_visit" then 
      user_pseudo_id end) as new_users,
  count(distinct case when event_name = "session_start" then 
      concat(user_pseudo_id,event_timestamp) end) as sessions,
  count(distinct case when event_name = "page_view" then 
      concat(user_pseudo_id,event_timestamp) end) as page_views,
  max(session_bounce_rate) as session_bounce_rate,
  max(session_engagement_rate) as session_engagement_rate,
  max(ga4_session_duration) as ga4_session_duration,
  max(ua_session_duration) as ua_session_duration,
  count(distinct case when event_name = "session_start" then 
      concat(user_pseudo_id,event_timestamp) end)/ 
      count(distinct user_pseudo_id) as sessions_per_user,
  
  count(distinct case when event_name = "page_view" then 
      concat(user_pseudo_id,event_timestamp) end)/
      count(distinct case when event_name = "session_start" then 
      concat(user_pseudo_id,event_timestamp) end) as page_per_session
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
cross join (select * from session_details) 
where _table_suffix BETWEEN "20210101" and "20210131"

And there we are. I will keep on updating the queries from the UA series. If there is anything in particular you would like let me know.

Previous
Previous

Case Study: Time = Money for Biophilia Designs

Next
Next

Case study: Ecommerce store - The waiting