All Articles

Getting Started with Google's BigQuery

In my internship at ShareChat, I work at the Data Platform Team. Since this team also looks after Data Warehousing aspect, we spend a lot of time on BigQuery.

BigQuery is a fully managed, PB-scale data warehouse that enables super-fast SQL queries using the processing power of GCP’s infrastructure. I like to think of BQ as a powerful tool for storing and analyzing large amounts of data, and the best part about it is that it’s incredibly easy to use, even for someone with no prior experience with data analysis. But like any other platform, BQ has it’s downsides too.

Let’s dissect BQ - one feature at a time!

Easy to use & scalable

  1. Since BQ is fully managed Paas, you can store/ analyze hundreds of TBs of data without having to worry about the underlying infrastructure.
  2. BQ can handle both structured and semi-structured data. This means you can store and analyze data in a variety of formats, including JSON and Avro.
  3. You don’t need to provision storage resources or reserve units of storage. BigQuery automatically allocates storage for you when you load data into the system - this is a boon & a bane both :)

Performance & Features!

  1. Partitioning and Clustering: You can partition & cluster your tables to drastically improve query performance by reducing the amount of data that needs to be scanned.
  2. Columnar Database: BigQuery stores table data in columnar format, meaning it stores each column separately. Column-oriented databases are particularly efficient at scanning individual columns over an entire dataset. So, a SELECT * from Table query is highly inefficient in BQ, you should rather call out only the required columns.
  3. Preview: A peek into the Table’s data without having to query it.
  4. Caching: BQ writes all query results to temp tables (cached results tables). Meaning, if you rerun the same query again, without any changes & if the data in the underlying table(s) that you’re querying hasn’t changed - cached results are reused to show output. So you aren’t charged for this repeated query of yours!
  5. Metadata: How does one monitor & track usage, activity & performance of Tables & queries in a Data Warehouse? Through Metadata! BQ metadata is pretty useful to Data Engg Teams. Here is detailed documentation on how to access this metadata. (More on how to Leverage BQ metadata, soon!)
  6. BQ CLI: BigQuery has a python based command line tool that can also be accessed and operated

Compatible

  1. BigQuery integrates with other Google Cloud services like Google Cloud Storage (GCS), Google Drive, and Google Sheets, which makes it easy to transfer data between these services. This enables you to use BigQuery as a central repository for all your data.
  2. BigQuery also supports data streaming (SKU “Streaming Insert”), which enables you to stream data into BigQuery in real-time.
  3. There are various ways in which you can ingest data in BQ Tables.

Downsides!!

  1. Cost: While BigQuery is relatively inexpensive for small amounts of data and occasional queries, it can become quite expensive for businesses that generate a large amount of data or need to run complex, frequent queries. The cost of storage and query operations can quickly add up, and it’s important to carefully monitor and optimize your usage to keep costs under control.
  2. Limited control over infrastructure: Since BigQuery is a fully managed service, you don’t have direct control over the underlying infrastructure. This means you can’t make custom configurations or fine-tune the performance of the service.
  3. Limited data transformation capabilities: BigQuery is primarily a data warehousing and querying platform, and it has limited capabilities for data transformation. If you need to perform complex data transformations, you may need to use other tools in conjunction with BigQuery.
  4. Limited data privacy controls: As BigQuery is a cloud-based service, you may have concerns about data privacy and security. While BigQuery provides various security features, such as encryption and access controls, it may not provide the level of control over data privacy that some organizations require.