Views can be a useful way to encapsulate queries against your data lake, or complex queries that combine your local Postgres data with your data lake. Views you define in Postgres are supported by DuckDB and can query your data lake.

Example

I want to run some analytics on my sales data and combine it with new transactions as they come in. In this example, I will have active data in public.sales, data stored in my data lake and referenced through the data_lake schema, and finally an analytics schema that combines data from both sources.

First, let's create our schemas:

CREATE SCHEMA data_lake;
CREATE SCHEMA analytics;

The next step is to copy my sales data to S3, and create a view to encapsulate querying it:

COPY sales TO 's3://bucket/sales_0.parquet';

CREATE VIEW data_lake.sales AS
SELECT * FROM read_parquet('s3://bucket/sales_*.parquet') AS
(id BIGINT, userid BIGINT, skuid BIGINT, sold_at TIMESTAMP, quantity BIGINT,
 unit_price DECIMAL(12,2), total_amount DECIMAL(12,2));
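
As a quick sanity check, the view can be queried like any other relation; a simple count confirms the exported file is readable:

-- Optional check: the count should match the rows just exported to S3.
SELECT COUNT(*) FROM data_lake.sales;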

Now, let's store the maximum ID that was copied into a table so we can use it in our next view. The TRUNCATE makes this step safe to re-run on later refreshes:

CREATE TABLE data_lake.sales_metadata (max_id bigint);
TRUNCATE data_lake.sales_metadata;
INSERT INTO data_lake.sales_metadata SELECT MAX(id) FROM data_lake.sales;

I can now create a view that combines both sources to get all sales in real time. Because the id filter excludes rows that have already been copied to the data lake, UNION ALL combines them without introducing duplicates:

CREATE VIEW analytics.sales AS
SELECT * FROM data_lake.sales
UNION ALL
SELECT * FROM public.sales WHERE id > (SELECT max_id FROM data_lake.sales_metadata);

Using this view, I can query the combined data and join it with other Postgres tables as well, like my skus table:

WITH best_products_q3 AS (
  SELECT skuid, SUM(total_amount) AS total
  FROM analytics.sales
  WHERE sold_at >= '2024-07-01' AND sold_at < '2024-10-01'
  GROUP BY skuid
  ORDER BY total DESC
  LIMIT 100
)
SELECT best_products_q3.skuid, total, skus.name
FROM best_products_q3
INNER JOIN skus ON best_products_q3.skuid = skus.skuid;

Using this pattern, you can update your data lake with new data on a daily, weekly, or monthly basis.
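
As a minimal sketch of such a refresh, assuming the next export is written to a hypothetical sales_1.parquet (so it matches the sales_*.parquet glob used by data_lake.sales) and that COPY accepts a query here the same way it accepts a table name above:

-- Export rows added since the last refresh to a new Parquet file.
-- The file name sales_1.parquet is only an example; any name matching
-- the sales_*.parquet glob will be picked up by data_lake.sales.
COPY (
  SELECT * FROM public.sales
  WHERE id > (SELECT max_id FROM data_lake.sales_metadata)
) TO 's3://bucket/sales_1.parquet';

-- Advance the high-water mark so analytics.sales stops reading
-- the newly exported rows from public.sales.
TRUNCATE data_lake.sales_metadata;
INSERT INTO data_lake.sales_metadata SELECT MAX(id) FROM data_lake.sales;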

Limitations

DuckDB can query from materialized views, but the query that populates a materialized view is executed by Postgres. As a result, materialized views cannot query your data lake. This limitation will be addressed in the future.
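
For example, a statement like the following (the name analytics.sales_snapshot is hypothetical) is affected, because its populating SELECT would run in Postgres rather than DuckDB:

-- Not currently supported: the populating query is executed by Postgres,
-- so it cannot read from the data lake.
CREATE MATERIALIZED VIEW analytics.sales_snapshot AS
SELECT * FROM data_lake.sales;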