Hydra supports querying data in your data lake in multiple formats:

All function options/parameters supported by these functions in DuckDB are supported. To use the parameters, use the Postgres optional parameter syntax with the => operator. For example, sample_size => 20_000.

Specifying Return Types

For all formats, Postgres requires that you specify the types of the columns used in your query using the AS syntax following the function. For example, with Parquet:

-- you need only specify columns used in your query
SELECT MIN(UserID)
FROM read_parquet('s3://.../hits.parquet')
  AS (UserID bigint);

If no columns are being explicity used, you must specify at least one (any) column.

-- if no columns are being explicitly used, specify any column
SELECT COUNT(*)
FROM read_parquet('s3://.../hits.parquet')
  AS (UserID bigint);

Parquet

Reading

-- specific file
SELECT * FROM read_parquet('s3://.../hits.parquet') AS (UserID bigint);

Glob and array syntax are supported:

-- glob syntax
SELECT * FROM read_parquet('s3://.../*.parquet') AS (UserID bigint);

-- array syntax
SELECT * FROM read_parquet([
  's3://.../hits1.parquet',
  's3://.../hits2.parquet'
]) AS (UserID bigint);

Please see the DuckDB documentation for read_parquet for further usage information.

Writing

You can copy an entire table or the results from a query to a Parquet file using COPY:

-- copy entire table
COPY some_table TO 's3://your-bucket/some_table.parquet';

-- copy results from a query
COPY (
  SELECT * FROM some_table WHERE foo > 1
)
TO 's3://your-bucket/some_table_query.parquet';

CSV

Reading

To read a CSV, use the read_csv function. All options supported by DuckDB are supported.

SELECT *
FROM read_csv('s3://your-bucket/flights.csv',
    delim => '|',
    header => true)
AS (FlightDate DATE,
    UniqueCarrier VARCHAR,
    OriginCityName VARCHAR,
    DestCityName VARCHAR);

Writing

You can copy an entire table or the results from a query to a CSV file using COPY:

-- copy entire table
COPY some_table TO 's3://your-bucket/some_table.csv';

-- copy results from a query
COPY (
  SELECT * FROM some_table WHERE foo > 1
)
TO 's3://your-bucket/some_table_query.csv';

Iceberg

Unlike Parquet and CSV, Iceberg is referred to by directory. To read Iceberg, use iceberg_scan to read a directory containing a valid Iceberg table. Iceberg support is read-only.

To enable Iceberg support, you must first enable the DuckDB extension:

SELECT duckdb.install_extension('iceberg');

Then query using iceberg_scan:

SELECT MIN(UserID)
FROM iceberg_scan('s3://.../iceberg/hits')
  AS (UserID bigint);

Delta

Delta support is read-only. To enable Delta support, you must first enable the DuckDB extension:

SELECT duckdb.install_extension('delta');

Then query using delta_scan:

SELECT MIN(UserID)
FROM delta_scan('s3://some/delta/table')
  AS (UserID bigint);