Configuration Options#

This page explains each configuration option in detail.

For documentation on actual functions, see the Reference.

Overview#

Remember, the basic steps for your configuration file are:

  • output: Set up your output options.

  • input: (Optional) Set any input options, for changes you want to make to data after you input it, but before you run your queries.

  • tables_config: Define your tables you want to query.

  • import: (Optional) If you want to apply a custom Python function to any of your queries, import your code file(s).

  • queries: Define your queries.

Does Order Matter?#

Order does not matter for keys, including top-level keys like output:. You can put the output: block at the top, at the end, wherever you want.

Within a block, you can order the keys however you want. E.g., each table in tables_config: is a key, so it doesn’t matter which table you define first.

But order does matter for list items. E.g., within import:, it does matter which order you import sources. The order is up to you, and for most reports, the order probably won’t affect the outcome. You just need to be aware that list items are necessarily processed in the order you write them, while keys aren’t.

See When Does Order Matter?

output:#

REQUIRED.

Configures your final report.

output:
  dir: output
  basename: BASENAME
  export_tables: csv
  export_queries: csv
  styles:
    column_width: 15

dir:#

REQUIRED. The output directory for your report and other exports.

If the directory does not exist, it will be created.

Important

This path is relative to your config file. See Are You in the Right Directory?

Tip

When you rerun a report, you will usually need to overwrite the output from your last report. If you don’t want to manually confirm each overwrite, add -f or --force.

basename:#

REQUIRED. The basename for your output file(s).

The way this basename is used depends on your export choices. See export_tables: and export_queries: below.

export_tables:#

Optional. Output format for your tables.

If omitted, no tables are exported.

csv

Export each table as a separate CSV file, named with the table’s name: key.

xlsx

Export each table as a separate sheet in a single spreadsheet named tables.xlsx.

All files are exported to the output directory set in dir: above.

export_queries:#

Optional. Output format for your queries.

If omitted, defaults to xlsx.

csv

Export each query as a separate CSV file.

xlsx

Export each query as a separate sheet in a single spreadsheet named with basename: (see above) and extension xlsx.

styles:#

Optional. Options for formatting your output.

column_width:#

Optional. Set the width for columns in your output spreadsheet.

This should be an integer.

If omitted, this will be an internal default.

TODO#

Add more output styling options?

input:#

Optional. Options for modifying your input tables before you run the queries.

input:
  strip: false
  slugify_columns: false
  lowercase_columns: false
  uppercase_rows: false
  include_index: false

Important

In general, if you omit any of these options, your data will be unchanged. Yarm only alters your data if you tell it to.

Since you can omit this entire block, all options in this section are optional.

strip:#

If omitted, defaults to false.

true

Strip all whitespace from beginning and end of each data cell.

false

Data is unchanged.

slugify_columns:#

If omitted, defaults to false.

true

“Slugify” all column names. This replaces all punctuation and spaces with an underscore. Rows stay unchanged.

false

Data is unchanged.

See also

This option does not change case. To also lowercase your columns, see lowercase_columns:

Attention

If you alter your column names now, be sure to use these altered names later, not only in your queries: but also if you reference any columns in tables_config:, i.e. with datetime: or pivot:.

lowercase_columns:#

If omitted, defaults to false.

true

Lowercase all column names. Rows stay unchanged.

false

Data is unchanged.

See also

You may also want to use slugify_columns:

Attention

If you alter your column names now, be sure to use these altered names later, not only in your queries: but also if you reference any columns in tables_config:, i.e. with datetime: or pivot:.

uppercase_rows:#

If omitted, defaults to false.

true

UPPERCASE the data in all rows. Column names stay unchanged.

false

Data is unchanged.

Note

Uppercasing makes your data more consistent, which can be useful for data like addresses.

include_index:#

If omitted, defaults to false, except for tables which use pivot:. Pivot tables default to true (see below.)

true

Include a separate column for the index in the final output.

false

Silently omit the index column.

Important

You can override this value for each particular table in tables_config:.

When each table is created in the database, an index column is also created. Usually, this index is added automatically, so it’s confusing and unnecessary to see it in the final output.

Pivot tables are different, because in order to pivot a table, we manually set one of the original columns as the index. In this case, we default to true, so that this original column is kept in our final output.

tables_config:#

REQUIRED. Define one or more tables of source data.

tables_config:
  TABLE_NAME_A:
    - path: SOURCE_A1.csv
      include_index: false
    - path: SOURCE_A2.csv
  TABLE_NAME_B:
    - path: SOURCE_B.xlsx
      sheet: B.1
      pivot:
        index: ID_COLUMN
        columns: KEY_COLUMN
        values: VALUE_COLUMN
      datetime:
        # You can supply a custom format string.
        COLUMN_1: "%Y-%m"
        # To use default datetime format, omit format string.
        COLUMN_2:
        # Spaces or punctuation in the column name? Add quotes.
        "COLUMN 3":

Each Table Needs a Unique Name#

In this example, we define two tables, TABLE_NAME_A and TABLE_NAME_B. (Your tables will probably have slightly more normal names, like Customers.)

Each table name must be unique, distinct not only from other tables but also from your queries (see query:name:).

Avoid spaces and punctuation. You’ll be referencing these names in your queries, and if you don’t use simple table names, you’ll need to quote them in the query, which is a needless hassle.

Each Table Is a List of Sources#

Underneath each table, you have a list of one or more sources.

Each source is one table of data. Such as:

  • A CSV file

  • One sheet in a spreadsheet

Most tables probably only need one source. You take a CSV file or a sheet in a spreadsheet, and you turn it into a table. Simple.

Multiple Sources Get Merged#

But sometimes you want to combine two or more CSVs or sheets into a single table. You can do this, as long as all the sources share at least one column in common.

Behind the scenes, each source gets merged as an outer join (see yarm.tables.input_source()), and this requires a shared column.

In the example config above, both SOURCE_B1.csv and SOURCE_B2.csv get merged into a single table, TABLE_NAME_B.

Source Options#

Each source must have, at minimum, a path:. But you have other options as well.

Remember, each source is a separate list item beneath its table.

Table Source Order of Operations#

No matter what order you place these keys within each table source, the operations run in this order:

  1. The file in path: (and, for a spreadsheet, the sheet: for this path), is read into the database.

  2. If pivot: is defined, the table is pivoted. (This will usually create many new columns.)

  3. If datetime: is defined, those columns are formatted.

See also

Table operations order is set in: yarm.tables.df_tables_config_options()

path:#

REQUIRED. Path to source file.

Important

This path is relative to your config file. See Are You in the Right Directory?

Valid file formats for source paths:#

  • .csv

  • .xlsx

sheet:#

Optional. Name or number of sheet in spreadsheet.

Only relevant for spreadsheets.

If omitted, the first sheet will be used.

Important

Normally, you’ll probably use the name of the spreadsheet. If you opt to use the number, note that numbering starts at 0. The first sheet is 0, the second is 1, etc.

pivot:#

Optional. Pivot a table.

tables_config:
  TABLE_NAME_A:
    - path: SOURCE_A.xlsx
      pivot:
        index: POST_ID
        columns: META_KEY
        values: META_VALUE

Attention

If you altered your original column names in input:, e.g. with slugify_columns: or lowercase_columns:, then you must use those altered column names now. If you try to use the original names, you’ll get an error.

Remember, the order for these blocks does not matter. Even if your input: block is at the end of your file, it will still be processed first.

How Does Pivoting Work?#

Pivoting is easiest to understand with an example.

In WordPress, there’s a table called postmeta, used for storing extra meta information about the posts. In theory, the designers could have tried to predict every kind of information that anyone would ever want to attach to their posts… plus every third-party plugin that would ever be written… and then made sure to add enough columns to cover every possible use case.

Right.

Instead, postmeta is extremely flexible. Each row has only four columns:

  • META_ID

  • POST_ID

  • META_KEY

  • META_VALUE

A few random rows from this table might look like this:

META_ID

POST_ID

META_KEY

META_VALUE

1000

42

export_format

pdf

1001

42

shareability

97

1002

42

oxford_commas

18

1003

43

export_format

docx

1004

43

puppy_mentions

0

1005

43

shareability

31

1006

44

export_format

calligraphy

1007

44

unironic

1

As you can see, each row is bound to a particular post by POST_ID, and META_KEY is basically a column name.

But trying to query this table would be a pain. It would be much easier if the data looked like this:

POST_ID

export_format

shareability

oxford_commas

puppy_mentions

unironic

42

pdf

97

18

NULL

NULL

43

docx

31

NULL

0

NULL

44

calligraphy

NULL

NULL

NULL

1

Which is what this bit of pivot: config will do:

tables_config:
  TABLE_NAME_A:
    - path: SOURCE_A.xlsx
      pivot:
        index: POST_ID
        columns: META_KEY
        values: META_VALUE

pivot: and include_index:#

When a source is pivoted, the ref:include_index: <table-include-index> for the entire table is automatically set to true. This is because the index column was present in your original data, so you probably expect to see it in the output.

If you don’t want to see that index column, add include_index: false to this source.

It’s not enough to set the overall include_index to false up in input:, because that already defaults to false, except for pivot tables.

A Pivot Only Affects This Source#

If your table includes multiple sources, only those sources with a pivot: block will be pivoted.

But you need to ensure that the non-pivoted source and the pivoted source (after its pivot) share at least one column, so they can merge. See Multiple Sources Get Merged.

include_index:#

Optional. Override the overall value of include_index: for this table.

true

For this table, include a separate column for the index in the final output.

false

For this table, Silently omit the index column.

Important

Unlike other keys in a source, this key affects the entire table. So you can only set it once in any particular table. It doesn’t matter which source you choose. If you try to set include_index: on more than one source in a table, you’ll get an error.

datetime:#

Optional. One or more columns that should be converted to datetime format.

Each column should be added as a key.

If you omit any value for the key, then the column will be converted to the default datetime format.

If you include a format string as the value, the column will formatted with that string. (See Formatting Codes.)

tables_config:
  TABLE_NAME_A:
    - path: SOURCE_A.xlsx
      datetime:
        # You can supply a custom format string.
        COLUMN_1: "%Y-%m"
        # To use default datetime format, omit format string.
        COLUMN_2:
        # Spaces or punctuation in the column name? Add quotes.
        "COLUMN 3":

Important

Because these are keys in the same block, each column name must be unique.

Attention

If you altered your original column names in input:, e.g. with slugify_columns: or lowercase_columns:, then you must use those altered column names now. If you try to use the original names, you’ll get an error.

Remember, the order for these blocks does not matter. Even if your input: block is at the end of your file, it will still be processed first.

Warning

Similarly, if you pivot: the table, you will get a whole new set of columns than the original data. You need to use those new** column names here, not the column names in the original data.

import:#

Optional, but you need it if you set a postprocess: function for a query.

import:
  - path: MODULE_A.py
  - path: MODULE_B.py

Warning

If more than one module in this list defines the same function, the later module in the list will silently override the previous definition.

This may be desired behavior, but only if you expect it.

path:#

REQUIRED. Path to Python module (.py file) where you have defined your postprocess: function(s).

Important

This path is relative to your config file. See Are You in the Right Directory?

Note

You can define more than one function in a file, so you normally will only need one path:.

queries:#

REQUIRED, unless you set output: export_csv. (You need to output something.)

queries:
  - name: QUERY A
    sql: SELECT * FROM table_from_spreadsheet AS s;

  - name: QUERY B
    # For the SQL, you can use a multiline string for readability.
    sql: >
      SELECT
      *
      FROM
      table_from_spreadsheet AS s
      JOIN
      table_from_csv AS c
      ON
      s.id = c.id
      ;
    replace:
      COLUMN_A:
        MATCH A1: REPLACE A1
        # You may want to quote strings with spaces and punctuation.
        "MATCH A2": "REPLACE A2"
      COLUMN_B:
        MATCH B1: REPLACE B1
    postprocess: postprocess_function

Warning

Unlike tables in tables_config:, each query is a list item. See Each Query is a List Item. The order in which you list the queries matters, at least if you want a later query to refer to an earlier query. The order of keys within each query does not matter, because the keys are always processed in the same order. See When Does Order Matter?

In this block, you define one or more queries.

Each query will be output to either a single CSV file or a sheet in a spreadsheet.

Within each query, some keys are REQUIRED, while others are Optional.

Query Order of Operations#

No matter what order you place these keys, the operations run in this order:

  1. The statement in sql: is run, generating the query result.

  2. If defined, the replace: items are processed.

  3. Last of all, if it’s defined, the data is run through the postprocess: function.

See also

Query operations order is set in: yarm.queries.query_options()

name:#

REQUIRED. Name for this query.

The name is used:

  • As the name of the database table that holds this query.

  • When output to a spreadsheet: as the name of the sheet that contains this query.

  • When output to CSV: as the filename for this CSV.

Important

Use a unique name for each query. Do not use a name that you’ve already used for another query, or even a table in tables_config: Conflicting names will cause problems for the internal database. Case does not matter here, so QUERY A and query A would conflict.

sql:#

REQUIRED. The SQL statement for this query.

For readability, this can be a multiline string. See Watch Out for the Multiline String in sql:.

Warning

If you have altered your data in any earlier options, such as input: or tables_config:, make sure you query this altered data.

For instance, if you used pivot:, you need to SELECT from the new columns you’ve created, not the columns in the original data.

By contrast, all other options in this block run after this SQL statement. If you use replace: or postprocess:, they can only operate on

replace:#

Optional. After you have run the query in sql:, you can find and replace data within particular columns.

queries:
  - name: QUERY A
    sql: SELECT * FROM table_from_spreadsheet AS s;

  - name: QUERY B
    # For the SQL, you can use a multiline string for readability.
    sql: >
      SELECT
      *
      FROM
      table_from_spreadsheet AS s
      JOIN
      table_from_csv AS c
      ON
      s.id = c.id
      ;
    replace:
      COLUMN_A:
        MATCH A1: REPLACE A1
        # You may want to quote strings with spaces and punctuation.
        "MATCH A2": "REPLACE A2"
      COLUMN_B:
        MATCH B1: REPLACE B1
    postprocess: postprocess_function

Important

Each match and replace happens within one column. If you need to find and replace the same patterns across multiple columns, you will need to define them separately for each column.

Warning

If you have altered the data at any point (e.g. uppercase_rows:), you’ll need to match that altered data here.

Note that in this block, we do not use any lists.

Each column is a key, and within each column, each match pattern is a key.

Replacements Can Occur In Any Order#

Because we’re using matches as keys, not list items, replacements can occur in any order.

This has not been a problem for me yet, because I haven’t needed to do cascading replacements that depend on earlier replacements.

This block is intended for casual, cosmetic replacements for reports, like replacing member_level1 with Bronze.

If your needs are more complex, you may want to write a postprocess function.

That said, if I get feedback that a strict order of replacements is desirable, I’m open to adding the extra code to allow processing this block as a list (or list of lists).

postprocess:#

Optional. After you have run the query in sql: and applied any other alterations (currently just replace:), run the data through this custom function you have defined.

Important

This function must be defined in one of the .py files imported with import:

See also

For full details and examples on writing a postprocess function, see: Postprocess with Custom Code