Configuration Options#

This page explains each configuration option in detail.

For documentation on actual functions, see the Reference.

Overview#

Remember, the basic steps for your configuration file are:

output: Set up your output options.
input: (Optional) Set any input options, for changes you want to make to data after you input it, but before you run your queries.
tables_config: Define your tables you want to query.
import: (Optional) If you want to apply a custom Python function to any of your queries, import your code file(s).
queries: Define your queries.

Does Order Matter?#

Order does not matter for keys, including top-level keys like output:. You can put the output: block at the top, at the end, wherever you want.

Within a block, you can order the keys however you want. E.g., each table in tables_config: is a key, so it doesn’t matter which table you define first.

But order does matter for list items. E.g., within import:, it does matter which order you import sources. The order is up to you, and for most reports, the order probably won’t affect the outcome. You just need to be aware that list items are necessarily processed in the order you write them, while keys aren’t.

See When Does Order Matter?

`output:`#

REQUIRED.

Configures your final report.

output:
  dir: output
  basename: BASENAME
  export_tables: csv
  export_queries: csv
  styles:
    column_width: 15

Note

Defined in: yarm.validate.validate_key_output()

`dir`:#

REQUIRED. The output directory for your report and other exports.

If the directory does not exist, it will be created.

Important

This path is relative to your config file. See Are You in the Right Directory?

Tip

When you rerun a report, you will usually need to overwrite the output from your last report. If you don’t want to manually confirm each overwrite, add -f or --force.

`basename`:#

REQUIRED. The basename for your output file(s).

The way this basename is used depends on your export choices. See export_tables: and export_queries: below.

`export_tables`:#

Optional. Output format for your tables.

If omitted, no tables are exported.

csv: Export each table as a separate CSV file, named with the table’s name: key.
xlsx: Export each table as a separate sheet in a single spreadsheet named tables.xlsx.

All files are exported to the output directory set in dir: above.

`export_queries`:#

Optional. Output format for your queries.

If omitted, defaults to xlsx.

csv: Export each query as a separate CSV file.
xlsx: Export each query as a separate sheet in a single spreadsheet named with basename: (see above) and extension xlsx.

`styles`:#

Optional. Options for formatting your output.

`column_width`:#

Optional. Set the width for columns in your output spreadsheet.

This should be an integer.

If omitted, this will be an internal default.

TODO#

Add more output styling options?

`input:`#

Optional. Options for modifying your input tables before you run the queries.

input:
  strip: false
  slugify_columns: false
  lowercase_columns: false
  uppercase_rows: false
  include_index: false

Note

Defined in: yarm.validate.validate_key_input()

Important

In general, if you omit any of these options, your data will be unchanged. Yarm only alters your data if you tell it to.

Since you can omit this entire block, all options in this section are optional.

`strip:`#

If omitted, defaults to false.

true: Strip all whitespace from beginning and end of each data cell.
false: Data is unchanged.

`slugify_columns:`#

If omitted, defaults to false.

true: “Slugify” all column names. This replaces all punctuation and spaces with an underscore. Rows stay unchanged.
false: Data is unchanged.

See also

This option does not change case. To also lowercase your columns, see lowercase_columns:

Attention

If you alter your column names now, be sure to use these altered names later, not only in your queries: but also if you reference any columns in tables_config:, i.e. with datetime: or pivot:.

`lowercase_columns:`#

If omitted, defaults to false.

true: Lowercase all column names. Rows stay unchanged.
false: Data is unchanged.

See also

You may also want to use slugify_columns:

Attention

If you alter your column names now, be sure to use these altered names later, not only in your queries: but also if you reference any columns in tables_config:, i.e. with datetime: or pivot:.

`uppercase_rows:`#

If omitted, defaults to false.

true: UPPERCASE the data in all rows. Column names stay unchanged.
false: Data is unchanged.

Note

Uppercasing makes your data more consistent, which can be useful for data like addresses.

`include_index:`#

If omitted, defaults to false, except for tables which use pivot:. Pivot tables default to true (see below.)

true: Include a separate column for the index in the final output.
false: Silently omit the index column.

Important

You can override this value for each particular table in tables_config:.

When each table is created in the database, an index column is also created. Usually, this index is added automatically, so it’s confusing and unnecessary to see it in the final output.

Pivot tables are different, because in order to pivot a table, we manually set one of the original columns as the index. In this case, we default to true, so that this original column is kept in our final output.

`tables_config:`#

REQUIRED. Define one or more tables of source data.

tables_config:
  TABLE_NAME_A:
    - path: SOURCE_A1.csv
      include_index: false
    - path: SOURCE_A2.csv
  TABLE_NAME_B:
    - path: SOURCE_B.xlsx
      sheet: B.1
      pivot:
        index: ID_COLUMN
        columns: KEY_COLUMN
        values: VALUE_COLUMN
      datetime:
        # You can supply a custom format string.
        COLUMN_1: "%Y-%m"
        # To use default datetime format, omit format string.
        COLUMN_2:
        # Spaces or punctuation in the column name? Add quotes.
        "COLUMN 3":

Note

Defined in: yarm.validate.validate_key_tables_config()

Each Table Needs a Unique Name#

In this example, we define two tables, TABLE_NAME_A and TABLE_NAME_B. (Your tables will probably have slightly more normal names, like Customers.)

Each table name must be unique, distinct not only from other tables but also from your queries (see query:name:).

Avoid spaces and punctuation. You’ll be referencing these names in your queries, and if you don’t use simple table names, you’ll need to quote them in the query, which is a needless hassle.

Each Table Is a List of Sources#

Underneath each table, you have a list of one or more sources.

Each source is one table of data. Such as:

A CSV file
One sheet in a spreadsheet

Most tables probably only need one source. You take a CSV file or a sheet in a spreadsheet, and you turn it into a table. Simple.

Multiple Sources Get Merged#

But sometimes you want to combine two or more CSVs or sheets into a single table. You can do this, as long as all the sources share at least one column in common.

Behind the scenes, each source gets merged as an outer join (see yarm.tables.input_source()), and this requires a shared column.

In the example config above, both SOURCE_B1.csv and SOURCE_B2.csv get merged into a single table, TABLE_NAME_B.

Source Options#

Each source must have, at minimum, a path:. But you have other options as well.

Remember, each source is a separate list item beneath its table.

Table Source Order of Operations#

No matter what order you place these keys within each table source, the operations run in this order:

The file in path: (and, for a spreadsheet, the sheet: for this path), is read into the database.
If pivot: is defined, the table is pivoted. (This will usually create many new columns.)
If datetime: is defined, those columns are formatted.

See also

Table operations order is set in: yarm.tables.df_tables_config_options()

`path:`#

REQUIRED. Path to source file.

Important

This path is relative to your config file. See Are You in the Right Directory?

Valid file formats for source paths:#

.csv
.xlsx

`sheet:`#

Optional. Name or number of sheet in spreadsheet.

Only relevant for spreadsheets.

If omitted, the first sheet will be used.

Important

Normally, you’ll probably use the name of the spreadsheet. If you opt to use the number, note that numbering starts at 0. The first sheet is 0, the second is 1, etc.

`pivot:`#

Optional. Pivot a table.

tables_config:
  TABLE_NAME_A:
    - path: SOURCE_A.xlsx
      pivot:
        index: POST_ID
        columns: META_KEY
        values: META_VALUE

Attention

If you altered your original column names in input:, e.g. with slugify_columns: or lowercase_columns:, then you must use those altered column names now. If you try to use the original names, you’ll get an error.

Remember, the order for these blocks does not matter. Even if your input: block is at the end of your file, it will still be processed first.

How Does Pivoting Work?#

Pivoting is easiest to understand with an example.

In WordPress, there’s a table called postmeta, used for storing extra meta information about the posts. In theory, the designers could have tried to predict every kind of information that anyone would ever want to attach to their posts… plus every third-party plugin that would ever be written… and then made sure to add enough columns to cover every possible use case.

Right.

Instead, postmeta is extremely flexible. Each row has only four columns:

META_ID
POST_ID
META_KEY
META_VALUE

A few random rows from this table might look like this:

META_ID	POST_ID	META_KEY	META_VALUE
1000	42	export_format	pdf
1001	42	shareability	97
1002	42	oxford_commas	18
1003	43	export_format	docx
1004	43	puppy_mentions	0
1005	43	shareability	31
1006	44	export_format	calligraphy
1007	44	unironic	1

As you can see, each row is bound to a particular post by POST_ID, and META_KEY is basically a column name.

But trying to query this table would be a pain. It would be much easier if the data looked like this:

POST_ID	export_format	shareability	oxford_commas	puppy_mentions	unironic
42	pdf	97	18	NULL	NULL
43	docx	31	NULL	0	NULL
44	calligraphy	NULL	NULL	NULL	1

Which is what this bit of pivot: config will do:

tables_config:
  TABLE_NAME_A:
    - path: SOURCE_A.xlsx
      pivot:
        index: POST_ID
        columns: META_KEY
        values: META_VALUE

`pivot:` and `include_index:`#

When a source is pivoted, the ref:include_index: <table-include-index> for the entire table is automatically set to true. This is because the index column was present in your original data, so you probably expect to see it in the output.

If you don’t want to see that index column, add include_index: false to this source.

It’s not enough to set the overall include_index to false up in input:, because that already defaults to false, except for pivot tables.

A Pivot Only Affects This Source#

If your table includes multiple sources, only those sources with a pivot: block will be pivoted.

But you need to ensure that the non-pivoted source and the pivoted source (after its pivot) share at least one column, so they can merge. See Multiple Sources Get Merged.

`include_index:`#

Optional. Override the overall value of include_index: for this table.

true: For this table, include a separate column for the index in the final output.
false: For this table, Silently omit the index column.

Important

Unlike other keys in a source, this key affects the entire table. So you can only set it once in any particular table. It doesn’t matter which source you choose. If you try to set include_index: on more than one source in a table, you’ll get an error.

`datetime:`#

Optional. One or more columns that should be converted to datetime format.

Each column should be added as a key.

If you omit any value for the key, then the column will be converted to the default datetime format.

If you include a format string as the value, the column will formatted with that string. (See Formatting Codes.)

tables_config:
  TABLE_NAME_A:
    - path: SOURCE_A.xlsx
      datetime:
        # You can supply a custom format string.
        COLUMN_1: "%Y-%m"
        # To use default datetime format, omit format string.
        COLUMN_2:
        # Spaces or punctuation in the column name? Add quotes.
        "COLUMN 3":

Important

Because these are keys in the same block, each column name must be unique.

Attention

Remember, the order for these blocks does not matter. Even if your input: block is at the end of your file, it will still be processed first.

Warning

Similarly, if you pivot: the table, you will get a whole new set of columns than the original data. You need to use those new** column names here, not the column names in the original data.

`import:`#

Optional, but you need it if you set a postprocess: function for a query.

import:
  - path: MODULE_A.py
  - path: MODULE_B.py

Note

Defined in: yarm.validate.validate_key_import()

Warning

If more than one module in this list defines the same function, the later module in the list will silently override the previous definition.

This may be desired behavior, but only if you expect it.

`path:`#

REQUIRED. Path to Python module (.py file) where you have defined your postprocess: function(s).

Important

This path is relative to your config file. See Are You in the Right Directory?

Note

You can define more than one function in a file, so you normally will only need one path:.

`queries:`#

REQUIRED, unless you set output: export_csv. (You need to output something.)

queries:
  - name: QUERY A
    sql: SELECT * FROM table_from_spreadsheet AS s;

  - name: QUERY B
    # For the SQL, you can use a multiline string for readability.
    sql: >
      SELECT
      *
      FROM
      table_from_spreadsheet AS s
      JOIN
      table_from_csv AS c
      ON
      s.id = c.id
      ;
    replace:
      COLUMN_A:
        MATCH A1: REPLACE A1
        # You may want to quote strings with spaces and punctuation.
        "MATCH A2": "REPLACE A2"
      COLUMN_B:
        MATCH B1: REPLACE B1
    postprocess: postprocess_function

Note

Defined in: yarm.validate.validate_key_queries()

Warning

Unlike tables in tables_config:, each query is a list item. See Each Query is a List Item. The order in which you list the queries matters, at least if you want a later query to refer to an earlier query. The order of keys within each query does not matter, because the keys are always processed in the same order. See When Does Order Matter?

In this block, you define one or more queries.

Each query will be output to either a single CSV file or a sheet in a spreadsheet.

Within each query, some keys are REQUIRED, while others are Optional.

Query Order of Operations#

No matter what order you place these keys, the operations run in this order:

The statement in sql: is run, generating the query result.
If defined, the replace: items are processed.
Last of all, if it’s defined, the data is run through the postprocess: function.

See also

Query operations order is set in: yarm.queries.query_options()

`name:`#

REQUIRED. Name for this query.

The name is used:

As the name of the database table that holds this query.
When output to a spreadsheet: as the name of the sheet that contains this query.
When output to CSV: as the filename for this CSV.

Important

Use a unique name for each query. Do not use a name that you’ve already used for another query, or even a table in tables_config: Conflicting names will cause problems for the internal database. Case does not matter here, so QUERY A and query A would conflict.

`sql:`#

REQUIRED. The SQL statement for this query.

For readability, this can be a multiline string. See Watch Out for the Multiline String in sql:.

Warning

If you have altered your data in any earlier options, such as input: or tables_config:, make sure you query this altered data.

For instance, if you used pivot:, you need to SELECT from the new columns you’ve created, not the columns in the original data.

By contrast, all other options in this block run after this SQL statement. If you use replace: or postprocess:, they can only operate on

`replace:`#

Optional. After you have run the query in sql:, you can find and replace data within particular columns.

queries:
  - name: QUERY A
    sql: SELECT * FROM table_from_spreadsheet AS s;

  - name: QUERY B
    # For the SQL, you can use a multiline string for readability.
    sql: >
      SELECT
      *
      FROM
      table_from_spreadsheet AS s
      JOIN
      table_from_csv AS c
      ON
      s.id = c.id
      ;
    replace:
      COLUMN_A:
        MATCH A1: REPLACE A1
        # You may want to quote strings with spaces and punctuation.
        "MATCH A2": "REPLACE A2"
      COLUMN_B:
        MATCH B1: REPLACE B1
    postprocess: postprocess_function

Important

Each match and replace happens within one column. If you need to find and replace the same patterns across multiple columns, you will need to define them separately for each column.

Warning

If you have altered the data at any point (e.g. uppercase_rows:), you’ll need to match that altered data here.

Note that in this block, we do not use any lists.

Each column is a key, and within each column, each match pattern is a key.

Replacements Can Occur In Any Order#

Because we’re using matches as keys, not list items, replacements can occur in any order.

This has not been a problem for me yet, because I haven’t needed to do cascading replacements that depend on earlier replacements.

This block is intended for casual, cosmetic replacements for reports, like replacing member_level1 with Bronze.

If your needs are more complex, you may want to write a postprocess function.

That said, if I get feedback that a strict order of replacements is desirable, I’m open to adding the extra code to allow processing this block as a list (or list of lists).

`postprocess:`#

Optional. After you have run the query in sql: and applied any other alterations (currently just replace:), run the data through this custom function you have defined.

Important

This function must be defined in one of the .py files imported with import:

See also

For full details and examples on writing a postprocess function, see: Postprocess with Custom Code

Configuration Options#

Overview#

Does Order Matter?#

output:#

dir:#

basename:#

export_tables:#

export_queries:#

styles:#

column_width:#

TODO#

input:#

strip:#

slugify_columns:#

lowercase_columns:#

uppercase_rows:#

include_index:#

tables_config:#

Each Table Needs a Unique Name#

Each Table Is a List of Sources#

Multiple Sources Get Merged#

Source Options#

Table Source Order of Operations#

path:#

Valid file formats for source paths:#

sheet:#

pivot:#

How Does Pivoting Work?#

pivot: and include_index:#

A Pivot Only Affects This Source#

include_index:#

datetime:#

import:#

path:#

queries:#

Query Order of Operations#

name:#

sql:#

replace:#

Replacements Can Occur In Any Order#

postprocess:#

`output:`#

`dir`:#

`basename`:#

`export_tables`:#

`export_queries`:#

`styles`:#

`column_width`:#

`input:`#

`strip:`#

`slugify_columns:`#

`lowercase_columns:`#

`uppercase_rows:`#

`include_index:`#

`tables_config:`#

`path:`#

`sheet:`#

`pivot:`#

`pivot:` and `include_index:`#

`include_index:`#

`datetime:`#

`import:`#

`path:`#

`queries:`#

`name:`#

`sql:`#

`replace:`#

`postprocess:`#