Configuration Options#
This page explains each configuration option in detail.
For documentation on actual functions, see the Reference.
Overview#
Remember, the basic steps for your configuration file are:
output:Set up your output options.input:(Optional) Set any input options, for changes you want to make to data after you input it, but before you run your queries.tables_config:Define your tables you want to query.import:(Optional) If you want to apply a custom Python function to any of your queries, import your code file(s).queries:Define your queries.
Does Order Matter?#
Order does not matter for keys, including top-level keys like output:. You can put the output: block at the top, at the end, wherever you want.
Within a block, you can order the keys however you want. E.g., each table in tables_config: is a key, so it doesn’t matter which table you define first.
But order does matter for list items. E.g., within import:, it does matter which order you import sources. The order is up to you, and for most reports, the order probably won’t affect the outcome. You just need to be aware that list items are necessarily processed in the order you write them, while keys aren’t.
output:#
REQUIRED.
Configures your final report.
output:
dir: output
basename: BASENAME
export_tables: csv
export_queries: csv
styles:
column_width: 15
Note
Defined in: yarm.validate.validate_key_output()
dir:#
REQUIRED. The output directory for your report and other exports.
If the directory does not exist, it will be created.
Important
This path is relative to your config file. See Are You in the Right Directory?
Tip
When you rerun a report, you will usually need to overwrite the output from your last report.
If you don’t want to manually confirm each overwrite, add -f or --force.
basename:#
REQUIRED. The basename for your output file(s).
The way this basename is used depends on your export choices. See export_tables: and export_queries: below.
export_tables:#
Optional. Output format for your tables.
If omitted, no tables are exported.
csvExport each table as a separate CSV file, named with the table’s
name:key.xlsxExport each table as a separate sheet in a single spreadsheet named
tables.xlsx.
All files are exported to the output directory set in dir: above.
export_queries:#
Optional. Output format for your queries.
If omitted, defaults to xlsx.
csvExport each query as a separate CSV file.
xlsxExport each query as a separate sheet in a single spreadsheet named with basename: (see above) and extension
xlsx.
styles:#
Optional. Options for formatting your output.
column_width:#
Optional. Set the width for columns in your output spreadsheet.
This should be an integer.
If omitted, this will be an internal default.
TODO#
Add more output styling options?
input:#
Optional. Options for modifying your input tables before you run the queries.
input:
strip: false
slugify_columns: false
lowercase_columns: false
uppercase_rows: false
include_index: false
Note
Defined in: yarm.validate.validate_key_input()
Important
In general, if you omit any of these options, your data will be unchanged. Yarm only alters your data if you tell it to.
Since you can omit this entire block, all options in this section are optional.
strip:#
If omitted, defaults to false.
trueStrip all whitespace from beginning and end of each data cell.
falseData is unchanged.
slugify_columns:#
If omitted, defaults to false.
true“Slugify” all column names. This replaces all punctuation and spaces with an underscore. Rows stay unchanged.
falseData is unchanged.
See also
This option does not change case. To also lowercase your columns, see lowercase_columns:
Attention
If you alter your column names now, be sure to use these altered names later, not only in your queries: but also if you reference any columns in tables_config:, i.e. with datetime: or pivot:.
lowercase_columns:#
If omitted, defaults to false.
trueLowercase all column names. Rows stay unchanged.
falseData is unchanged.
See also
You may also want to use slugify_columns:
Attention
If you alter your column names now, be sure to use these altered names later, not only in your queries: but also if you reference any columns in tables_config:, i.e. with datetime: or pivot:.
uppercase_rows:#
If omitted, defaults to false.
trueUPPERCASE the data in all rows. Column names stay unchanged.
falseData is unchanged.
Note
Uppercasing makes your data more consistent, which can be useful for data like addresses.
include_index:#
If omitted, defaults to false, except for tables which use pivot:. Pivot tables default to true (see below.)
trueInclude a separate column for the index in the final output.
falseSilently omit the index column.
Important
You can override this value for each particular table in tables_config:.
When each table is created in the database, an index column is also created. Usually, this index is added automatically, so it’s confusing and unnecessary to see it in the final output.
Pivot tables are different, because in order to pivot a table, we manually set one of the original columns as the index. In this case, we default to true, so that this original column is kept in our final output.
tables_config:#
REQUIRED. Define one or more tables of source data.
tables_config:
TABLE_NAME_A:
- path: SOURCE_A1.csv
include_index: false
- path: SOURCE_A2.csv
TABLE_NAME_B:
- path: SOURCE_B.xlsx
sheet: B.1
pivot:
index: ID_COLUMN
columns: KEY_COLUMN
values: VALUE_COLUMN
datetime:
# You can supply a custom format string.
COLUMN_1: "%Y-%m"
# To use default datetime format, omit format string.
COLUMN_2:
# Spaces or punctuation in the column name? Add quotes.
"COLUMN 3":
Note
Defined in: yarm.validate.validate_key_tables_config()
Each Table Needs a Unique Name#
In this example, we define two tables, TABLE_NAME_A and TABLE_NAME_B. (Your tables will probably have slightly more normal names, like Customers.)
Each table name must be unique, distinct not only from other tables but also from your queries (see query:name:).
Avoid spaces and punctuation. You’ll be referencing these names in your queries, and if you don’t use simple table names, you’ll need to quote them in the query, which is a needless hassle.
Each Table Is a List of Sources#
Underneath each table, you have a list of one or more sources.
Each source is one table of data. Such as:
A CSV file
One sheet in a spreadsheet
Most tables probably only need one source. You take a CSV file or a sheet in a spreadsheet, and you turn it into a table. Simple.
Multiple Sources Get Merged#
But sometimes you want to combine two or more CSVs or sheets into a single table. You can do this, as long as all the sources share at least one column in common.
Behind the scenes, each source gets merged as an outer join (see yarm.tables.input_source()), and this requires a shared column.
In the example config above, both SOURCE_B1.csv and SOURCE_B2.csv get merged into a single table, TABLE_NAME_B.
Source Options#
Each source must have, at minimum, a path:. But you have other options as well.
Remember, each source is a separate list item beneath its table.
Table Source Order of Operations#
No matter what order you place these keys within each table source, the operations run in this order:
The file in path: (and, for a spreadsheet, the sheet: for this path), is read into the database.
If pivot: is defined, the table is pivoted. (This will usually create many new columns.)
If datetime: is defined, those columns are formatted.
See also
Table operations order is set in: yarm.tables.df_tables_config_options()
path:#
REQUIRED. Path to source file.
Important
This path is relative to your config file. See Are You in the Right Directory?
Valid file formats for source paths:#
.csv.xlsx
sheet:#
Optional. Name or number of sheet in spreadsheet.
Only relevant for spreadsheets.
If omitted, the first sheet will be used.
Important
Normally, you’ll probably use the name of the spreadsheet.
If you opt to use the number, note that numbering starts at 0.
The first sheet is 0, the second is 1, etc.
pivot:#
Optional. Pivot a table.
tables_config:
TABLE_NAME_A:
- path: SOURCE_A.xlsx
pivot:
index: POST_ID
columns: META_KEY
values: META_VALUE
Attention
If you altered your original column names in input:, e.g. with slugify_columns: or lowercase_columns:, then you must use those altered column names now. If you try to use the original names, you’ll get an error.
Remember, the order for these blocks does not matter. Even if your input: block is at the end of your file, it will still be processed first.
How Does Pivoting Work?#
Pivoting is easiest to understand with an example.
In WordPress, there’s a table called postmeta, used for storing extra meta information about the posts. In theory, the designers could have tried to predict every kind of information that anyone would ever want to attach to their posts… plus every third-party plugin that would ever be written… and then made sure to add enough columns to cover every possible use case.
Right.
Instead, postmeta is extremely flexible. Each row has only four columns:
META_IDPOST_IDMETA_KEYMETA_VALUE
A few random rows from this table might look like this:
META_ID |
POST_ID |
META_KEY |
META_VALUE |
|---|---|---|---|
1000 |
42 |
export_format |
|
1001 |
42 |
shareability |
97 |
1002 |
42 |
oxford_commas |
18 |
1003 |
43 |
export_format |
docx |
1004 |
43 |
puppy_mentions |
0 |
1005 |
43 |
shareability |
31 |
1006 |
44 |
export_format |
calligraphy |
1007 |
44 |
unironic |
1 |
As you can see, each row is bound to a particular post by POST_ID, and META_KEY is basically a column name.
But trying to query this table would be a pain. It would be much easier if the data looked like this:
POST_ID |
export_format |
shareability |
oxford_commas |
puppy_mentions |
unironic |
|---|---|---|---|---|---|
42 |
97 |
18 |
NULL |
NULL |
|
43 |
docx |
31 |
NULL |
0 |
NULL |
44 |
calligraphy |
NULL |
NULL |
NULL |
1 |
Which is what this bit of pivot: config will do:
tables_config:
TABLE_NAME_A:
- path: SOURCE_A.xlsx
pivot:
index: POST_ID
columns: META_KEY
values: META_VALUE
pivot: and include_index:#
When a source is pivoted, the ref:include_index: <table-include-index> for the entire table
is automatically set to true. This is because the index column was present in your original
data, so you probably expect to see it in the output.
If you don’t want to see that index column, add include_index: false to this source.
It’s not enough to set the overall include_index to false up in
input:, because that already defaults to false,
except for pivot tables.
A Pivot Only Affects This Source#
If your table includes multiple sources, only those sources with a pivot: block will be pivoted.
But you need to ensure that the non-pivoted source and the pivoted source (after its pivot) share at least one column, so they can merge. See Multiple Sources Get Merged.
include_index:#
Optional. Override the overall value of include_index: for this table.
trueFor this table, include a separate column for the index in the final output.
falseFor this table, Silently omit the index column.
Important
Unlike other keys in a source, this key affects the entire table. So you can only set it
once in any particular table. It doesn’t matter which source you choose.
If you try to set include_index: on more than one source in a table, you’ll get an error.
datetime:#
Optional. One or more columns that should be converted to datetime format.
Each column should be added as a key.
If you omit any value for the key, then the column will be converted to the default datetime format.
If you include a format string as the value, the column will formatted with that string. (See Formatting Codes.)
tables_config:
TABLE_NAME_A:
- path: SOURCE_A.xlsx
datetime:
# You can supply a custom format string.
COLUMN_1: "%Y-%m"
# To use default datetime format, omit format string.
COLUMN_2:
# Spaces or punctuation in the column name? Add quotes.
"COLUMN 3":
Important
Because these are keys in the same block, each column name must be unique.
Attention
If you altered your original column names in input:, e.g. with slugify_columns: or lowercase_columns:, then you must use those altered column names now. If you try to use the original names, you’ll get an error.
Remember, the order for these blocks does not matter. Even if your input: block is at the end of your file, it will still be processed first.
Warning
Similarly, if you pivot: the table, you will get a whole new set of columns than the original data. You need to use those new** column names here, not the column names in the original data.
import:#
Optional, but you need it if you set a postprocess: function for a query.
import:
- path: MODULE_A.py
- path: MODULE_B.py
Note
Defined in: yarm.validate.validate_key_import()
Warning
If more than one module in this list defines the same function, the later module in the list will silently override the previous definition.
This may be desired behavior, but only if you expect it.
path:#
REQUIRED. Path to Python module (.py file) where you have defined your postprocess: function(s).
Important
This path is relative to your config file. See Are You in the Right Directory?
Note
You can define more than one function in a file, so you normally will only need one path:.
queries:#
REQUIRED, unless you set output: export_csv. (You need to output something.)
queries:
- name: QUERY A
sql: SELECT * FROM table_from_spreadsheet AS s;
- name: QUERY B
# For the SQL, you can use a multiline string for readability.
sql: >
SELECT
*
FROM
table_from_spreadsheet AS s
JOIN
table_from_csv AS c
ON
s.id = c.id
;
replace:
COLUMN_A:
MATCH A1: REPLACE A1
# You may want to quote strings with spaces and punctuation.
"MATCH A2": "REPLACE A2"
COLUMN_B:
MATCH B1: REPLACE B1
postprocess: postprocess_function
Note
Defined in: yarm.validate.validate_key_queries()
Warning
Unlike tables in tables_config:, each query is a list item. See Each Query is a List Item. The order in which you list the queries matters, at least if you want a later query to refer to an earlier query. The order of keys within each query does not matter, because the keys are always processed in the same order. See When Does Order Matter?
In this block, you define one or more queries.
Each query will be output to either a single CSV file or a sheet in a spreadsheet.
Within each query, some keys are REQUIRED, while others are Optional.
Query Order of Operations#
No matter what order you place these keys, the operations run in this order:
The statement in sql: is run, generating the query result.
If defined, the replace: items are processed.
Last of all, if it’s defined, the data is run through the postprocess: function.
See also
Query operations order is set in: yarm.queries.query_options()
name:#
REQUIRED. Name for this query.
The name is used:
As the name of the database table that holds this query.
When output to a spreadsheet: as the name of the sheet that contains this query.
When output to CSV: as the filename for this CSV.
Important
Use a unique name for each query. Do not use a name that you’ve already used for
another query, or even a table in tables_config: Conflicting names will cause problems
for the internal database. Case does not matter here, so QUERY A and query A
would conflict.
sql:#
REQUIRED. The SQL statement for this query.
For readability, this can be a multiline string. See Watch Out for the Multiline String in sql:.
Warning
If you have altered your data in any earlier options, such as input: or tables_config:, make sure you query this altered data.
For instance, if you used pivot:, you need to SELECT from the new columns you’ve created, not the columns in the original data.
By contrast, all other options in this block run after this SQL statement. If you use replace: or postprocess:, they can only operate on
replace:#
Optional. After you have run the query in sql:, you can find and replace data within particular columns.
queries:
- name: QUERY A
sql: SELECT * FROM table_from_spreadsheet AS s;
- name: QUERY B
# For the SQL, you can use a multiline string for readability.
sql: >
SELECT
*
FROM
table_from_spreadsheet AS s
JOIN
table_from_csv AS c
ON
s.id = c.id
;
replace:
COLUMN_A:
MATCH A1: REPLACE A1
# You may want to quote strings with spaces and punctuation.
"MATCH A2": "REPLACE A2"
COLUMN_B:
MATCH B1: REPLACE B1
postprocess: postprocess_function
Important
Each match and replace happens within one column. If you need to find and replace the same patterns across multiple columns, you will need to define them separately for each column.
Warning
If you have altered the data at any point (e.g. uppercase_rows:), you’ll need to match that altered data here.
Note that in this block, we do not use any lists.
Each column is a key, and within each column, each match pattern is a key.
Replacements Can Occur In Any Order#
Because we’re using matches as keys, not list items, replacements can occur in any order.
This has not been a problem for me yet, because I haven’t needed to do cascading replacements that depend on earlier replacements.
This block is intended for casual, cosmetic replacements for reports, like replacing member_level1 with Bronze.
If your needs are more complex, you may want to write a postprocess function.
That said, if I get feedback that a strict order of replacements is desirable, I’m open to adding the extra code to allow processing this block as a list (or list of lists).
postprocess:#
Optional. After you have run the query in sql: and applied any other alterations (currently just replace:), run the data through this custom function you have defined.
Important
This function must be defined in one of the .py files imported with import:
See also
For full details and examples on writing a postprocess function, see: Postprocess with Custom Code