spdxconv
spdxconv is a program to convert existing licenses and copyrights into SPDX identifiers or insert new ones.
This program works in tandem with REUSE software.
Features:
- REUSE Integration: Detects annotations from REUSE.toml.
- Customizable Defaults: Set default license identifiers and copyright holders.
- Smart Comments: Customizable patterns to set comment syntax based on file names.
- Regex Extraction: Capture existing licenses, years, authors, and contact info using regex.
- Git Integration: Automatically derives the copyright year from the first commit in git history.
Background
Converting the license and copyright in a project to become compliant with SPDX headers is very tedious work, especially if you have many files with different years, copyrights, and licenses.
This program helps to do that by using pattern matching, search, replace, and deletion.
Prerequisites
The following program is needed to build and install the tool:
- Go tools (latest version recommended)
Installation
The following command will build and install the program into your $GOBIN
directory:
$ go install git.sr.ht/~shulhan/spdxconv/cmd/spdxconv@latest
To check the value of $GOBIN, run:
$ go env GOBIN
Usage
Converting to SPDX is a trial-and-error task.
This program does not guarantee that the conversion will succeed in one
cycle.
To help with this, we provide three commands: init, scan, and apply.
The init command creates the spdxconv.cfg configuration in the current
directory.
This configuration file teaches the program how to scan and apply the
license and copyright.
The scan command lists the files that need to be converted or inserted
with SPDX identifiers into a file named spdxconv.report.
Users can then inspect and modify the report to choose which files should
be processed.
The apply command reads spdxconv.report and applies the license and
copyright as stated.
Users can repeat the cycle of editing spdxconv.cfg, running scan, and
running apply until they are satisfied with the result.
The init command
The first thing to do is to generate the configuration file using:
$ spdxconv init
This creates the spdxconv.cfg file in the current directory with the
following content (subject to change in the future):
[default]
license_identifier =
copyright_year =
file_copyright_text =
max_line_match = 10
[match-file-comment]
pattern = "^.*\\.(adoc|asciidoc|c|cc|cpp|cs|dart|go|h|hh|hpp|java|js|jsx)$"
pattern = "^.*\\.(jsonc|kt|kts|php|rs|sass|scss|swift|ts|tsx)$"
pattern = "^(.*/)?(go.mod|go.work)$"
prefix = "//"
[match-file-comment]
pattern = "^.*\\.(aff|aww|bash|csh|d2|dockerfile|env|gitignore|gitmodules|hcl|ipynb)$"
pattern = "^.*\\.(make|pl|pm|py|ps1|rb|sh|tf|toml|yaml|yml|zsh)$"
pattern = "^(.*/)?([Dd]ockerfile|[Mm]akefile|robots.txt)$"
# systemd.unit(5).
pattern = "^.*\\.(automount|device|mount|path|scope|service|slice|socket|swap|target|timer)$"
prefix = "#"
[match-file-comment]
pattern = "^.*\\.(css)$"
prefix = "/*"
suffix = "*/"
[match-file-comment]
pattern = "^.*\\.(fxml|gohtml|htm|html|html5|kml|markdown|md|xml)$"
prefix = "<!--"
suffix = "-->"
[match-file-comment]
pattern = "^.*\\.(lua|sql)$"
prefix = "--"
[match-file-comment]
pattern = "^.*\\.(rst)$"
prefix = ".."
[match-file-comment]
pattern = "^.*\\.(tex)$"
prefix = "%"
# File name that match with this pattern will have the ".license" file
# created.
[match-file-comment]
pattern = "^.*\\.(apk|app|bz2|exe|gz|tar|tgz|zip)$"
pattern = "^.*\\.(csv|doc|docx|json|pdf|ppt|pptx|xls|xlsx)$"
pattern = "^.*\\.(bmp|gif|ico|jpeg|jpg|png|svg|svgz|webp)$"
pattern = "^.*\\.(3gp|avi|flv|mkv|mp3|mp4|mpeg|mpg|mpg4)$"
pattern = "^.*\\.(acc|ogg|mp3)$"
pattern = "^(.*/)?(go.sum|go.work.sum)$"
[match-license]
pattern = "^(//+|#+|/\\*+|<!--+|--+)?\\s*(.*)governed by a BSD-style(.*)$"
license_identifier = BSD-3-Clause
delete_line_before = "^(//+|#+|/\\*+|<!--+|--+)$"
delete_line_after = "^(//+|#+|/\\*+|<!--+|--+)?\\s*license that can(.*)$"
delete_line_after = "^(//+|#+|\\*+/|--+>|--+)$"
[match-copyright]
pattern = "^(//+|#+|/\\*+|<!--+|--+)?\\s*Copyright\\s+(?<year>\\d{4}),?\\s+(?<author>.*)\\s+<(?<contact>.*)>.*$"
delete_line_before = "^(//+|#+|/\\*+|<!--+|--+)$"
delete_line_after = "^(//+|#+|\\*+/|--+>|--+)$"
The configuration uses the ini file format.
You must fill in the [default] section before running other commands.
You can add match-file-comment, match-license, and match-copyright
sections as required, or modify the existing ones to match your use case.
For quick reference, here are several rules that you need to be aware of:
- The regex value must be enclosed in double quotes.
- The backslash '\' character must be escaped. For example, a regex for space "\s" must be written as "\\s".
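The two rules above can be illustrated with a short Go sketch. It assumes the configuration parser undoes the escaping in the same way Go does for a double-quoted string literal, so strconv.Unquote is used here only as a stand-in for the real parser; compileQuoted is a hypothetical helper, not part of spdxconv.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// compileQuoted takes a value exactly as written in spdxconv.cfg, including
// the double quotes, undoes the backslash escaping, and compiles the result.
// Assumption: strconv.Unquote approximates what the real parser does.
func compileQuoted(raw string) (*regexp.Regexp, error) {
	pattern, err := strconv.Unquote(raw)
	if err != nil {
		return nil, fmt.Errorf("unquote %s: %w", raw, err)
	}
	return regexp.Compile(pattern)
}

func main() {
	// Written in the configuration as "^.*\\.(lua|sql)$".
	re, err := compileQuoted(`"^.*\\.(lua|sql)$"`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("regex after unescaping:", re.String())
	fmt.Println("matches schema.sql:", re.MatchString("schema.sql"))

	// A single backslash is rejected in this sketch.
	_, err = compileQuoted(`"\s"`)
	fmt.Println("single backslash accepted:", err == nil)
}
```

In this sketch a value with an unescaped backslash fails early, which is one reason to double every backslash when editing the file by hand.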
The next subsections explain the content of the configuration file and how
it affects the program during scan and apply.
The default section
This section defines the default license identifier, year, and copyright
text to be inserted into a file if no match-license or match-copyright
pattern matches.
The license_identifier sets the default license using one of SPDX license
identifiers from https://spdx.org/licenses/ .
For example, GPL-3.0-only.
The copyright_year sets the default year to be used in
SPDX-FileCopyrightText.
The year can be a single year (for example, "2026"), a range of years (for
example, "2000-2026"), or a list of years separated by commas (for example,
"2000,2001,2026"), as long as there are no spaces in between.
The file_copyright_text sets the default author and contact in
SPDX-FileCopyrightText.
For example, "John Doe <john.doe@example>".
You should fill in license_identifier, copyright_year, and
file_copyright_text before running any other command.
The max_line_match defines the number of lines to be searched at the
top and bottom of the file for SPDX-* identifiers, match-license
patterns, and match-copyright patterns before the program inserts the
default values.
The default value is 10.
The match-file-comment section
The first thing that the program does is detect which comment prefix and suffix to use when inserting SPDX identifiers.
For each pattern in the "match-file-comment" section, the program will match
it against the file name to get the comment prefix and suffix.
Users can add their own "match-file-comment" sections as they like or modify the existing ones.
A "match-file-comment" section can have an empty prefix and suffix. In that case, if the file name matches, the program creates a new file with a ".license" suffix containing the SPDX identifiers instead of inserting them into the file directly.
If the file name does not match any of the "match-file-comment" patterns, the file will be flagged as "unknown".
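The lookup described in this section can be sketched in Go as follows. The commentRule type, the lookup and exampleRules functions, and the action names "insert", "license-file", and "unknown" are hypothetical and chosen only for illustration; the patterns are copied from the generated configuration.

```go
package main

import (
	"fmt"
	"regexp"
)

// commentRule mirrors one match-file-comment section (hypothetical type).
type commentRule struct {
	patterns []*regexp.Regexp
	prefix   string
	suffix   string
}

// exampleRules returns a small subset of the generated configuration.
func exampleRules() []commentRule {
	return []commentRule{
		{[]*regexp.Regexp{regexp.MustCompile(`^(.*/)?(go.mod|go.work)$`)}, "//", ""},
		{[]*regexp.Regexp{regexp.MustCompile(`^.*\.(css)$`)}, "/*", "*/"},
		{[]*regexp.Regexp{regexp.MustCompile(`^.*\.(lua|sql)$`)}, "--", ""},
		// Empty prefix and suffix: a separate ".license" file is created.
		{[]*regexp.Regexp{regexp.MustCompile(`^.*\.(bmp|gif|ico|jpeg|jpg|png|svg|svgz|webp)$`)}, "", ""},
	}
}

// lookup returns the first rule with a pattern matching path, together with
// what would happen to the file.
func lookup(rules []commentRule, path string) (commentRule, string) {
	for _, r := range rules {
		for _, re := range r.patterns {
			if re.MatchString(path) {
				if r.prefix == "" && r.suffix == "" {
					return r, "license-file"
				}
				return r, "insert"
			}
		}
	}
	return commentRule{}, "unknown"
}

func main() {
	for _, path := range []string{"static/site.css", "img/logo.png", "notes.xyz"} {
		rule, action := lookup(exampleRules(), path)
		fmt.Printf("%s: action=%s prefix=%q suffix=%q\n", path, action, rule.prefix, rule.suffix)
	}
}
```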
The match-license section
After the program detects the file comment syntax to use, it searches for a line that contains "SPDX-License-Identifier:".
If there is a match at the top or bottom of the file, the license scan stops and the program continues with processing the copyright.
If there is no match, it searches for a line that matches the "pattern" regular expression. If a line matches, the value in "match-license::license_identifier" replaces the "default::license_identifier" value.
If "delete_line_before" or "delete_line_after" is defined, the program searches for those patterns before and after the matched line and deletes the lines that match. These options can be defined zero or more times.
The match-copyright section
The match-copyright section defines the pattern to match old copyright text. The regex must contain named groups to capture the copyright year, author, and contact.
If no copyright year is found in the file, the program will derive the year from the date of the first commit in the history of the file using the Source Code Management (SCM). In git SCM, it will run "git log --follow file".
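A minimal Go sketch of deriving the year from git history is shown below. Only "git log --follow" comes from the description above; the "--format=%ad" and "--date=format:%Y" options, the use of the author date, and the helper names oldestYear and firstCommitYear are assumptions made for this example.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// oldestYear returns the last non-empty line of the git log output.
// git log prints the newest commit first, so the last line belongs to the
// first commit that touched the file.
func oldestYear(out string) string {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	return strings.TrimSpace(lines[len(lines)-1])
}

// firstCommitYear asks git for the author year of every commit that touched
// path, following renames, and returns the oldest one.
func firstCommitYear(path string) (string, error) {
	cmd := exec.Command("git", "log", "--follow",
		"--format=%ad", "--date=format:%Y", "--", path)
	out, err := cmd.Output()
	if err != nil {
		return "", fmt.Errorf("git log %s: %w", path, err)
	}
	return oldestYear(string(out)), nil
}

func main() {
	year, err := firstCommitYear("README.md")
	if err != nil || year == "" {
		// No history or not a git repository: fall back to the default year.
		fmt.Println("no year from git, use default copyright_year")
		return
	}
	fmt.Println("first commit year:", year)
}
```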
For example, given the following old copyright text,
// Copyright 2022, John Doe <john.doe@email>. All rights reserved.
we can capture the year, author, and contact using the following regex,
"^//+\\s*Copyright\\s+(?<year>\\d{4}),?\\s+(?<author>.*)\\s+<(?<contact>.*)>.*$"
The match-copyright section can also contain zero or more
delete_line_before and delete_line_after patterns.
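The capture in the example above can be tried out with a short Go sketch. The capture helper is hypothetical. The regex is written with single backslashes because it sits in a Go raw string rather than in spdxconv.cfg, and the named groups use the (?P<name>...) spelling, which Go's regexp package also accepts.

```go
package main

import (
	"fmt"
	"regexp"
)

// copyrightRE is the regex from the example, unescaped for a Go raw string.
var copyrightRE = regexp.MustCompile(
	`^//+\s*Copyright\s+(?P<year>\d{4}),?\s+(?P<author>.*)\s+<(?P<contact>.*)>.*$`)

// capture returns the named groups of re found in line, or nil if the line
// does not match.
func capture(re *regexp.Regexp, line string) map[string]string {
	m := re.FindStringSubmatch(line)
	if m == nil {
		return nil
	}
	groups := map[string]string{}
	for i, name := range re.SubexpNames() {
		if name != "" {
			groups[name] = m[i]
		}
	}
	return groups
}

func main() {
	line := "// Copyright 2022, John Doe <john.doe@email>. All rights reserved."
	g := capture(copyrightRE, line)
	fmt.Printf("year=%q author=%q contact=%q\n", g["year"], g["author"], g["contact"])
}
```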
The scan command
The scan command recursively scans the current directory for files that need to be converted or have SPDX identifiers inserted. The result is stored in a report file named "spdxconv.report". No other files are modified during or after the scan.
Users can inspect and modify the report to change the behaviour of the
apply command.
Deleting a line in the report excludes that file from being processed.
The scan command works in the following way,
(0) Skip the file if it is ignored by git or already annotated in the
REUSE.toml configuration.
(1) Check the file for SPDX-License-Identifier and
SPDX-FileCopyrightText.
If both exist, skip the file.
(2) If SPDX-License-Identifier line does not exist, find the old license
using the match-license sections.
For each match-license in the configuration,
(2.1) If there is a match, record it as "match" and its line number into the report.
(2.2) If no match, use the default license from configuration, record it as "default" with "0" as line number in the report.
(3) If SPDX-FileCopyrightText line does not exist, find the old copyright
text using the match-copyright sections.
For each match-copyright in the configuration,
(3.1) If there is a match, get the year, author, and contact; and record it as "match" and its line number into the report.
If the year is empty, try to get the year from the first commit of the file using the "git log --follow ..." command. If there is no commit history or the project is not using git, use the default copyright year from the configuration.
(3.2) If there is no match, use default copyright year and text from configuration, and record it as "default" in the report.
The spdxconv.report file format
Each line in the report file is formatted as CSV and has several columns separated by commas,
path "," license_id "," idx_license_id "," year "," copyright_id ","
idx_copyright_id "," comment_prefix "," comment_suffix
where each column has the following values,
path = { unicode_char }
license_id = "default" | "exist" | "match"
idx_license_id = 1 * decimal_digit
year = single_year { "," single_year }
| single_year "-" single_year
single_year = 4 * decimal_digit
copyright_id = "default" | "exist" | "match"
idx_copyright_id = 1 * decimal_digit
The path column defines the path to the file.
The license_id column defines the license identifier to be used.
The value is either,
- default - insert a new identifier using the default license_identifier value from the configuration.
- exist - the SPDX-License-Identifier already exists in the file, at the line number set in idx_license_id.
- match - one of the patterns in match-license was found in the file, at the line number set in idx_license_id.
The idx_license_id defines the line number in file where license_id is
"exist" or "match".
Positive value means match found at the top, and negative value means match
found at the bottom.
The year column defines the copyright year for the work.
The value is either,
- YYYY - single year, for example 2026
- YYYY,YYYY,... - list of years, separated by commas
- YYYY-YYYY - range of years, for example 2000-2026
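The three year formats listed above can be checked with a single regular expression. The following Go sketch is only an illustration; validYear is a hypothetical helper and not part of spdxconv.

```go
package main

import (
	"fmt"
	"regexp"
)

// yearPattern accepts YYYY, YYYY-YYYY, or YYYY,YYYY,... without spaces.
var yearPattern = regexp.MustCompile(`^(\d{4}-\d{4}|\d{4}(,\d{4})*)$`)

// validYear reports whether s is written in one of the accepted formats.
func validYear(s string) bool {
	return yearPattern.MatchString(s)
}

func main() {
	for _, s := range []string{"2026", "2000-2026", "2000,2001,2026", "2000, 2026"} {
		fmt.Printf("%q valid=%v\n", s, validYear(s))
	}
}
```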
The copyright_id column defines the author and contact.
The value is either,
- default - insert a new identifier using the default file_copyright_text value from the configuration.
- exist - the SPDX-FileCopyrightText already exists in the file, at the line number set in idx_copyright_id.
- match - one of the patterns in match-copyright was found in the file, at the line number set in idx_copyright_id.
The idx_copyright_id defines the line number in the file where copyright_id
is "exist" or "match".
A positive value means the match was found at the top, and a negative value
means the match was found at the bottom.
The comment_prefix and comment_suffix columns contain the prefix and suffix
used as the comment in the file.
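As an illustration of the eight columns, the Go sketch below parses one report line with the standard encoding/csv package. The reportLine type and the sample lines are made up for this example, and the sketch assumes that a year list containing commas is quoted in the usual CSV way, which this document does not confirm.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// reportLine holds the columns of one report line (hypothetical type).
type reportLine struct {
	Path           string
	LicenseID      string
	IdxLicenseID   int
	Year           string
	CopyrightID    string
	IdxCopyrightID int
	CommentPrefix  string
	CommentSuffix  string
}

// parseReportLine splits a single line into its eight columns.
func parseReportLine(line string) (reportLine, error) {
	fields, err := csv.NewReader(strings.NewReader(line)).Read()
	if err != nil {
		return reportLine{}, err
	}
	if len(fields) != 8 {
		return reportLine{}, fmt.Errorf("want 8 columns, got %d", len(fields))
	}
	idxLicense, err := strconv.Atoi(fields[2])
	if err != nil {
		return reportLine{}, fmt.Errorf("idx_license_id: %w", err)
	}
	idxCopyright, err := strconv.Atoi(fields[5])
	if err != nil {
		return reportLine{}, fmt.Errorf("idx_copyright_id: %w", err)
	}
	return reportLine{
		Path: fields[0], LicenseID: fields[1], IdxLicenseID: idxLicense,
		Year: fields[3], CopyrightID: fields[4], IdxCopyrightID: idxCopyright,
		CommentPrefix: fields[6], CommentSuffix: fields[7],
	}, nil
}

func main() {
	rl, err := parseReportLine("cmd/main.go,match,3,2022,match,1,//,")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", rl)
}
```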
The spdxconv.report file groups
Files are collected into four groups: regular, binary, unknown, and done. Each group starts with a line prefixed with "//spdxconv:" in the report:
//spdxconv:regular
...
//spdxconv:binary
...
//spdxconv:unknown
...
//spdxconv:done
...
Regular group: Files where the program can detect the comment syntax. The program will insert the new SPDX identifiers into the file using that comment syntax.
Binary group: Non-text files, for example images (like jpg, png)
or executable files.
The program will create a separate .license file for each of them.
The new SPDX identifiers will be inserted into that "$name.license" file
as defined in the report.
Unknown group: Files where the program cannot detect the comment syntax.
These files will not be processed; they are listed so users can inspect
them, modify the configuration, and rerun the scan command in the next
cycle.
Done group: Files that already have SPDX identifiers. Files in the regular and binary groups that have been applied will be moved here.
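A report laid out like this can be split into its groups with a few lines of Go. The sketch below is only an illustration: splitGroups is a hypothetical helper, and skipping blank lines and lines before the first marker is an assumption of the sketch.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

const groupMarker = "//spdxconv:"

// splitGroups collects the report lines under the group named by the most
// recent "//spdxconv:" marker line.
func splitGroups(report string) map[string][]string {
	groups := map[string][]string{}
	current := ""
	sc := bufio.NewScanner(strings.NewReader(report))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, groupMarker) {
			current = strings.TrimPrefix(line, groupMarker)
			groups[current] = nil // Register the group even if it stays empty.
			continue
		}
		if current == "" || strings.TrimSpace(line) == "" {
			continue
		}
		groups[current] = append(groups[current], line)
	}
	return groups
}

func main() {
	report := `//spdxconv:regular
cmd/main.go,match,3,2022,match,1,//,
//spdxconv:binary
img/logo.png,default,0,2026,default,0,,
//spdxconv:unknown
//spdxconv:done
`
	for name, lines := range splitGroups(report) {
		fmt.Printf("%s: %d file(s)\n", name, len(lines))
	}
}
```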
The apply command
The apply command reads the spdxconv.report and applies the license and
copyright to the files as stated.
Any failed operations will be logged to stdout.
Once a file from the regular or binary group is successfully processed, it will be moved to the done group.
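To give an idea of what gets written, the Go sketch below builds the two SPDX lines from a comment prefix and suffix and picks the target file. The helper names spdxLines and targetPath, the order of the two lines, and the exact spacing are assumptions of this sketch; the program's actual output may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// spdxLines wraps the two SPDX tags in the given comment prefix and suffix.
// With an empty prefix and suffix the tags are returned bare, as they would
// appear in a ".license" file.
func spdxLines(prefix, suffix, license, year, holder string) []string {
	wrap := func(text string) string {
		return strings.TrimSpace(prefix + " " + text + " " + suffix)
	}
	return []string{
		wrap("SPDX-License-Identifier: " + license),
		wrap("SPDX-FileCopyrightText: " + year + " " + holder),
	}
}

// targetPath returns the file that receives the identifiers: the file itself
// when a comment syntax is known, or a sibling ".license" file otherwise.
func targetPath(path, prefix, suffix string) string {
	if prefix == "" && suffix == "" {
		return path + ".license"
	}
	return path
}

func main() {
	holder := "John Doe <john.doe@email>"
	fmt.Println(targetPath("cmd/main.go", "//", ""))
	for _, l := range spdxLines("//", "", "BSD-3-Clause", "2022", holder) {
		fmt.Println(l)
	}
	fmt.Println(targetPath("img/logo.png", "", ""))
	for _, l := range spdxLines("", "", "BSD-3-Clause", "2022", holder) {
		fmt.Println(l)
	}
}
```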
License
This software is licensed under GPL-3.0-only.
See the file LICENSE for full text.
References
- SPDX License List: Standardized short identifiers.
- REUSE FAQ: Common questions on licensing best practices.