This archive is retained to ensure existing URLs remain functional. It will not contain any emails sent to this mailing list after July 1, 2024. For all messages, including those sent before and after this date, please visit the new location of the archive at https://mailman.ripe.net/archives/list/db-wg@ripe.net/

[db-wg] Proposal to allow non-ASCII characters in "org-name:", "person:" and "role:" attributes


Edward Shryane eshryane at ripe.net
Tue May 28 18:13:32 CEST 2024

Dear colleagues,

It was pointed out that the ARIN example: whois -h whois.arin.net POC SHRYA12-ARIN

is not correct, and should read: whois -h whois.arin.net "p SHRYA12-ARIN"

(I used "POC" instead of "p". Depending on your whois client, that could either cause "POC" to be additionally returned or return no objects at all.)

Apologies,

Ed Shryane
RIPE NCC
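
For readers who want to script the corrected query rather than use a whois client, the following is a minimal Python sketch of a port 43 lookup. It assumes the server accepts one CRLF-terminated query line and closes the connection after replying, as described in RFC 3912. The helper names `build_request` and `whois_query` are hypothetical, and decoding ARIN's reply as UTF-8 follows the observation quoted below, not any documented guarantee.

```python
import socket


def build_request(query: str) -> bytes:
    """Encode a single whois query line, terminated by CRLF."""
    # The whois protocol has no way to declare a query character set,
    # so this sketch restricts queries to ASCII (an assumption).
    return query.encode("ascii") + b"\r\n"


def whois_query(host: str, query: str, charset: str = "utf-8",
                timeout: float = 10.0) -> str:
    """Send one query to port 43 and decode the whole reply (hypothetical helper)."""
    with socket.create_connection((host, 43), timeout=timeout) as sock:
        sock.sendall(build_request(query))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    # errors="replace" keeps the sketch from raising on unexpected bytes.
    return b"".join(chunks).decode(charset, errors="replace")


# Rough equivalent of: whois -h whois.arin.net "p SHRYA12-ARIN"
#   print(whois_query("whois.arin.net", "p SHRYA12-ARIN"))
```

For a server that returns Latin-1 on port 43, pass `charset="latin-1"` instead.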


> On 28 May 2024, at 11:27, Edward Shryane <eshryane at ripe.net> wrote:
> 
> Dear colleagues,
> 
> There was a question about UTF-8 support by major Whois providers during last week's DB-WG session at RIPE88.
> 
> During the UTF-8 discussion in December I checked the other RIRs as follows:
> 
> LACNIC: only Latin-1 encoded characters are accepted in updates (UTF-8 is ignored) but UTF-8 is returned on port 43.
> Example: whois -h whois.lacnic.net PAP12
> APNIC: only Latin-1 is returned
> Example: whois -h testwhois.apnic.net YYYYMMDD-MNT
> 
> Subsequently I tested the other RIRs to be sure:
> 
> ARIN: UTF-8 is supported in the RPSL object and UTF-8 is returned on port 43.
> Example: whois -h whois.arin.net POC SHRYA12-ARIN
> AFRINIC: UTF-8 characters are accepted in updates and UTF-8 is returned on port 43.
> Example: whois -h whois.afrinic.net SHRYANE-MNT
> 
> RIPE stores Latin-1 and returns Latin-1 on port 43.
> 
> So in summary, 3 RIRs return UTF-8 and 2 RIRs return Latin-1 on port 43.
> 
> Regards
> Ed Shryane
> RIPE NCC
> 
> 
> 
>> On 2 May 2024, at 16:02, Edward Shryane <eshryane at ripe.net> wrote:
>> 
>> Dear colleagues,
>> 
>> To follow up on the UTF-8 discussion in January, the DB team plans to implement support for UTF-8 in 3 phases:
>> 
>> (1) Add a flag to allow a client to choose a character set
>> 
>> In the Whois release 1.112, we have added the "-Z / --charset" query flag to allow clients to specify which character set they expect. The server response will encode RPSL objects using that character set.
>> 
>> This new flag can already be tested in the RC environment, e.g. the SHRYANE-MNT object contains "remarks:" attributes with non-ASCII (but still latin-1) characters:
>> 
>>   $ whois -h whois-rc.ripe.net -r shryane-mnt
>>   $ whois -h whois-rc.ripe.net -r -Z utf8 shryane-mnt
>> 
>> This flag has no impact on the default behaviour of the RIPE database. This change only affects port 43, and the default character set remains latin-1.
>> 
>> This flag is already useful, for example, to capture responses as UTF-8 to a file or to use UTF-8 encoding in your terminal. In future, if the default on port 43 changes to UTF-8, clients can keep latin-1 by using "-Z/--charset latin1".
>> 
>> (2) Convert the database schema to UTF-8
>> 
>> In the following Whois release, the DB team plans to switch the RIPE database schema character set from latin-1 to UTF-8. This will allow Whois to store UTF-8 strings in the database index tables.
>> 
>> Switching the database schema character set will involve about 1 hour of downtime for Whois updates; Whois queries will not be affected. We will announce this change in advance.
>> 
>> This change will have no impact on the default behaviour of the RIPE database. All interfaces will behave as before, and RPSL objects will remain latin-1 encoded internally.
>> 
>> (3) Allow UTF-8 to be used in RPSL objects
>> 
>> Once the RIPE database schema supports the UTF-8 character set, the DB team will create a further Whois release that will allow UTF-8 to be used in RPSL objects, in addition to the index tables.
>> 
>> The default behaviour of the RIPE database will remain the same. All interfaces will behave as before, but RPSL objects will use UTF-8 internally.
>> 
>> In future, if the DB-WG decides to allow UTF-8 characters in RPSL, the database will already support it.
>> 
>> Regards
>> Ed Shryane
>> RIPE NCC
>> 
>> 
>>> On 18 Jan 2024, at 10:34, Edward Shryane <eshryane at ripe.net> wrote:
>>> 
>>> Dear colleagues,
>>> 
>>> Based on the discussion regarding UTF-8 in the RIPE database during the interim meeting yesterday, I suggest that we implement support for UTF-8 in the database (i.e. convert the schema and add a flag to allow a client to choose a character set), but we do not allow additional characters for now, pending further DB-WG discussion. Our intention is to lay the groundwork for future support, without breaking existing functionality. If you have any concerns or objections please let me know.
>>> 
>>> We will now prepare an implementation plan / impact analysis of these changes.
>>> 
>>> Regards
>>> Ed Shryane
>>> RIPE NCC
>>> 
>>> 
>>>> On 24 Nov 2023, at 10:03, Edward Shryane via db-wg <db-wg at ripe.net> wrote:
>>>> 
>>>> Dear colleagues,
>>>> 
>>>> Currently the RIPE database only allows a subset of ASCII characters in the "org-name:", "person:" and "role:" attributes, for a few reasons including:
>>>> 
>>>> * These attributes are also a look-up key and the Whois protocol does not allow specifying character sets in queries.
>>>> * RPSL names are ASCII according to RFC 2622.
>>>> * Using a normalised name makes the object easier to query.
>>>> * A normalised name is easier to read and interpret.
>>>> 
>>>> However there are some drawbacks to forcing names to only use a subset of ASCII characters:
>>>> 
>>>> * Organisations, roles and persons cannot use their actual name if it includes characters outside this subset.
>>>> * Normalisation is not standard, but is an interpretation done by each maintainer, e.g. characters could be excluded or converted in different ways.
>>>> 
>>>> Since we support the Latin-1 character set in the RIPE database, I propose we also allow non-ASCII Latin-1 characters in these attributes.
>>>> 
>>>> Querying for a name can be done either using the latin-1 characters (proposed) or a normalised, ASCII representation (currently). The normalised version will be generated by Whois and stored in a database index for querying. The primary key will also be generated from the normalised version.
>>>> 
>>>> Please let me know your feedback.
>>>> 
>>>> Regards
>>>> Ed Shryane
>>>> RIPE NCC
>>>> 
>>>> ---
>>>> 
>>>> Whois attribute verbose description (copied from the help text).
>>>> 
>>>> org-name
>>>> --------
>>>> Specifies the name of the organisation that this organisation object
>>>> represents in the RIPE Database. This is an ASCII-only text attribute.
>>>> The restriction is because this attribute is a look-up key and the
>>>> whois protocol does not allow specifying character sets in queries.
>>>> The user can put the name of the organisation in non-ASCII character
>>>> sets in the "descr:" attribute if required.
>>>> 
>>>> A list of 1 to 30 words separated by white space. 
>>>> A word is made up of ASCII alphanumeric characters and additionally: ][)(._"*@,&:!'`+/-
>>>> A word may have up to 64 characters and is not case sensitive. 
>>>> Each word can have any combination of the above characters with no restriction on the start or end of a word.
>>>> 
>>>> person
>>>> ------
>>>> Specifies the full name of an administrative, technical or zone
>>>> contact person for other objects in the database.
>>>> 
>>>> It should contain 2 to 10 words.
>>>> A word is made up of ASCII alphanumeric characters and additionally: .`'_-
>>>> The first word should begin with a letter.
>>>> At least one other word should also begin with a letter.
>>>> Max 64 characters can be used in each word.
>>>> 
>>>> role
>>>> ----
>>>> Specifies the full name of a role entity, e.g. RIPE DBM.
>>>> 
>>>> A list of 1 to 30 words separated by white space.
>>>> A word is made up of ASCII alphanumeric characters and additionally: ][)(._"*@,&:!'`+/-
>>>> A word may have up to 64 characters and is not case sensitive. 
>>>> Each word can have any combination of the above characters with no restriction on the start or end of a word.
>>>> 
>>>> 
>>> 
>> 
>
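
The "person:" rules in the quoted help text, and the idea of deriving a normalised ASCII name from a Latin-1 one, can be sketched in Python as follows. This is an illustration under stated assumptions, not the Whois implementation: the thread does not say how Whois will normalise names, so the NFKD-based approach here is a stand-in, and the function names are hypothetical. The help text says "should" for the letter rules; the sketch treats them as hard requirements.

```python
import re
import string
import unicodedata

# A "person:" word per the help text: ASCII letters and digits
# plus . ` ' _ - and at most 64 characters.
PERSON_WORD = re.compile(r"[A-Za-z0-9.`'_-]{1,64}")


def is_valid_person_name(name: str) -> bool:
    """Check a name against the quoted "person:" rules (hypothetical helper)."""
    words = name.split()
    if not 2 <= len(words) <= 10:
        return False
    if not all(PERSON_WORD.fullmatch(w) for w in words):
        return False
    # The first word must begin with a letter ...
    if words[0][0] not in string.ascii_letters:
        return False
    # ... and so must at least one other word.
    return any(w[0] in string.ascii_letters for w in words[1:])


def normalise_to_ascii(name: str) -> str:
    """Assumed normalisation: decompose accents, then drop non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ord(ch) < 128)


print(is_valid_person_name("José Müller"))                      # False
print(normalise_to_ascii("José Müller"))                        # Jose Muller
print(is_valid_person_name(normalise_to_ascii("José Müller")))  # True
```

Characters without a decomposition are simply dropped by this approach ("Straße" becomes "Strae"), which illustrates the point made in the proposal that normalisation is not standard and different maintainers may convert names differently.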
