It is fundamentally important to understand the privacy issues if using Pydenticon in order to generate uniquelly identifiable avatars for users leaving the comments etc.
The most common way to expose the identicons is by having a web application generate them on the fly from data that is being passed to it through HTTP GET requests. Those GET requests would commonly include either the raw data, or data as hex string that is then used to generate an identicon. The URLs for GET requests would most commonly be made as part of image tags in an HTML page.
The data passed needs to be unique in order to generate distinct identicons. In most cases the data used will be either name or e-mail address that the visitor posting the comment fills-in in some field. That being said, e-mails usually provide a much better identifier than name (especially if the website verifies the comments through by sending-out e-mails).
Needless to say, in such cases, especially if the website where the comments are being posted is public, using raw data can completely reveal the identity of the user. If e-mails are used for generating the identicons, the situation is even worse, since now those e-mails can be easily harvested for spam purposes. Using the e-mails also provides data mining companies with much more reliable user identifier that can be coupled with information from other websites.
Therefore, it is highly recommended to pass the data to web application that generates the identicons using hex digest only. I.e. never pass the raw data.
Although passing hash instead of real data as part of the GET request is a good step forward, it can still cause problems since the hashses can be collected, and then used in conjunction with rainbow tables to identify the original data. This is particularly problematic when using hex digests of e-mail addresses as data for generating the identicon.
There’s two feasible approaches to resolve this:
- Always apply salt to user-identifiable data before calculating a hex digest. This can hugely reduce the efficiency of brute force attacks based on rainbow tables (althgouh it will not mitigate it completely).
- Instead of hashing the user-identifiable data itself, every time you need to do so, create some random data instead, hash that random data, and store it for future use (cache it), linking it to the original data that it was generated for. This way the hex digest being put as part of an image link into HTML pages is not derived in any way from the original data, and can therefore not be used to reveal what the original data was.
Keep in mind that using identicons will inevitably still allow people to track someone’s posts across your website. Identicons will effectively automatically create pseudonyms for people posting on your website. If that may pose a problem, it might be better not to use identicons at all.
Finally, small summary of the points explained above:
- Always use hex digests in order to retrieve an identicon from a server.
- Instead of using privately identifiable data for generating the hex digest, use randmoly generated data, and associate it with privately identifiable data. This way hex digest cannot be traced back to the original data through brute force or rainbow tables.
- If unwilling to generate and store random data, at least make sure to use salt when hashing privately identifiable data.