Jonathan Levin's Axioms

Ideas, Rants, Theories vs Getting Things Done

1:54 PM

Yet Another Idea for GSoC - New Datatypes

Posted by Jonathan |

Well, as most of you may know, Google Summer of Code is nearing and I have seen many
MySQL blogs urging people to contribute ideas or mentor projects.


My Idea
So I'm going to suggest an idea and as the title already mentions, I think it would be a good idea to add some new Datatypes. Specifically, a new Email datatype.

Now I already hear you saying "Email is a string, why would you need a datatype for it?", but hear me out.

So basically, I personally, have been dealing a lot with trying to clean email addresses and customer data so that my company can help their biggest client use this data in their new CRM system.

I thought it would have been a good idea to implement an email datatype inside MySQL that already has the standard regex for emails and has a defined varchar for the standard of emails.
Later, I became aware that there can be different types of emails addresses.
For example, you can have the standard latin a-z 0-9 with . and _ in them.
You can have an email with the european a u o with the 2 dots above them.
You can also have emails with characters like / in them for example 24/7@domain.com (someone that posted on my blog mentioned that one).

To solve this, you could have an Email datatype like this:
Email(latin)
Email(german)
Email(utf-8)


The Benefits
So what would be the benefits of using this kind of datatype?

  • Its out of the box solution for a common problem.
    • MySQL is used by many many websites and most of them record email addresses, so its relevant to them.
    • You wouldn't need to develop something yourself, go look at MySQL forge for snippets, download and use any libarys, even bother to read about it. Its just there provided for you in the database.
  • It helps improve data quality. After you have many tens of thousands of emails, it would have been nice to have more data quality. Thats not to say that you can develop some solution on the front end with Javascript for example, but it would still be nice to just have it in the database.
  • You wouldn't need to write webinars telling people how to optimize or shrink their email field in the database. Maybe even you can store it better on the harddisk for queries if you know it will be used for email address, who knows.

Downside
The downside would be, that you will get a database error if you didn't implement any checks when the user is entering their email address on their website. For this you might want to make your email field a string and worry about it later. But I assure you, later will be for more annoying.


More Options
Now, you can also say "Well, you can also have a datatype for phone". Well, you can.
Maybe you can have a phone datatype like this:
phone(US)
phone(world)

Depends what you want, but I'd like to start with email addresses. Maybe you can throw in there ip(raw), ip(full)...
I saw a webiner from Jay Pipes writing about that to improve query lookup speed specifically for ip addresses.


Conclusion
I think it will be useful for many MySQL users to have an out-of-the-box solution for a very generic problem in the form of new datatypes that come with their own validation rules.

Please write your thoughts in the comments. Your opinion is valued.


Thank you for reading my blog.

If you are part of a big company and would like to recommend to them a list of open-source and cheap commercial alternatives, I have compiled a list that I use for my company.

These options might make your company more competitive in certain markets and might let you react faster to changes.


Here is the list:

Database - DataWarehouse
InfoBright - Commercial DataWarehouse Engine (highly recommended) that runs inside MySQL. review about it

EnterpriseDB - Commercial Enterprise Scale Database that runs inside PostgreSQL

Vertica - A very high-scale, commercial, analytical and column-based Database. (Was made by the man who invented relational databases in 1970s)


BI Suite
Pentaho - Commercial Open-Source BI suite. Has ETL tools and Reporting (like informatica and cognos)


CRM
SugarCRM - Open Source CRM and on-demand CRM

XRMS - Web-based Open Source CRM

OpenCRX - Enterprise Open Source CRM


ESB
Mule - Commercial Open Source Enterprise Service Bus. Connects between many different webservices. (highly recommended)


ERP
OpenBravo - Web-based Open Source ERP


ERP + CRM
OFBiz - Apache "Open-For-Business" Open Source ERP and CRM

Compiere - Open Source ERP and CRM



I hope it helps.

Thank you for reading my blog.

12:05 PM

Pentaho Kettle vs Informatica PowerCenter

Posted by Jonathan |

After having a successful presentation to a major car manufacturer in China, I have decided to upload my presentation so that other people may benefit and recommend it to their bosses.

In the presentation I research Talend, Pentaho, Informatica and Inaport(CRM data importer).

So the presentation is biased to Pentaho (and the Use Cases inside the presentation are from the Pentaho website).

My Presentation

Please enjoy and distribute.

10:55 PM

Open Source ETL tools vs Commerical ETL tools

Posted by Jonathan |

Recently I have been asked by my company to make a case for open-source ETL-data integration tools as an alternative for the commercial data integration tool, Informatica PowerCenter.
So I did a lot of research and I'm going to try my best, considering I have never used the open-source tools nor the commercial one.

I found plenty of information about comparisons between Pentaho Kettle and Talend, which were 2 of the open-source tools I was supposed to research.
Now, without getting in a big arguement (or matt casters posting on my blog), I'd like to attempt to compare the two, very briefly.
And again, this is ONLY from the research I did online and not based on my experience using the tools (since I dont really have any).


Pentaho Kettle vs Talend


Pentaho
Pentaho is a commerical open-source BI suite that has a product called Kettle for data integration.
It uses an innovative meta-driven approach and has a strong and very easy-to-use GUI.
The company started around 2001 (2002 was when kettle was integrated into it).
It has a strong community of 13,500 registered users.
It has a stand-alone java engine that process the jobs and tasks for moving data between many different databases and files.
It can schedule tasks (but you need a schedular for that - cron).
It can run remote jobs on "slave servers" on other machines.
It has data quality features: from its own GUI, writing more customised SQL queries, Javascript and regular expressions.


Talend
Talend is an open-source data integration tool (not a full BI suite).
It uses a code-generating approach. Uses a GUI, but within Eclipse RC.
It started around October 2006
It has a much smaller community then Pentaho but has 2 finance companies supporting it.
It generates java or perl code which you later run on your server.
It can schedule tasks (also with using schedulars like cron).
It has data quality features: from its own GUI, writing more customised SQL queries and Java.


Comparison - (from my understanding)
Pentaho is faster (twice as fast maybe) then Talend.
Pentaho's GUI is easier to use then Talend's GUI and takes less time to learn.


My impression
Pentaho is easier to use because of its GUI.
Talend is more a tool for people who are making already a Java program and want to save lots and lots of time with a tool that generates code for them.



Assuming Pentaho made it to the next round....

Pentaho Kettle vs Informatica

Informatica
Informatica is a very good commercial data integration suite.
It was founded in 1993
It is the market share leader in data integration (Gartner Dataquest)
It has 2600 customers. Of those, there are fortune 100 companies, companies on the Dow Jones and government organization.
The company's sole focus is data integration.
It has quite a big package for enterprises to integrate their systems, cleanse their data and can connect to a vast number of current and legacy systems.
Its very expensive, will require training some of your staff to use it and probably require hiring consultants as well. (I hear Informatica consultants are well paid).
Its very fast and can scale for large systems. It has "Pushdown Optimization" which uses an ELT approach that uses the source database to do the transforming - like Oracle Warehouse Builder.


Comparison
Pentaho's Javascipt is very powerful when writing transformation tasks.
Informatica has many more enterprise features, for example, load balancing between database servers.
Pentaho's GUI requires less training then Informatica.
Penatho doesn't require huge upfront costs as Informatica does. (that part you saw coming, I'm sure)
(edited)Informatica is faster then Pentaho. Infromatica has Pushdown Optimization, but with some tweaking to Pentaho and some knowledge of the source database, you can improve the speed of Pentaho. (also see line below)
(new)You can place Pentaho Kettle on many different servers (as many as you like, its free) and use it as a cluster.
Informatica has much better monitoring tools then Pentaho.


My Impression
Informatica is a really good enterprise ETL suite, but is very big and expensive.
If the system is small enough, I would rather give Pentaho a try and there are many many use cases where big companies used Pentaho (an airport, a hospital..).



Conclusion

I think matt casters said it best when he said:
The flood of open source software is going to wash away the proprietary ones..

If you want to add (or correct) to the information I wrote here, then please consider doing so, as I am still trying to understand these products myself.
Your opinion is valued.



Thank you for reading my blog.

7:09 PM

Computed/Virtual Columns

Posted by Jonathan |

Some days a go I discovered a wonderful thing called computer columns when I stumbling on MS SQL server. There is a free MS SQL server 2005 express that you can download off a Microsoft site. Yes, I know, FREE and from microsoft.

(P.S. for those who didn’t read, Bill Gates is now the 3rd richest man in the world after 13 years of being number one.)

I also read a really good article that explains indexes on computed columns. The benefits of speeding up searches with them and adding business rules. Obviously, the business rules were particularly interesting to me.

Jay Pipes had a similar webinar about the benefits of speeding up searches, but he suggested you do that by creating a new column and placing in it there. His example was, you can speed up a search for “SELECT * FROM Contacts WHERE family_name LIKE ‘%stone’” by making a column that reverses the family name and then you search by “SELECT * FROM Contacts WHERE reverse_family_name LIKE ‘enots%’”. This will save the query to go through all the records to find the results.

So I looked a bit more into it how to developing business rules in computed columns and I found out that I can’t use “If”s but I can use “Case”. I also found out that the values you are looking at must be in the same table and the same row. Otherwise its non-deterministic and cannot work.

I am pretty happy with these computed columns and I consider it to cover most of what I have been looking for. I am also aware that there is some work being done on this to be implemented into MySQL, since I saw some information about it on the MySQL forge.

I would also like to add some business rules in the form of validations for data to keep data quality high. I was thinking something along the lines of creating new data types with some constraints and limits inside them, but that didn’t work out at all in MySQL.

For example, creating an “email” data-type that will have some regular expressions check on it and default size.

I found something else in the land of XML which is kind of what I was looking for, called schematron, but I need to look into it more.

I’m going to learn some more about this when I have time. In the near future I know I have to learn about installing a CRM system that uses MS SQL and I want to see if I can use any computed columns to help with this system. So if I find anything interesting, I'll blog about it.

Thank you for reading my blog

Subscribe