I'm an experienced Analytics/Data Engineer specializing in Python (OOP and data packages such as Pandas, NumPy, and Scikit-learn), Spark, and other big data tools. I'm currently a Data Engineering fellow at Insight Data Science, and I pursued an MS in Urban Informatics (Applied Data Science) at New York University, where I studied Deep Learning under Yann LeCun as well as Natural Language Processing/Understanding.
● Developed Tw0rdz, a dashboard for identifying user grouping patterns on Twitch based on users' distinctive vocabulary and slang.
● Scraped Twitch video metadata and chat-log JSON files, ingested them with Kafka, and stored them as Parquet in an S3-backed Delta Lake.
● Extracted slang terms against a knowledge base using Spark, and tuned the Spark cluster's memory (GC and off-heap), RPC, shuffle, and other resource-allocation parameters to resolve Spark job failures.
● Applied TF-IDF to remove mentions, and Jaccard similarity and Levenshtein distance to group similar words.
● Developed a visualization data mart on Redshift and automated data-infrastructure processes with Airflow.
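A minimal sketch of how Jaccard similarity on character n-grams and Levenshtein distance can be combined to group similar slang terms, as described above. The thresholds and the greedy grouping strategy are illustrative assumptions, not the project's actual pipeline:

```python
def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity on character n-grams of two words."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def group_similar(words, jac_thresh=0.5, lev_thresh=2):
    """Greedy grouping: a word joins the first group whose
    representative is close by either metric."""
    groups = []
    for w in words:
        for g in groups:
            rep = g[0]
            if jaccard(w, rep) >= jac_thresh or levenshtein(w, rep) <= lev_thresh:
                g.append(w)
                break
        else:
            groups.append([w])
    return groups
```

At scale, pairwise comparison of every word is quadratic, which is one reason a distributed engine like Spark and careful cluster tuning would matter for chat-log volumes.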
● Led the automation of the analytics and operations teams' processes using PowerShell and VBS, cutting turnaround time by 80%, reducing headcount requirements by 2-3 members, and saving at least PHP 2 million.
● Improved a screening program with n-gram similarity and fuzzy matching, raising accuracy from 91% to 95% and cutting runtime from 3 hours to 30 minutes through in-memory processing and data storage in the HDF5 file format.
● Refactored the analytics team's ETL jobs for data mart dimension tables to unify independent data quality practices.
● Collaborated with credit risk, fraud, marketing, product, and operations teams to understand their analytics requirements, and developed periodic and ad-hoc statistics, dashboards, reports, and other data products.
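A hedged sketch of the n-gram-plus-fuzzy-matching pattern the screening bullet describes: a cheap trigram-overlap filter prunes candidates before the more expensive fuzzy ratio runs. The thresholds, `difflib` as the fuzzy matcher, and the watchlist shape are assumptions for illustration, not the original implementation:

```python
from difflib import SequenceMatcher

def ngrams(text: str, n: int = 3) -> set:
    """Character trigrams of a normalized string."""
    text = text.lower().strip()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def screen(candidate: str, watchlist, ngram_thresh=0.3, fuzzy_thresh=0.85):
    """Two-stage match: prune on trigram overlap first, then score
    the survivors with SequenceMatcher's similarity ratio."""
    cand_grams = ngrams(candidate)
    hits = []
    for entry in watchlist:
        overlap = ngrams(entry) & cand_grams
        if not cand_grams or len(overlap) / len(cand_grams) < ngram_thresh:
            continue  # too few shared trigrams; skip the expensive step
        score = SequenceMatcher(None, candidate.lower(), entry.lower()).ratio()
        if score >= fuzzy_thresh:
            hits.append((entry, round(score, 3)))
    return hits
```

Pruning this way is what makes the speedup plausible: most watchlist entries share almost no trigrams with a given name, so the quadratic-time fuzzy ratio runs on only a small fraction of pairs.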
● Led consulting teams of 2-4 technical consultants on server migrations and an analytics system enhancement project; collaborated with client teams on project delivery, requirements, and scope; and provided technical implementation strategy.
● Optimized a telecommunications company's Enterprise Data Warehouse by refactoring the Hive codebase and Hadoop system parameters, speeding up job runtimes from 4-8 hours to 20-30 minutes and reducing the number of ETL jobs.
● Contributed to five SAS Visual Analytics proof-of-concept projects, from developing ingestion pipelines from customer source systems to HDFS, designing visualization data marts, and creating client-specific reports and dashboards, to delivering client presentations and advising on product best practices, all of which led to product adoption.
● Performed QA and refactoring of a bank's credit-risk data warehouse and ETL pipelines in response to regulatory changes.
● Conducted training on SAS programming (Prog 1 & 2, Proc SQL, Macro 1, 2 & 3) for clients and onboarding consultants.