Gabriel Machado

My personal portfolio

Welcome to Gabriel-Machado-GM.github.io, my personal portfolio

Gabriel Machado - Data Scientist

Skills: Python, SQL, R, Excel, AWS, Big Data Tools (Google Data Studio/Looker, Zoho Analytics, Tableau, Power BI)

Contact: (011) 95459-2057 | gabriel.machado.gmbr@gmail.com | LinkedIn | GitHub/Portfolio

Objective

Seeking opportunities in Data Science with a focus on Extract, Transform, Load (ETL), Data Analysis, Retrieval Augmented Generation (RAG), Business Intelligence, Machine Learning, and architecting AWS solutions.

Education

Relevant Knowledge & Skills

Work Experience

AI Engineer @ G.A.S. Global Actuarial Solution (2023 - Present)

Business Intelligence Specialist @ Farol Do Futuro (2020 - 2023)

Apprentice Electrical Technician @ Enel Brasil S.A. (2019)

Languages

Volunteer Work

ACM - Alphaville (2016 - 2017)

Projects

Data-Driven EEG Band Discovery with Decision Trees

Publication

Developed an objective strategy for discovering optimal EEG bands based on signal power spectra using Python. This data-driven approach led to a better characterization of the underlying power spectrum by identifying bands that outperformed the more commonly used band boundaries by a factor of two. The proposed method provides a fully automated and flexible way to capture key signal components and possibly discover new indices of brain activity. A brief illustrative sketch of the band-discovery idea follows the figure below.

EEG Band Discovery
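
A minimal sketch of the core idea, assuming scikit-learn and a precomputed power spectrum: fit a shallow regression tree with frequency as the only feature and read its split thresholds as data-driven band boundaries. The variable names, frequency grid, and placeholder spectrum are illustrative, not taken from the publication.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative inputs: a frequency grid (Hz) and a placeholder power spectral
# density; real data would come from e.g. Welch's method on EEG recordings.
freqs = np.arange(0.5, 40.5, 0.5)
psd = np.exp(-freqs / 10) + 0.05 * np.random.default_rng(0).random(freqs.size)

# A shallow tree partitions the spectrum into contiguous, near-homogeneous
# power regions; each split threshold can be read as a band boundary.
tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0)
tree.fit(freqs.reshape(-1, 1), psd)

# Internal nodes carry positive thresholds; leaf nodes are marked with -2.
band_edges = np.sort(tree.tree_.threshold[tree.tree_.threshold > 0])
print("Discovered band boundaries (Hz):", np.round(band_edges, 2))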

Decoding Physical and Cognitive Impacts of Particulate Matter Concentrations at Ultra-Fine Scales

Publication

Used Matlab to train over 100 machine learning models that estimated particulate matter concentrations from a suite of over 300 biometric variables. We found that biometric variables can be used to estimate particulate matter concentrations at ultra-fine spatial scales with high fidelity (R² = 0.91), and that smaller particles are better estimated than larger ones. Inferring environmental conditions solely from biometric measurements allows us to disentangle key interactions between the environment and the body. A short illustrative sketch of this estimation setup follows the figure below.

Bike Study
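
A minimal sketch of this estimation setup, using scikit-learn in place of the Matlab tooling used in the study; the synthetic feature matrix and target below are placeholders, not the study's data.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder data: rows are time-aligned samples, columns stand in for the
# 300+ biometric variables (EEG bands, heart rate, skin temperature, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.3, size=5000)  # proxy PM target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# In practice one regressor per particle size class would be trained;
# a single random forest is shown here for brevity.
model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print("Held-out R^2:", round(r2_score(y_test, model.predict(X_test)), 2))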

Talks & Lectures

Publications

  1. Talebi, S.; Lary, D.J.; Wijeratne, L.O.H.; Lary, T. Modeling Autonomic Pupillary Responses from External Stimuli Using Machine Learning (2019). DOI: 10.26717/BJSTR.2019.20.003446
  2. Wijeratne, L.O.; Kiv, D.R.; Aker, A.R.; Talebi, S.; Lary, D.J. Using Machine Learning for the Calibration of Airborne Particulate Sensors. Sensors 2020, 20, 99.
  3. Lary, D.J.; Schaefer, D.; Waczak, J.; Aker, A.; Barbosa, A.; Wijeratne, L.O.H.; Talebi, S.; Fernando, B.; Sadler, J.; Lary, T.; Lary, M.D. Autonomous Learning of New Environments with a Robotic Team Employing Hyper-Spectral Remote Sensing, Comprehensive In-Situ Sensing and Machine Learning. Sensors 2021, 21, 2240. https://doi.org/10.3390/s21062240
  4. Zhang, Y.; Wijeratne, L.O.H.; Talebi, S.; Lary, D.J. Machine Learning for Light Sensor Calibration. Sensors 2021, 21, 6259. https://doi.org/10.3390/s21186259
  5. Talebi, S.; Waczak, J.; Fernando, B.; Sridhar, A.; Lary, D.J. Data-Driven EEG Band Discovery with Decision Trees. Preprints 2022, 2022030145 (doi: 10.20944/preprints202203.0145.v1).
  6. Fernando, B.A.; Sridhar, A.; Talebi, S.; Waczak, J.; Lary, D.J. Unsupervised Blink Detection Using Eye Aspect Ratio Values. Preprints 2022, 2022030200 (doi: 10.20944/preprints202203.0200.v1).
  7. Talebi, S. et al. Decoding Physical and Cognitive Impacts of PM Concentrations at Ultra-fine Scales, 29 March 2022, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-1499191/v1]
  8. Lary, D.J. et al. (2022). Machine Learning, Big Data, and Spatial Tools: A Combination to Reveal Complex Facts That Impact Environmental Health. In: Faruque, F.S. (eds) Geospatial Technology for Human Well-Being and Health. Springer, Cham. https://doi.org/10.1007/978-3-030-71377-5_12
  9. Wijeratne, L.O.H. et al. (2022). Advancement in Airborne Particulate Estimation Using Machine Learning. In: Faruque, F.S. (eds) Geospatial Technology for Human Well-Being and Health. Springer, Cham. https://doi.org/10.1007/978-3-030-71377-5_13
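
Featured Code: Contract Lineage Unification with PySpark, GraphFrames, and UUID v7

The script below unifies contract ID lineages at scale: it isolates the (original ID, tracking ID) key pairs, resolves them into connected components with GraphFrames, anchors each component to its earliest opening date, derives a time-ordered UUID v7 from that date using native Spark SQL expressions, and writes the result to a Delta table clustered by the new identifier.
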
from pyspark.sql import functions as F
from pyspark.sql import Window
from graphframes import GraphFrame

# --- TUNING SETTINGS (billion-row workload) ---
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "2000")  # Calibrated for ~96 cores (6 workers x 16)
spark.conf.set("spark.graphframes.useLocalCheckpoints", "true")
spark.sparkContext.setCheckpointDir("/dbfs/checkpoints/id_resolution_v3")

def generate_native_uuid_v7(timestamp_col):
    """
    Generates a UUID v7 using native Spark SQL functions for maximum performance.
    Layout: [48 bits timestamp] + [4 bits version 7] + [12 bits rand A] +
            [2 bits variant 10] + [62 bits rand B]
    """
    # Convert the date to a Unix timestamp in milliseconds (48 bits)
    ts_ms = (F.unix_timestamp(timestamp_col) * 1000).cast("long")

    # Generate the random parts (rand() returns a double, cast to long)
    rand_a = (F.rand() * 4095).cast("long")                 # 12 bits
    rand_b = (F.rand() * 4611686018427387903).cast("long")  # 62 bits

    # Hexadecimal composition of the UUID string (8-4-4-4-12).
    # hex() drops leading zeros, so each group is left-padded to its width.
    # Part 1: top 32 bits of the timestamp; Part 2: lowest 16 bits
    part_1 = F.lpad(F.hex(F.shiftRight(ts_ms, 16)), 8, "0")
    part_2 = F.lpad(F.hex(ts_ms.bitwiseAND(F.lit(0xFFFF))), 4, "0")

    # Part 3: version 7 + rand A (4 bits '0111' + 12 random bits)
    part_3 = F.lpad(F.hex(rand_a.bitwiseOR(F.lit(0x7000))), 4, "0")

    # Part 4: variant + rand B (2 bits '10' + 62 random bits).
    # Setting the sign bit of the long (-0x8000000000000000) yields the
    # RFC 4122/9562 variant (10xx); hex() of a negative long is 16 chars.
    part_4_hex = F.hex(rand_b.bitwiseOR(F.lit(-0x8000000000000000)))

    return F.lower(F.concat(
        part_1, F.lit("-"),
        part_2, F.lit("-"),
        part_3, F.lit("-"),
        F.substring(part_4_hex, 1, 4), F.lit("-"),
        F.substring(part_4_hex, 5, 12)
    ))
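
# Illustrative sanity check on a tiny, hypothetical DataFrame (not part of the
# production pipeline): UUIDs generated from earlier dates should sort first
# and follow the 8-4-4-4-12 layout.
demo = spark.createDataFrame([("2021-03-15",), ("2024-07-01",)], ["opening_date"]) \
            .withColumn("opening_date", F.to_timestamp("opening_date"))
demo.withColumn("uuid_v7", generate_native_uuid_v7(F.col("opening_date"))) \
    .orderBy("uuid_v7").show(truncate=False)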

def optimized_pipeline(df_huge):
    # 1. Metadata isolation (reduces shuffle cost):
    # keep only the columns strictly needed for the graph
    df_keys = df_huge.select("contract_id_orig", "contract_id_rastreio", "opening_date").distinct()

    # 2. Graph resolution (connected components)
    edges = df_keys.filter("contract_id_orig != contract_id_rastreio") \
                   .select(F.col("contract_id_rastreio").alias("src"),
                           F.col("contract_id_orig").alias("dst")).distinct()

    vertices = df_keys.select(F.col("contract_id_orig").alias("id")).union(
               df_keys.select(F.col("contract_id_rastreio").alias("id"))).distinct()

    g = GraphFrame(vertices, edges)
    # Optimization: the CC algorithm returns a stable component ID (long)
    mapping = g.connectedComponents()

    # 3. Temporal aggregation (find the chronological "root" of each lineage):
    # the earliest opening_date anchors the UUID v7
    root_mapping = df_keys.join(mapping, df_keys.contract_id_orig == mapping.id) \
                          .groupBy("component") \
                          .agg(F.min("opening_date").alias("root_opening_date"))

    # 4. UUID v7 generation via native Spark SQL
    uuid_mapping = root_mapping.withColumn("global_uuid_v7", generate_native_uuid_v7("root_opening_date")) \
                               .select("component", "global_uuid_v7")

    # 5. Final mapping (original ID -> global UUID)
    final_id_map = mapping.join(F.broadcast(uuid_mapping), "component") \
                          .select(F.col("id").alias("orig_id_ref"), "global_uuid_v7")

    return final_id_map

# Execution: mapping_table holds the relation between every ID in a lineage and its UUID v7
mapping_table = optimized_pipeline(df_huge)

# 6. Final massive join
# Liquid Clustering is applied on write to optimize future ML reads
df_final = df_huge.join(F.broadcast(mapping_table), df_huge.contract_id_orig == mapping_table.orig_id_ref, "left") \
                  .drop("orig_id_ref")

df_final.write.format("delta") \
        .mode("overwrite") \
        .clusterBy("global_uuid_v7") \
        .saveAsTable("gold_contracts_unified")
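
# 7. Post-write validation (illustrative, assumed check, not part of the original
# pipeline): every assigned UUID should match the 8-4-4-4-12 UUID v7 layout
# (version nibble 7, variant nibble 8-b).
validated = spark.table("gold_contracts_unified").filter(F.col("global_uuid_v7").isNotNull())
bad_format = validated.filter(~F.col("global_uuid_v7").rlike(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-7[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"))
print("Rows with malformed UUID v7:", bad_format.count())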