authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.
莱安德罗拱形门
验证专家 在工程

Leandro is a data scientist and machine learning developer who creates solutions for companies, 包括Profasee和波士顿咨询集团. 他精通TensorFlow、Spark、Python和R语言.

Read More

以前在

Softtek
Share

My development palate has expanded since I learned to appreciate the sweetness found in Python and R. Data science is an art that can be approached from multiple angles but requires a careful balance of language, libraries, 和专业知识. The expansive capabilities of Python and R provide syntactic sugar: syntax that eases our work and allows us to address complex problems with short, 优雅的解决方案.

这些语言为我们探索解决方案空间提供了独特的方式. 每种语言都有自己的优点和缺点. The trick to using each effectively is recognizing which problem types benefit from each tool and deciding how we want to communicate our findings. 每种语言中的语法糖使我们的工作效率更高.

R和Python作为底层代码之上的交互接口, 允许数据科学家使用他们选择的语言进行数据探索, 可视化, 和建模. This interactivity enables us to avoid the incessant loop of editing and compiling code, 这会使我们的工作不必要地复杂化吗.

These high-level languages allow us to work with minimal friction and do more with less code. Each language’s syntactic sugar enables us to quickly test our ideas in a REPL (read-evaluate-print loop), 可以实时执行代码的交互界面. 这种迭代方法是一个关键组成部分 现代数据处理周期.

R vs. Python:富有表现力和专业化

R和Python的强大之处在于它们的表现力和灵活性. 每种语言都有特定的用例,其中它比另一种语言更强大. 另外, each language solves problems along different vectors and with very different types of output. These styles tend to have different developer communities where one language is preferred. 随着每个社区的有机发展, their preferred language and feature sets trend toward unique syntactic sugar styles that reduce the code volume required to solve problems. And as the community and language mature, the language’s syntactic sugar often gets even sweeter.

尽管每种语言都提供了解决数据问题的强大工具集, we must approach those problems in ways that exploit the particular strengths of the tools. R was born as a statistical computing language and has a wide set of tools available for performing statistical analyses and explaining the data. Python and its machine learning approaches solve similar problems but only those that fit into a machine learning model. 我们可以把统计计算和机器学习看作是 数据建模虽然这些学校是高度相互联系的, 它们的起源和数据建模范式是不同的.

R爱统计

R已经发展成为一个提供统计分析的丰富软件包, 线性建模, 和可视化. 因为这些包几十年来一直是R生态系统的一部分, 他们很成熟, efficient, 并且有充分的证据. When a problem calls for a statistical computing approach, R is the right tool for the job.

R受到社区喜爱的主要原因可以归结为:

  • 离散数据操作,计算和过滤方法.
  • 灵活的链接操作符连接这些方法.
  • A succinct syntactic sugar that allows developers to solve complex problems using comfortable statistical 和可视化 methods.

一个带R的简单线性模型

To see just how succinct R can be, let’s create an example that predicts diamond prices. 首先,我们需要数据. 我们将使用 diamonds default dataset, which is installed with R and contains attributes such as color and cut.

我们还将演示R的管道操作符(%>%),相当于Unix命令行管道(|) operator. 这个流行的R语法糖特性是由 Tidyverse包套件. This operator and the resulting code style is a game changer in R because it allows for the chaining of R verbs (i.e.(R函数)来划分和解决一系列问题.

The following code loads the required libraries, processes our data, and generates a linear model:

库(tidyverse)
库(ggplot2)

mode <- function(data) {
  freq <- unique(data)
  freq[which.max(汇总(匹配(数据、频率))))
}

data <- diamonds %>% 
        变异(((.数值),~ replace_na(., median(., na.rm = TRUE)))) %>% 
        变异(((.numeric), scale))  %>%
        突变(在((否定(在哪里.数值)),~ replace_na(.x, mode(.x)))) 

model <- lm(price~., data =数据)

model <- step(model)
总结(模型)
Call:
Lm(公式=价格~克拉+切工+颜色+净度+深度+ 
    表+ x + z, data = data)

Residuals:
    最小1Q中位数3Q最大值 
-5.3588 -0.1485 -0.0460  0.0943  2.6806 

系数:
             估计性病. Error  t value Pr(>|t|)    
(拦截)0.140019   0.002461  -56.892  < 2e-16 ***
克拉1.337607   0.005775  231.630  < 2e-16 ***
cut.L        0.146537   0.005634   26.010  < 2e-16 ***
cut.Q       -0.075753   0.004508  -16.805  < 2e-16 ***
cut.C        0.037210   0.003876    9.601  < 2e-16 ***
削减^ 4 0.005168   0.003101   -1.667  0.09559 .  
color.L     -0.489337   0.004347 -112.572  < 2e-16 ***
color.Q     -0.168463   0.003955  -42.599  < 2e-16 ***
color.C     -0.041429   0.003691  -11.224  < 2e-16 ***
颜色^ 4 0.009574   0.003391    2.824  0.00475 ** 
颜色^ 5 0.024008   0.003202   -7.497 6.64e-14 ***
颜色^ 6 0.012145   0.002911   -4.172 3.02e-05 ***
clarity.L    1.027115   0.007584  135.431  < 2e-16 ***
clarity.Q   -0.482557   0.007075  -68.205  < 2e-16 ***
clarity.C    0.246230   0.006054   40.676  < 2e-16 ***
清晰^ 4 0.091485   0.004834  -18.926  < 2e-16 ***
清晰^ 5 0.058563   0.003948   14.833  < 2e-16 ***
清晰^ 6 0.001722   0.003438    0.501  0.61640    
清晰^ 7 0.022716   0.003034    7.487 7.13e-14 ***
深度0.022984   0.001622  -14.168  < 2e-16 ***
表0.014843   0.001631   -9.103  < 2e-16 ***
x           -0.281282   0.008097  -34.740  < 2e-16 ***
z           -0.008478   0.005872   -1.444  0.14880    
---
Signif. 代码:0 ' ***' 0.001 ‘**' 0.01 ‘*' 0.05 ‘.' 0.1 ‘ ' 1

残差标准误差:0.53917自由度上的2833
乘以r²等于0.调整后的r²为0.9198 
f统计量:2.81e+04 on 22 and 53917 DF,  p-value: < 2.2e-16

R makes this linear equation simple to program and understand with its syntactic sugar. 现在,让我们把注意力转移到Python的王者地位.

Python最适合机器学习

Python是一个强大的, 通用语言, 它的主要用户社区之一专注于机器学习, 利用流行的库 scikit-learn, imbalanced-learn, and Optuna. 许多最有影响力的机器学习工具包,比如 TensorFlow, PyTorch, and Jax,主要是为Python编写的.

Python的语法糖是机器学习专家喜欢的魔力, 包括简洁的数据管道语法, 除了scikit-learn的fit-transform-predict模式:

  1. 转换数据,为模型做好准备.
  2. 构建一个模型(隐式或显式).
  3. 适合模型.
  4. 预测新数据(监督模型)或转换数据(无监督模型).
    • 对于监督模型,计算新数据点的误差度量.

The scikit-learn library encapsulates functionality matching this pattern while simplifying programming for exploration 和可视化. There are also many features corresponding to each step of the machine learning cycle, 提供交叉验证, hyperparameter调优, 和管道.

Diamond机器学习模型

现在我们将关注一个使用Python的简单机器学习示例, 和R没有直接的比较. We’ll use the same dataset and highlight the fit-transform-predict pattern in a very tight piece of code.

Following a machine learning approach, we’ll split the data into training and testing partitions. We’ll apply the same transformations on each partition and chain the contained operations with a pipeline. The methods (fit and score) are key examples of powerful machine learning methods contained in scikit-learn:

导入numpy为np
以pd方式导入熊猫
从sklearn.linear_model LinearRegression
从sklearn.Model_selection导入train_test_split
从sklearn.预处理导入StandardScaler
从sklearn.预处理导入OneHotEncoder
从sklearn.导入SimpleImputer
从sklearn.管道进口
从sklearn.组合导入ColumnTransformer
从熊猫.api.类型导入is_numeric_dtype

钻石= SNS.load_dataset(钻石)
钻石=钻石.dropna()

X_train,x_test,y_train,y_test = train_test_split.Drop ("price", axis=1), diamonds["price"], test_size=0.2, random_state = 0)

Num_idx = x_train.Apply (lambda x: is_numeric_dtype(x)).values
Num_cols = x_train.列(num_idx).values
Cat_cols = x_train.列(~ num_idx).values

num_pipeline = Pipeline(steps=[("imputer")), SimpleImputer(策略=“中值”)), ("scaler", StandardScaler ())))
cat_steps = Pipeline(steps=[("imputer")), SimpleImputer(策略=“常数”, fill_value = "失踪")), ("onehot", OneHotEncoder(下降=“第一”, 稀疏= False))))

#数据转换和模型构造器
preprocessor = ColumnTransformer(transformers=[("num")), num_pipeline, num_cols), ("cat", cat_steps, cat_cols)))

mod = Pipeline(steps=[("preprocessor", preprocessor), ("linear", LinearRegression())])

# .符合()调用 .依次为Fit_transform ()
mod.fit (x_train y_train)

# .预测()调用 .依次变换()
mod.预测(x_test)

打印(f"R的平方得分:{mod.分数(x_test y_test):.3f}")

我们可以看到Python中的机器学习过程是多么的流畅. 此外,Python的 sklearn classes help developers avoid leaks and problems related to passing data through our model while also generating structured and production-level code.

R和Python还能做什么?

除了解决统计应用和创建机器学习模型, R和Python擅长报告, APIs, 交互式仪表板, 以及外部低级代码库的简单包含.

开发人员可以用R和Python生成交互式报表, 但是用R开发它们要简单得多. R还支持将这些报表导出为PDF和HTML.

这两种语言都允许数据科学家创建交互式数据应用程序. R和Python使用这些库 Shiny and Streamlit,分别创建这些应用程序.

最后,R和Python都支持到底层代码的外部绑定. This is typically used to inject highly performant operations into a library and then call those functions from within the language of choice. R uses the Rcpp 包,而Python使用 pybind11 包来完成这个任务.

Python和R: Getting sweet Every Day

在我作为数据科学家的工作中,我经常使用R和Python. The key is to understand where each language is strongest and then adjust a problem to fit within an elegantly coded solution.

与客户沟通时, 数据科学家希望用最容易理解的语言来做这件事. Therefore, we must weigh whether a statistical or machine learning presentation is more effective and then use the most suitable programming language.

Python和R都提供了不断增长的语法糖集合, which both simplify our work as data scientists and ease its comprehensibility to others. The more refined our syntax, the easier it is to automate and interact with our preferred languages. I like my data science language sweet, and the 优雅的解决方案 that result are even sweeter.

了解基本知识

  • R比Python好吗?

    R is better than Python for statistical analysis, both in terms of code and presentation.

  • R比Python更容易使用吗?

    在研究集中于统计分析的问题时,R更容易使用, 线性建模, 或可视化.

  • R最常用的是什么?

    R is most used for exploring a solution space using statistical methods and linear models.

  • 我应该使用R还是Python进行机器学习?

    Python在实现机器学习解决方案时非常出色. Having access to many advanced libraries and efficient syntactic sugar eases coding and reduces development time.

  • Python的三个好处是什么?

    Python是一种流行的通用编程语言. 它易于学习,灵活,并且有强大的库支持.

聘请Toptal这方面的专家.
Hire Now
莱安德罗拱形门

莱安德罗拱形门

验证专家 在工程

布宜诺斯艾利斯,阿根廷

自2021年9月15日起成为会员

作者简介

Leandro is a data scientist and machine learning developer who creates solutions for companies, 包括Profasee和波士顿咨询集团. 他精通TensorFlow、Spark、Python和R语言.

Read More
authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

以前在

Softtek

世界级的文章,每周发一次.

订阅意味着同意我们的 隐私政策

世界级的文章,每周发一次.

订阅意味着同意我们的 隐私政策

Toptal开发者

加入总冠军® community.