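Project description
PyTorch Datasets (pt-datasets). I ran into this package while deploying the ag-news-ae-clustering project, and installing pt-datasets with pip turned out to be a struggle, so I am recording the process here. For context, that project is built on autoencoders. An autoencoder is a neural network that aims to reconstruct its input, and it does so by learning the most salient features of the data. These salient features are encoded in the latent space, a feature representation with lower dimensionality than the original feature space. The latent code of an autoencoder can be used for downstream tasks such as classification, regression, and clustering. In this simple example, the latent code representation of text is clustered with the k-Means algorithm. The aim of that work is not state-of-the-art performance but to show that latent codes from an autoencoder can serve downstream tasks, just as features from principal component analysis, linear discriminant analysis, or locally linear embedding do.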
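To make the downstream step concrete, here is a minimal sketch of k-Means in plain Python. It only illustrates the clustering part: the 2-D vectors are made-up stand-ins for latent codes, no autoencoder is trained, and the helper names (`kmeans`, `squared_distance`) are hypothetical and not taken from the ag-news-ae-clustering project.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # initialise the centroids with k distinct input points
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignments, centroids


# made-up 2-D "latent codes": two well-separated groups
latent_codes = [
    (0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]
assignments, centroids = kmeans(latent_codes, k=2)
print(assignments)
print(centroids)
```

In a real pipeline the input would be the encoder output for each document rather than hand-written tuples, and a library implementation of k-Means would normally replace this loop.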
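Overview
This repository is meant to give you easier and faster access to commonly used benchmark datasets. With it, you can load a dataset into a PyTorch model in a ready-to-use fashion. It can also load low-dimensional features of these datasets, encoded with PCA, t-SNE, or UMAP.
Datasets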
- MNIST
- Fashion-MNIST
- EMNIST-Balanced
- CIFAR10
- SVHN
- MalImg
- AG News
- IMDB
- Yelp
- 20 Newsgroups
- KMNIST
- Wisconsin Diagnostic Breast Cancer
Usage
It is recommended to use a virtual environment to isolate the project dependencies.
$ virtualenv env --python=python3  # we use python 3
$ pip install pt-datasets  # install the package
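Since the later steps depend on the package landing in the right environment, a quick check that the interpreter is really running inside a virtual environment can save some confusion. The helper below is a small sketch using only the standard library; the function name `in_virtual_environment` is made up for illustration.

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter runs inside a venv/virtualenv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base = getattr(sys, "real_prefix", None) or getattr(
        sys, "base_prefix", sys.prefix
    )
    return base != sys.prefix


print("virtual environment active:", in_virtual_environment())
print("packages will be installed under:", sys.prefix)
```

If this prints `False`, `pip install` is targeting the system interpreter rather than the isolated environment.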
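We can then use the package to load a ready-made data loader:
from pt_datasets import load_dataset, create_dataloader
# load the training and test data
train_data, test_data = load_dataset(name="cifar10")
# create a data loader for the training data
train_loader = create_dataloader(
    dataset=train_data, batch_size=64, shuffle=True, num_workers=1
)
# use the data loader for training
# (`model` is assumed to be defined elsewhere and to expose a `fit` method)
model.fit(train_loader, epochs=10)
We can also encode the dataset features into a lower-dimensional space: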
import seaborn as sns
import matplotlib.pyplot as plt
from pt_datasets import load_dataset, encode_features
# load the training and test data
train_data, test_data = load_dataset(name="fashion_mnist")
# get the numpy array of the features
# the encoders can only accept np.ndarray types
train_features = train_data.data.numpy()
# flatten the tensors
train_features = train_features.reshape(
    train_features.shape[0], -1
)
# get the labels
train_labels = train_data.targets.numpy()
# get the class names
classes = train_data.classes
# encode training features using t-SNE
encoded_train_features = encode_features(
    features=train_features,
    seed=1024,
    encoder="tsne"
)
# use seaborn styling
sns.set_style("darkgrid")
# scatter plot each feature w.r.t class
for index in range(len(classes)):
    plt.scatter(
        encoded_train_features[train_labels == index, 0],
        encoded_train_features[train_labels == index, 1],
        label=classes[index],
        edgecolors="black"
    )
plt.legend(loc="upper center", title="Fashion-MNIST classes", ncol=5)
plt.show()
Running pip install pt-datasets fails with: error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [47 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-37
creating build\lib.win-amd64-cpython-37\MulticoreTSNE
copying MulticoreTSNE\__init__.py -> build\lib.win-amd64-cpython-37\MulticoreTSNE
creating build\lib.win-amd64-cpython-37\MulticoreTSNE\tests
copying MulticoreTSNE\tests\test_base.py -> build\lib.win-amd64-cpython-37\MulticoreTSNE\tests
copying MulticoreTSNE\tests\__init__.py -> build\lib.win-amd64-cpython-37\MulticoreTSNE\tests
running egg_info
writing MulticoreTSNE.egg-info\PKG-INFO
writing dependency_links to MulticoreTSNE.egg-info\dependency_links.txt
writing requirements to MulticoreTSNE.egg-info\requires.txt
writing top-level names to MulticoreTSNE.egg-info\top_level.txt
reading manifest file 'MulticoreTSNE.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE.txt'
writing manifest file 'MulticoreTSNE.egg-info\SOURCES.txt'
running build_ext
cmake version 3.18.4
CMake suite maintained and supported by Kitware (kitware.com/cmake).
-- Building for: NMake Makefiles
-- The CXX compiler identification is unknown
CMake Error at CMakeLists.txt:1 (PROJECT):
The CMAKE_CXX_COMPILER:
cl
is not a full path and was not found in the PATH.
To use the NMake generator with Visual C++, cmake must be run from a shell
that can use the compiler cl from the command line. This environment is
unable to invoke the cl compiler. To fix this problem, run cmake from the
Visual Studio Command Prompt (vcvarsall.bat).
Tell CMake where to find the compiler by setting either the environment
variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.
-- Configuring incomplete, errors occurred!
See also "C:/Users/Administrator/AppData/Local/Temp/pip-install-1wqz9ixa/multicoretsne_e640b7e6100b44f487a8475a7e691c2e/build/temp.win-amd64-cpython-37/Release/CMakeFiles/
CMakeOutput.log".
See also "C:/Users/Administrator/AppData/Local/Temp/pip-install-1wqz9ixa/multicoretsne_e640b7e6100b44f487a8475a7e691c2e/build/temp.win-amd64-cpython-37/Release/CMakeFiles/
CMakeError.log".
ERROR: Cannot generate Makefile. See above errors.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for MulticoreTSNE
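### Solution:
This kind of problem is common on Windows when installing Python packages that need C++ compilation. A step-by-step fix follows.
Cause
The error shows that MulticoreTSNE needs a C++ compiler (Visual Studio's cl.exe), but the environment is not configured to provide one. Building the package depends on CMake and the Visual Studio build tools.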
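The diagnosis can be confirmed before retrying the install by asking Python which build tools are visible on the current PATH. This is a minimal sketch using `shutil.which` from the standard library. The tool names come from the log above (`cl`, `cmake`, and `nmake` for the NMake generator), and the helper name `find_build_tools` is made up for illustration.

```python
import shutil


def find_build_tools(names=("cl", "cmake", "nmake")):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_build_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

Run from an ordinary command prompt, `cl` will typically be reported as missing, which matches the CMake message. The same script run from a shell that has the Visual Studio environment loaded should resolve it to a full path.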
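Fix
1. Install the Visual Studio build tools
- Download and install Microsoft Visual Studio Build Tools.
- During installation, select:
  - C++ build tools
  - Windows 10 SDK (choose according to your system version)
  - MSVC v142 – VS 2019 C++ build tools (or the latest version)
2. Use the Visual Studio command prompt
- Search for and open "Developer Command Prompt for VS" (make sure it runs with administrator privileges).
- In that command prompt, run:
pip install pt-datasets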
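Error: ERROR: Could not build wheels for MulticoreTSNE, which is required to install
Solution:
Go to the official PyPI website.
Search for MulticoreTSNE.
Download the package and extract it.
Put the extracted files into the virtual environment. At this point the install error no longer appears, so check whether the package actually runs.
If running it produces "Cannot find/open tsne_multicore shared library", this approach has failed.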
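That last message suggests the Python files were copied but the compiled library was never built. One way to check this without triggering the error is to look for a compiled library file next to the package. The sketch below rests on an assumption: that the library sits in the package directory under a name containing "tsne" with a `.so`, `.dll`, or `.dylib` extension. Both helper names are hypothetical.

```python
import glob
import importlib.util
import os


def find_tsne_libraries(package_dir):
    """List files in package_dir that look like a compiled tsne library."""
    # Assumed naming: something containing "tsne" with a shared-library suffix.
    patterns = ("*tsne*.so", "*tsne*.dll", "*tsne*.dylib")
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(package_dir, pattern)))
    return sorted(found)


def multicore_tsne_dir():
    """Return the MulticoreTSNE package directory, or None if not installed."""
    # find_spec locates the package without importing it
    spec = importlib.util.find_spec("MulticoreTSNE")
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


package_dir = multicore_tsne_dir()
if package_dir is None:
    print("MulticoreTSNE is not installed in this environment")
else:
    libraries = find_tsne_libraries(package_dir)
    print("package directory:", package_dir)
    print("compiled libraries:", libraries or "none found")
```

If the package directory exists but no library is listed, copying the source archive into the environment only silenced the install error. The C++ part still has to be compiled, which leads back to the build-tools fix above.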
Reference: https://blog.csdn.net/m0_45924886/article/details/133122422